
Topic: Viewing unedited posts and deleted posts, view per post, per user or per topic - page 7. (Read 8820 times)

legendary
Activity: 2422
Merit: 1451
Leading Crypto Sports Betting & Casino Platform
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
I should have guessed this. It must be the easiest way to scrape anyway...

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"
Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway.

The contents of posts that are otherwise invisible because they include a table with broken tags are accessible to any forum member able to quote the post, but invisible to robots. I don't see any reason for a poster to do this to their posts intentionally. If they can edit their threads, the contents could be replaced with something like a dot and be done with it.

But it could be that a few thousand such posts exist. Google returns 3100 results when you search for ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org), some duplicates and some coming from signatures of course.
2550 results if you remove two users that came up with broken signatures ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org -Gamesbuy -trinaldao)

Now, I'm stepping into territory of a sub-case in a sub-case, but if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing? Probably not. But strictly speaking it should be done if you'd want to grab everything that's available.
legendary
Activity: 2380
Merit: 5213
I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.

Yes, that's an error (maybe a bug) from the forum.

That archived post was like the following post.

https://bitcointalksearch.org/topic/bounty-qravity-ico-4337249

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
legendary
Activity: 2422
Merit: 1451
Leading Crypto Sports Betting & Casino Platform
I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
How to use it
  • Find the msgID, userID or topicID you need. Let's use msgID 51902990.
  • Remove the last 4 digits from the msgID to get the directory name (if there are less than 4 digits, use 0): 5190.
  • Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
This is an example of how I use it in practice: I copy the topicID (5229466), then type "topics" on my URL-bar. My browser suggests http://loyce.club/archive/topics/, which I select. Then, I paste the topicID, hit Backspace 4 times, type "/", paste the topicID again, type ".html" and hit enter. It takes some getting used to, but in just 5 seconds I have the page I was looking for: http://loyce.club/archive/topics/522/5229466.html.
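
The steps above can also be sketched in a few lines of Python. This is only an illustration of the URL scheme described in this post, not the tooling used for the archive: the function name `archive_url` is made up, and treating "remove the last 4 digits" as integer division by 10,000 (so that short IDs land in directory 0) is an assumption. The userID view is left out because its path isn't shown here.

```python
BASE_URL = "http://loyce.club/archive"


def archive_url(kind: str, item_id: int) -> str:
    """Build the archive URL for a msgID ("posts") or a topicID ("topics")."""
    if kind not in ("posts", "topics"):
        raise ValueError("kind must be 'posts' or 'topics'")
    if item_id < 0:
        raise ValueError("item_id must not be negative")
    # Dropping the last 4 digits equals integer division by 10,000.
    # Assumption: IDs below 10,000 therefore end up in directory 0.
    directory = item_id // 10_000
    return f"{BASE_URL}/{kind}/{directory}/{item_id}.html"


print(archive_url("posts", 51902990))   # http://loyce.club/archive/posts/5190/51902990.html
print(archive_url("topics", 5229466))   # http://loyce.club/archive/topics/522/5229466.html
```

Both calls reproduce the example URLs given in the post.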
copper member
Activity: 1554
Merit: 489
Stop the war!
I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.

38 million left.
How fast does your parsing work?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
If it is not a secret, how much data space is needed for all those millions of posts?
I'm currently using 54 GB for loyce.club, and store 4.2 million files.

Viewing unedited/deleted posts

How to use it
  • Find the msgID, userID or topicID you need. Let's use msgID 51902990.
  • Remove the last 4 digits from the msgID to get the directory name (if there are less than 4 digits, use 0): 5190.
  • Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.

Details
  • Files are stored with their msgID, userID or topicID as file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML-files. Use CTRL-F to find what you're looking for.
  • I don't scrape hidden boards (such as Investigations).
  • I don't keep post titles
  • I save raw HTML, including quotes
  • If I run out of disk space, I might create compressed archives per 10,000 posts.
  • Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
  • My current (sponsored) webhost has enough storage space for years to come.
  • All scrape-times use Amsterdam time (CET).
  • Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.

Examples
legendary
Activity: 2212
Merit: 7064
If it is not a secret, how much data space is needed for all those millions of posts?
And is there a way to use some compression?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Update: I finally got to work on a topic-view. It'll be published at http://loyce.club/archive/topics/, and it's currently crunching data on 1.9 million posts. I'm downloading the thread titles that I don't have yet, and that part takes a lot of time.

This viewer should make it much easier to find all posts made in a certain thread. Obviously this is also limited to posts made after I started scraping.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Bump!

I now have the first 6.1 million posts, I'm currently still processing them for publication. Will be updated soon.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
LoyceV, are you saving the raw data, or just converting it?  
I'm only saving the raw post (one line from the raw HTML), and my own header like this one (post number, link to post, link to my archive, by username and scraping time).

Yes, I'm all for it!
It's started already :D
legendary
Activity: 3010
Merit: 8114
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet. It would be useful for a case like this. It requires scraping a couple million pages, and storing 50+ million posts. I can limit the number of files on the server by storing 10 or 100 posts per page. Would this be useful?

Yes, I'm all for it!
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
LoyceV, are you saving the raw data, or just converting it? 
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Something failed in the above scrape, the Wall Observer thread stopped scraping after page 2628. I thought I had the "last page detection" working, but there still seems to be a flaw.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet.
An update: I have started this project! Measured in scraping time, it's the biggest project I ever started. In the past 9 days, I've scraped about 4% of all data, so I expect to complete this around August.
There's also a chance I'll run out of disk space because of the millions of large posts made by bounty spammers, but I'll deal with that when it happens.
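
For anyone who wants to redo the estimate: a rough sketch of the arithmetic, assuming the scraping speed stays constant (which it may well not). The function name `estimate_total_days` is made up for this example. The posting date isn't shown here, so the sketch stops at a day count instead of a calendar date.

```python
def estimate_total_days(days_elapsed: int, percent_done: int) -> float:
    """Extrapolate the total scraping time from progress so far (linear assumption)."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return days_elapsed * 100 / percent_done


total_days = estimate_total_days(days_elapsed=9, percent_done=4)
remaining_days = total_days - 9
print(f"total: {total_days:.0f} days, remaining: {remaining_days:.0f} days")
```

With 9 days for about 4%, that comes to roughly 225 days in total, so about 216 days to go.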

Sneak preview: http://loyce.club/archive/oldposts/
How to use:
  • Find the msgID you need. Let's use 28228
  • Remove the last 5 digits from the msgID to get the directory name (if there are less than 5 digits, use 0): 0
  • Replace the last 2 digits of the msgID by xx, and add .html (if there are less than 5 digits, use 0xx): 282xx.html
  • Add "#msg" and the msgID: #msg28228
  • Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
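
The same steps as a short Python sketch. It is not the script that generates the archive; the name `oldposts_url` is made up, and using integer division for both the directory (drop 5 digits) and the page name (drop 2 digits) is an assumption based on the worked example above. In particular, it gives "0xx.html" only for msgIDs below 100, which may differ from what the instructions intend for other short IDs.

```python
OLDPOSTS_URL = "http://loyce.club/archive/oldposts"


def oldposts_url(msg_id: int) -> str:
    """Build the oldposts URL (directory, 100-post page, and anchor) for a msgID."""
    if msg_id < 0:
        raise ValueError("msg_id must not be negative")
    # Directory: the msgID without its last 5 digits (0 if nothing is left).
    directory = msg_id // 100_000
    # Page: the msgID with its last 2 digits replaced by "xx".
    page = msg_id // 100
    return f"{OLDPOSTS_URL}/{directory}/{page}xx.html#msg{msg_id}"


print(oldposts_url(28228))  # http://loyce.club/archive/oldposts/0/282xx.html#msg28228
```

The call reproduces the example URL from the list above.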

Limitations
  • Currently, the first 2.1 million posts are available.
  • I'll scrape the first 5.21 million topics and all posts in there.
  • That means I'll archive 53.36 million posts, this partially overlaps with my scraper for new posts.
  • This is a one-time thing, I won't update it with newer posts (I scrape unedited versions for those).
  • The time "scraped on" is Amsterdam time.

If no username is mentioned, it's either "Anonymous" or "random". I forgot those exist when I started scraping, and it's not important enough to start over.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet. It would be useful for a case like this. It requires scraping a couple million pages, and storing 50+ million posts. I can limit the number of files on the server by storing 10 or 100 posts per page. Would this be useful?