Viewing unedited posts and deleted posts, view per post, per user or per topic - page 7.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Update: I now have the first 11 million posts scraped! At the moment, the first 6.1 million are available on loyce.club; processing all data takes approximately 3 days. At the current rate, I'm on schedule to complete archiving all posts around August.

Quote from: LoyceV on February 02, 2020, 01:05:42 PM

Quote from: LoyceV on January 20, 2020, 08:02:51 AM

I've been thinking about expanding my archived posts to all posts that haven't been deleted yet.

An update: I have started this project! Measured in scraping time, it's the biggest project I ever started. In the past 9 days, I've scraped about 4% of all data, so I expect to complete this around August.
There's also a chance I'll run out of disk space because of the millions of large posts made by bounty spammers, but I'll deal with that when it happens.

Sneak preview: http://loyce.club/archive/oldposts/
How to use:

Find the msgID you need. Let's use 28228
Remove the last 5 digits from the msgID to get the directory name (if no digits remain, use 0): 0
Replace the last 2 digits of the msgID by xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
Add "#msg" and the msgID: #msg28228
Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228

Limitations

Currently, the first 2.1 million posts are available.
I'll scrape the first 5.21 million topics and all posts in there.
That means I'll archive 53.36 million posts. This partially overlaps with my scraper for new posts.
This is a one-time thing: I won't update it with newer posts (I scrape unedited versions for those).
The time "scraped on" is Amsterdam time.

If no username is mentioned, it's either "Anonymous" or "random". I forgot those existed when I started scraping, and it's not important enough to start over.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: hosseinimr93 on March 05, 2020, 05:00:50 PM

https://bitcointalksearch.org/topic/bounty-qravity-ico-4337249

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"

Maybe it hits the 64 kB limit in HTML; the Russian characters take a lot more space that way. I'm not sure if that's the limit though: I've made posts that take 80 kB when scraped.
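To illustrate why Cyrillic text can take more space in HTML, here is a rough Python sketch. It assumes (this is not confirmed anywhere in this thread) that the forum stores non-ASCII characters as numeric character references such as &#1087;. The function names are made up for illustration.

```python
def utf8_size(text: str) -> int:
    """Size in bytes when the text is stored as plain UTF-8."""
    return len(text.encode("utf-8"))


def entity_size(text: str) -> int:
    """Size in bytes when every non-ASCII character is written as a
    numeric character reference like &#1087; (assumed storage format)."""
    return len(text.encode("ascii", errors="xmlcharrefreplace"))


if __name__ == "__main__":
    sample = "привет"  # six Cyrillic letters
    print("UTF-8 bytes: ", utf8_size(sample))
    print("Entity bytes:", entity_size(sample))
```

Under that assumption, each Cyrillic letter takes 7 bytes instead of 2, so a post that looks modest in size could reach a byte limit much sooner than an English post of the same length.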

Quote from: alani123 on March 05, 2020, 05:36:30 PM

Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway.

There are more bugs in SMF that cause the preview to show differently than the real post.

Quote

if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing?

I only want to archive what the forum shows as public information.

alani123

legendary

Activity: 2422

Merit: 1451

Leading Crypto Sports Betting & Casino Platform

Quote from: LoyceV on March 05, 2020, 12:20:16 PM

I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.

I should have guessed this. It must be the easiest way to scrape anyway...

Quote from: hosseinimr93 on March 05, 2020, 05:00:50 PM

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"

Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway.

Contents of posts that are otherwise invisible because they include a table with broken tags are accessible to any forum member able to quote the post, but invisible in the eyes of robots. I don't see any utility for any poster to do this to their posts intentionally. If they can edit their threads, the contents could be replaced with something like a dot and be done with it.

But it could be that a few thousand such posts exist. Google gives 3100 results when you search for ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org), some duplicates and some coming from signatures of course.
2550 results if you exclude two users that came up with broken signatures ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org -Gamesbuy -trinaldao)

Now, I'm stepping into territory of a sub-case in a sub-case, but if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing? Probably not. But strictly speaking it should be done if you'd want to grab everything that's available.

hosseinimr93

legendary

Activity: 2380

Merit: 5213

Quote from: LoyceV on March 05, 2020, 12:20:16 PM

Quote from: alani123 on March 05, 2020, 09:10:01 AM

I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?

I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.

Yes, that's an error (maybe a bug) from the forum.

That archived post was like the following post.

https://bitcointalksearch.org/topic/bounty-qravity-ico-4337249

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: alani123 on March 05, 2020, 09:10:01 AM

I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?

I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.

alani123

legendary

Activity: 2422

Merit: 1451

Leading Crypto Sports Betting & Casino Platform

I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: LoyceV on July 21, 2019, 12:08:59 PM

How to use it

Find the msgID, userID or topicID you need. Let's use msgID 51902990.
Remove the last 4 digits from the msgID to get the directory name (if no digits remain, use 0): 5190.
Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.

This is an example of how I use it in practice: I copy the topicID (5229466), then type "topics" on my URL-bar. My browser suggests http://loyce.club/archive/topics/, which I select. Then, I paste the topicID, hit Backspace 4 times, type "/", paste the topicID again, type ".html" and hit enter. It takes some getting used to, but in just 5 seconds I have the page I was looking for: http://loyce.club/archive/topics/522/5229466.html.
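The manual steps above can also be sketched in a few lines of Python. This is only an illustration based on the examples in this thread, not the script used for the archive; the function name archive_url and the constant BASE_URL are made up here, and using 0 as directory for short IDs follows the rule quoted above.

```python
BASE_URL = "http://loyce.club/archive"  # taken from the URLs in this thread


def archive_url(kind: str, item_id: int) -> str:
    """Build the archive URL for a msgID, userID or topicID.

    kind is "posts" (msgID), "members" (userID) or "topics" (topicID).
    """
    if kind not in ("posts", "members", "topics"):
        raise ValueError(f"unknown archive type: {kind!r}")
    if item_id < 0:
        raise ValueError("IDs must not be negative")
    # Dropping the last 4 digits is the same as integer division by 10,000.
    # IDs with 4 or fewer digits end up in directory 0.
    directory = item_id // 10_000
    return f"{BASE_URL}/{kind}/{directory}/{item_id}.html"


if __name__ == "__main__":
    print(archive_url("posts", 51902990))
    print(archive_url("topics", 5229466))
```

Integer division is used instead of string slicing so that short IDs fall into directory 0 without a special case.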

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: ~DefaultTrust on February 22, 2020, 08:22:26 AM

How fast does your parsing work?

See:

Quote from: LoyceV on February 02, 2020, 01:05:42 PM

I expect to complete this around August.

~DefaultTrust

copper member

Activity: 1554

Merit: 489

Stop the war!

Quote from: LoyceV on February 18, 2020, 03:34:49 PM

Quote from: LoyceV on February 02, 2020, 01:05:42 PM

Sneak preview: http://loyce.club/archive/oldposts/

I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.

38 million left.
How fast does your parsing work?

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: dkbit98 on February 21, 2020, 11:30:48 AM

If it is not a secret, how much data space is needed for all those millions of posts?

I'm currently using 54 GB for loyce.club, and store 4.2 million files.

Quote from: LoyceV on July 21, 2019, 12:08:59 PM

Viewing unedited/deleted posts

See http://loyce.club/archive/posts/ for all posts (Working!)
New posts are archived within seconds after being created, and instantly available.
See http://loyce.club/archive/members/ for posts made by a certain user (Working!)
Updated every 5 minutes.
See http://loyce.club/archive/topics/ for posts made in a certain topic (Working!)
Updated every 5 minutes.

How to use it

Find the msgID, userID or topicID you need. Let's use msgID 51902990.
Remove the last 4 digits from the msgID to get the directory name (if no digits remain, use 0): 5190.
Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.

Details

Files are stored with their msgID, userID or topicID as file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML-files. Use CTRL-F to find what you're looking for.
I don't scrape hidden boards (such as Investigations).
I don't keep post titles.
I save raw HTML, including quotes.
If I run out of disk space, I might create compressed archives per 10,000 posts.
Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
My current (sponsored) webhost has enough storage space for years to come.
All scrape-times use Amsterdam time (CET).
Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.

Examples

The unedited version of this post: http://loyce.club/archive/posts/5190/51902990.html
(the layout looks better in more recent archived posts)
All posts made by me: http://loyce.club/archive/members/45/459836.html
(obviously only since I started archiving posts)
All posts made in this topic: http://loyce.club/archive/topics/516/5167469.html
(the first posts aren't shown because of the slightly different format used for my archive)

dkbit98

legendary

Activity: 2212

Merit: 7064

If it is not a secret, how much data space is needed for all those millions of posts?
And is there a way to use some compression?

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Update: I finally got to work on a topic-view. It'll be published at http://loyce.club/archive/topics/, and it's currently crunching data on 1.9 million posts. I'm downloading the thread titles that I don't have yet, and that part takes a lot of time.

This viewer should make it much easier to find all posts made in a certain thread. Obviously this is also limited to posts made after I started scraping.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Bump!

Quote from: LoyceV on February 02, 2020, 01:05:42 PM

Sneak preview: http://loyce.club/archive/oldposts/

I now have the first 6.1 million posts. I'm currently still processing them for publication. Will be updated soon.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: LoyceV on February 02, 2020, 01:05:42 PM

Sneak preview: http://loyce.club/archive/oldposts/

I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: Vod on February 09, 2020, 11:01:57 AM

LoyceV, are you saving the raw data, or just converting it?

I'm only saving the raw post (one line from the raw HTML), and my own header like this one (post number, link to post, link to my archive, by username and scraping time).

Quote from: nutildah on February 09, 2020, 11:06:33 AM

Yes, I'm all for it!

It's started already :D

nutildah

legendary

Activity: 3010

Merit: 8114

Quote from: LoyceV on January 20, 2020, 08:02:51 AM

I've been thinking about expanding my archived posts to all posts that haven't been deleted yet. It would be useful for a case like this. It requires scraping a couple million pages, and storing 50+ million posts. I can limit the number of files on the server by storing 10 or 100 posts per page. Would this be useful?

Yes, I'm all for it!

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

LoyceV, are you saving the raw data, or just converting it?

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Something failed in the above scrape: the Wall Observer thread stopped scraping after page 2628. I thought I had the "last page detection" working, but there still seems to be a flaw.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Bump

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: LoyceV on January 20, 2020, 08:02:51 AM

I've been thinking about expanding my archived posts to all posts that haven't been deleted yet.

An update: I have started this project! Measured in scraping time, it's the biggest project I ever started. In the past 9 days, I've scraped about 4% of all data, so I expect to complete this around August.
There's also a chance I'll run out of disk space because of the millions of large posts made by bounty spammers, but I'll deal with that when it happens.

Sneak preview: http://loyce.club/archive/oldposts/
How to use:

Find the msgID you need. Let's use 28228
Remove the last 5 digits from the msgID to get the directory name (if no digits remain, use 0): 0
Replace the last 2 digits of the msgID by xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
Add "#msg" and the msgID: #msg28228
Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
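For anyone who prefers a script over editing the URL by hand, the steps above could look roughly like this in Python. It's a sketch based only on the msgID 28228 example; the names oldposts_url and OLDPOSTS_URL are made up, and the file name for very short msgIDs (written here as 0xx.html for IDs below 100) is an assumption, since the rule for short IDs isn't fully spelled out.

```python
OLDPOSTS_URL = "http://loyce.club/archive/oldposts"  # from the link above


def oldposts_url(msg_id: int) -> str:
    """Build the oldposts archive URL for a msgID."""
    if msg_id < 0:
        raise ValueError("msgID must not be negative")
    # Directory: drop the last 5 digits (0 if nothing remains).
    directory = msg_id // 100_000
    # File name: replace the last 2 digits by "xx".
    # Assumption: IDs below 100 become "0xx.html".
    page = f"{msg_id // 100}xx.html"
    # The anchor jumps to the individual post inside the page.
    return f"{OLDPOSTS_URL}/{directory}/{page}#msg{msg_id}"


if __name__ == "__main__":
    print(oldposts_url(28228))
```

Each page file groups up to 100 posts, which is why the last two digits are masked and the #msg anchor is needed to land on the right post.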

Limitations

Currently, the first 2.1 million posts are available.
I'll scrape the first 5.21 million topics and all posts in there.
That means I'll archive 53.36 million posts. This partially overlaps with my scraper for new posts.
This is a one-time thing: I won't update it with newer posts (I scrape unedited versions for those).
The time "scraped on" is Amsterdam time.

If no username is mentioned, it's either "Anonymous" or "random". I forgot those existed when I started scraping, and it's not important enough to start over.

Topic: Viewing unedited posts and deleted posts, view per post, per user or per topic - page 7. (Read 8900 times)