
Topic: Viewing unedited posts and deleted posts, view per post, per user or per topic - page 7. (Read 8635 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
legendary
Activity: 2394
Merit: 1412
Leading Crypto Sports Betting & Casino Platform
I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
How to use it
  • Find the msgID, userID or topicID you need. Let's use msgID 51902990.
  • Remove the last 4 digits from the msgID to get the directory name (if there are fewer than 4 digits, use 0): 5190.
  • Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
This is an example of how I use it in practice: I copy the topicID (5229466), then type "topics" in my URL bar. My browser suggests http://loyce.club/archive/topics/, which I select. Then I paste the topicID, hit Backspace 4 times, type "/", paste the topicID again, type ".html" and hit Enter. It takes some getting used to, but in just 5 seconds I have the page I was looking for: http://loyce.club/archive/topics/522/5229466.html.
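
The steps above can also be scripted. This is only a minimal sketch, not the archive's own tooling: the function name archive_url and the constant BASE_URL are made up for illustration, and it assumes the same "drop the last 4 digits" rule applies to posts, topics and members alike.

```python
BASE_URL = "http://loyce.club/archive"  # base URL taken from the links in this thread


def archive_url(kind: str, item_id: int) -> str:
    """Build an archive URL for a msgID, topicID or userID.

    kind is the archive section seen in the thread's links:
    "posts", "topics" or "members" (hypothetical parameter name).
    """
    if item_id < 0:
        raise ValueError("IDs must be non-negative")
    # Dropping the last 4 digits is integer division by 10,000.
    # IDs too short to leave any digits end up in directory 0.
    directory = item_id // 10_000
    return f"{BASE_URL}/{kind}/{directory}/{item_id}.html"


print(archive_url("posts", 51902990))
# http://loyce.club/archive/posts/5190/51902990.html
print(archive_url("topics", 5229466))
# http://loyce.club/archive/topics/522/5229466.html
```

Integer division also covers the short-ID case without a separate branch, which is why the sketch has no special handling for it.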
copper member
Activity: 1540
Merit: 487
Stop the war!
I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.

38 million left.
How fast does your parsing work?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
If it is not a secret, how much storage space is needed for all those millions of posts?
I'm currently using 54 GB for loyce.club, and store 4.2 million files.

Viewing unedited/deleted posts

How to use it
  • Find the msgID, userID or topicID you need. Let's use msgID 51902990.
  • Remove the last 4 digits from the msgID to get the directory name (if there are fewer than 4 digits, use 0): 5190.
  • Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.

Details
  • Files are stored with their msgID, userID or topicID as the file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML files. Use CTRL-F to find what you're looking for.
  • I don't scrape hidden boards (such as Investigations).
  • I don't keep post titles.
  • I save raw HTML, including quotes.
  • If I run out of disk space, I might create compressed archives per 10,000 posts.
  • Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
  • My current (sponsored) webhost has enough storage space for years to come.
  • All scrape-times use Amsterdam time (CET).
  • Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.

Examples
legendary
Activity: 2212
Merit: 7064
Cashback 15%
If it is not a secret, how much storage space is needed for all those millions of posts?
And is there a way to use some compression?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Update: I finally got to work on a topic-view. It'll be published at http://loyce.club/archive/topics/, and it's currently crunching data on 1.9 million posts. I'm downloading the thread titles that I don't have yet, and that part takes a lot of time.

This viewer should make it much easier to find all posts made in a certain thread. Obviously, this is also limited to posts made after I started scraping.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Bump!

I now have the first 6.1 million posts. I'm still processing them for publication, and the archive will be updated soon.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
LoyceV, are you saving the raw data, or just converting it?  
I'm only saving the raw post (one line from the raw HTML), and my own header like this one (post number, link to post, link to my archive, by username and scraping time).

Yes, I'm all for it!
It's started already :D
legendary
Activity: 2940
Merit: 7892
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet. It would be useful for a case like this. It requires scraping a couple million pages, and storing 50+ million posts. I can limit the number of files on the server by storing 10 or 100 posts per page. Would this be useful?

Yes, I'm all for it!
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
LoyceV, are you saving the raw data, or just converting it? 
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Something failed in the above scrape: the Wall Observer thread stopped scraping after page 2628. I thought I had the "last page detection" working, but there still seems to be a flaw.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet.
An update: I have started this project! Measured in scraping time, it's the biggest project I've ever started. In the past 9 days, I've scraped about 4% of all data, so I expect to complete this around August.
There's also a chance I'll run out of disk space because of the millions of large posts made by bounty spammers, but I'll deal with that when it happens.
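
For a rough idea of what "4% in 9 days" implies, here is a back-of-the-envelope sketch. It assumes the scraping rate stays constant, which the post does not promise, and the function name estimate_total_days is made up for illustration.

```python
def estimate_total_days(days_elapsed: float, percent_done: float) -> float:
    """Extrapolate the total duration linearly from the progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return days_elapsed * 100 / percent_done


total = estimate_total_days(9, 4)  # 9 days for about 4% of the data
remaining = total - 9
print(f"total: {total:.0f} days, remaining: {remaining:.0f} days")
# total: 225 days, remaining: 216 days
```

At a constant rate that comes to roughly seven more months of scraping. Large posts, server problems or a change in scraping speed would shift the estimate.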

Sneak preview: http://loyce.club/archive/oldposts/
How to use:
  • Find the msgID you need. Let's use 28228.
  • Remove the last 5 digits from the msgID to get the directory name (if there are fewer than 5 digits, use 0): 0.
  • Replace the last 2 digits of the msgID by xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html.
  • Add "#msg" and the msgID: #msg28228.
  • Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
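
The same steps as a minimal sketch. The names OLDPOSTS_URL and oldpost_url are made up for illustration. The handling of very short msgIDs is an assumption: the sketch simply drops the last 2 digits, which yields 0xx only for msgIDs below 100, so check short IDs by hand.

```python
OLDPOSTS_URL = "http://loyce.club/archive/oldposts"  # from the sneak preview link


def oldpost_url(msg_id: int) -> str:
    """Build the oldposts URL for a msgID, following the steps listed above."""
    if msg_id < 0:
        raise ValueError("msgID must be non-negative")
    # Directory: the msgID without its last 5 digits (0 if nothing is left).
    directory = msg_id // 100_000
    # Page file: the msgID with its last 2 digits replaced by "xx".
    # Assumption: for msgIDs below 100 this gives "0xx.html".
    page = f"{msg_id // 100}xx.html"
    # The "#msg" anchor jumps to the individual post within the page.
    return f"{OLDPOSTS_URL}/{directory}/{page}#msg{msg_id}"


print(oldpost_url(28228))
# http://loyce.club/archive/oldposts/0/282xx.html#msg28228
```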

Limitations
  • Currently, the first 2.1 million posts are available.
  • I'll scrape the first 5.21 million topics and all posts in there.
  • That means I'll archive 53.36 million posts. This partially overlaps with my scraper for new posts.
  • This is a one-time thing. I won't update it with newer posts (I scrape unedited versions for those).
  • The time "scraped on" is Amsterdam time.

If no username is mentioned, it's either "Anonymous" or "random". I forgot those existed when I started scraping, and it's not important enough to start over.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet. It would be useful for a case like this. It requires scraping a couple million pages, and storing 50+ million posts. I can limit the number of files on the server by storing 10 or 100 posts per page. Would this be useful?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I am trying to look for archived posts of a particular user on http://loyce.club/archive/members because his account was nuked, but the Profile ID doesn't show up.

The last update was on 2020-01-04 16:27, but on http://loyce.club/archive/posts the last update was just today, 2020-01-11 12:31. Could an irregularity or a technical error be preventing an update on the members link?
I haven't fully tested this part yet, so updates are only started manually. I'm running an update now; it shouldn't take too long to process a week's worth of posts.
Update: see http://loyce.club/archive/members/274/2743460.html

I think you keep merit even if a post is deleted? Someone correct me if I am wrong.
Correct. And TryNinja is also right, I just don't like my "Merit earned for deleted posts"-counter to go up.
legendary
Activity: 2758
Merit: 6830
Bump (I don't want to delete my previous bump, because it's merited).

I think you keep merit even if a post is deleted? Someone correct me if I am wrong.
You do. He’s not worried about that.

He probably wants to keep the reference for which post earned him his merit. Otherwise, it says it was for post “Deleted”.