
Topic: Viewing unedited posts and deleted posts, view per post, per user or per topic - page 2. (Read 8800 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I missed posts 60409954-60410008, my scraper couldn't catch them because of Cloudflare.
The nearest ones I have are 60409953 and 60410009. That leaves a 15 minute gap.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've censored the following posts in my archive because of doxing (as requested here):
The first post has been removed from Bitcointalk, the other 3 posts have been edited by theymos. I didn't bother highlighting the changes, I just wiped them.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Quote
(Theymos said he'd ban IP addresses that stress the server out with more frequent requests so that's what I have in my scraper)
You'll get a warning page if you go too fast. I've also seen bans for bots that use multiple IPs so don't go there.

Yeah, I figured already that multiple IPs would be unwelcome since it basically circumvents the 1sec restriction.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Just for clarification, that time is assuming there's a 1-second sleep between requests, right?
Yes.

Quote
(Theymos said he'd ban IP addresses that stress the server out with more frequent requests so that's what I have in my scraper)
You'll get a warning page if you go too fast. I've also seen bans for bots that use multiple IPs so don't go there.
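
A minimal sketch of the 1-second rule discussed above, from a single IP. The function name fetch_politely and its parameters are made up for illustration; fetch is whatever download function the scraper already uses, and the clock and sleep arguments only exist so the timing can be checked without waiting.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   min_interval: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch URLs one by one, starting requests at least min_interval seconds apart."""
    pages = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait only for the part of the interval that has not passed yet.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        pages.append(fetch(url))
    return pages
```

Because the wait is measured from the start of the previous request, a slow download does not add an extra second on top.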
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
The problem is we can't know which posts are edited, without checking all posts again. And downloading all posts takes about half a year.

Just for clarification, that time is assuming there's a 1-second sleep between requests, right? (Theymos said he'd ban IP addresses that stress the server out with more frequent requests so that's what I have in my scraper)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Take for example "public key addition" https://ninjastic.space/search?content=public%20key%20addition there is only 1 result on the page with the exact words in order.
I get 2,376 results on that link. If I add double quotes, I get 8 results.
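
One possible explanation for the difference, sketched in Python. This assumes the unquoted query matches any post that contains all the words somewhere, while the quoted query requires them next to each other in order; how ninjastic.space actually ranks and matches is not known here, and the function names are made up.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_words(query: str, text: str) -> bool:
    """True if every query word occurs somewhere in the text, in any order."""
    words = set(tokens(text))
    return all(word in words for word in tokens(query))


def matches_phrase(query: str, text: str) -> bool:
    """True if the query words occur in the text consecutively and in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))
```

A post saying "the addition of a public key" passes the first check but not the second, which is the kind of result that inflates the unquoted count.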

Quote
For this example I was expecting some of my posts about public key addition to appear.
If I search my archive on your posts, I get only one match for "public key addition":
https://loyce.club/archive/posts/5933/59330710.html
That's your post above. If I would download your post history, I may find it again if you edited those posts later.

The problem is we can't know which posts are edited, without checking all posts again. And downloading all posts takes about half a year.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
So that I can make a better Bitcointalk search engine. I can't stand the built-in one on Simple Machines Forum - the search is slow (sometimes it takes over a minute for results to load - if they even do) and the results are never specific enough.
That's exactly what I use it for (https://ninjastic.space/search). Maybe it doesn't suit your needs?

No, unfortunately. The search is sometimes not specific just like Bitcointalk search. Take for example "public key addition" https://ninjastic.space/search?content=public%20key%20addition there is only 1 result on the page with the exact words in order. For this example I was expecting some of my posts about public key addition to appear.
legendary
Activity: 2758
Merit: 6830
So that I can make a better Bitcointalk search engine. I can't stand the built-in one on Simple Machines Forum - the search is slow (sometimes it takes over a minute for results to load - if they even do) and the results are never specific enough.
That's exactly what I use it for (https://ninjastic.space/search). Maybe it doesn't suit your needs?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster but I have no idea when to make my script check for new posts.
Any specific reason why you need them on an ES cluster? That's something I already do to run the ninjastic.space searching.

So that I can make a better Bitcointalk search engine. I can't stand the built-in one on Simple Machines Forum - the search is slow (sometimes it takes over a minute for results to load - if they even do) and the results are never specific enough.
legendary
Activity: 2758
Merit: 6830
I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster but I have no idea when to make my script check for new posts.
Any specific reason why you need them on an ES cluster? That's something I already do to run the ninjastic.space searching.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
@LoyceV how does your bot know when a new message is posted on bitcointalk?
Check Recent Posts often enough.

Quote
I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster
Maybe @TryNinja can get you a copy of his posts archive. I could do it too, but mine is in 2 different formats so more work to figure out.
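
A rough sketch of "check Recent Posts often enough". It assumes the downloaded page contains post links of the form topic=TOPICID.msgMSGID, which gives the topic ID and message ID together; that link pattern, the function name new_posts and the IDs in the usage example are assumptions, not taken from anyone's actual scraper.

```python
import re
from typing import List, Set, Tuple

# Assumed link shape: ...index.php?topic=<topic id>.msg<message id>#msg<message id>
MSG_LINK = re.compile(r"topic=(\d+)\.msg(\d+)")


def new_posts(html: str, seen: Set[int]) -> List[Tuple[int, int]]:
    """Return (message ID, topic ID) pairs not seen before, oldest first.

    The seen set is updated in place, so calling this on every download
    of the page only yields posts that appeared since the last call.
    """
    found = {}
    for topic_id, msg_id in MSG_LINK.findall(html):
        found[int(msg_id)] = int(topic_id)
    fresh = sorted(m for m in found if m not in seen)
    seen.update(fresh)
    return [(m, found[m]) for m in fresh]
```

Usage would be something like calling new_posts(page, seen) after each download and queueing whatever comes back for archiving.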
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
@LoyceV how does your bot know when a new message is posted on bitcointalk? Last time I checked, you could only reach a message ID through a link with the topic ID inside it, and the topic ID is non-trivial to find.

I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster but I have no idea when to make my script check for new posts.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Full disclosure: I censored my own post in my archive, because I messed up quoting myself and didn't intend to share the address publicly.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Full disclosure: I temporarily censored post 57612715 in my archive. This was requested by the author (Royse777).
I don't like censoring data, and see no reason to permanently delete this. However, since it's part of an active scam investigation, it might help to keep it quiet for now. So I've removed it, and will restore the original archive in 90 days.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Theymos manually changed the ownership of 2 (deleted) posts:
Done.

Posts are ordered in threads by message ID. What we do in these cases is that we repurpose the ID of a deleted post which lies in the appropriate range.
My records still show the original post.

Post 39334982 created by ChainEX now shows as created by 1miau.
Post 57307359 created by skillscreating now shows as created by Pmalek.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I'm temporarily halting updates for the list per user and per topic, so I can merge the 50+ million older posts with the current version.
I've finished merging posts. Updates are working again!
Now all "users" and "topics" lists contain all posts, including the posts made long before I joined the forum.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I'm temporarily halting updates for the list per user and per topic, so I can merge the 50+ million older posts with the current version.
I don't know yet how long this is going to take.

Update: There are now some duplicates in my posts archive:
Code:
9118. Post 51903915 (scraped 2020-07-08_Wed_09.55h, might have been edited)
9119. Post 51903915 (by LoyceV) (scraped Sun Jul 21 20:37:55 CEST 2019)
9120. Post 51903915 (by LoyceV) (scraped Sun Jul 21 20:37:55 CEST 2019)
This is going to take a bit longer to fix.
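
A small sketch of one way to clean up a listing like the one above. It assumes that two entries with the same post ID and the same scrape timestamp are true duplicates, while an entry with a different scrape time may be an edited version that should stay. The Entry type and function name are made up for this example.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[int, str]  # (post ID, scrape timestamp as written in the listing)


def drop_exact_duplicates(entries: Iterable[Entry]) -> List[Entry]:
    """Keep the first copy of each (post ID, timestamp) pair, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            unique.append(entry)
    return unique
```

Applied to the three entries for post 51903915 above, this keeps the 2020 scrape and one copy of the 2019 scrape.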

Update: The "per user" archives are working again. The "per topic" archives have more problems. I'll restart those from scratch this weekend.
Update: At the rate this is going, it's going to take at least a week to complete.
Update (June 9): it's going to take another week.
legendary
Activity: 2758
Merit: 6830
I'm also guessing TryNinja catches missing posts by downloading the next recent page.
That was my plan, but I never actually implemented this. I just scrape the first (recent posts) page every ~5 seconds.

Maybe it's just a lottery and I also missed a few posts that you caught, but I can't verify this now.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Update on posts missed by my scraper

To troubleshoot, I'm now saving every downloaded version of recent. I'll check again for missing posts in a few days.
Let's see the results.

Since this last post, I've scraped 7 full sets of 10,000 posts: 5681xxxx-5687xxxx.
For each directory, I managed to archive this many posts:
Code:
5681: 9994
5682: 9994
5683: 9994
5684: 9998
5685: 9995
5686: 9993
5687: 9994

The following posts are missing:
Code:
5681
     1  56811685
     2  56813372
     3  56815013
     4  56815785
     5  56817721
     6  56817792
5682
     1  56820249
     2  56820735
     3  56821187
     4  56821286
     5  56826406
     6  56828175
5683
     1  56831138
     2  56834741
     3  56835992
     4  56838503
     5  56839353
     6  56839667
5684
     1  56844528
     2  56849500
5685
     1  56852011
     2  56852479
     3  56852686
     4  56852777
     5  56853752
5686
     1  56860300
     2  56862521
     3  56864014
     4  56864015
     5  56864020
     6  56867118
     7  56867655
5687
     1  56870182
     2  56874327
     3  56875992
     4  56876002
     5  56877337
     6  56878276
Only 2 posts are consecutive: 56864014 and 56864015, which could indicate a scraper-problem. Ninjastic Space doesn't have them either, so I guess it's a coincidence.
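
For anyone who wants to run the same check, a minimal sketch: list the IDs in a range that are not in the archive, then look for runs of consecutive missing IDs. The function names are made up for this example, and it says nothing about how the archive itself is stored.

```python
from typing import Iterable, List


def missing_ids(archived: Iterable[int], first: int, last: int) -> List[int]:
    """Return every ID in first..last (inclusive) that is not in the archive."""
    have = set(archived)
    return [i for i in range(first, last + 1) if i not in have]


def consecutive_runs(ids: Iterable[int]) -> List[List[int]]:
    """Group IDs into runs of neighbours and return only runs longer than one."""
    runs, current = [], []
    for i in sorted(ids):
        if current and i == current[-1] + 1:
            current.append(i)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [i]
    if len(current) > 1:
        runs.append(current)
    return runs
```

Feeding the 38 missing IDs above into consecutive_runs gives a single run, 56864014 and 56864015.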

I've checked all of them. The following posts exist on Ninjastic Space:
https://ninjastic.space/post/56815785 spam
https://ninjastic.space/post/56820249 spam
https://ninjastic.space/post/56820735 spam
https://ninjastic.space/post/56826406 > Wall Observer
https://ninjastic.space/post/56835992 > Wall Observer
https://ninjastic.space/post/56839667 spam
https://ninjastic.space/post/56852686 > Wall Observer
https://ninjastic.space/post/56852777 > Russian
https://ninjastic.space/post/56867655 > Wall Observer
https://ninjastic.space/post/56874327 spam
https://ninjastic.space/post/56877337 spam
The spam gets removed within seconds by MindlessElectron, so missing those posts is acceptable.
That leaves 5 posts (0.007%) that I shouldn't have missed (assuming the other missing posts were either spam or posted on hidden boards).
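
The arithmetic behind that percentage, using only the numbers from this post (the variable names are just for illustration):

```python
# Archived posts per directory of 10,000 IDs, as listed above.
archived = {
    "5681": 9994, "5682": 9994, "5683": 9994, "5684": 9998,
    "5685": 9995, "5686": 9993, "5687": 9994,
}

total_ids = 10_000 * len(archived)                          # 70,000 post IDs
total_missing = sum(10_000 - n for n in archived.values())  # all missing posts

# Found on Ninjastic Space: 6 spam, 4 Wall Observer, 1 Russian.
found_elsewhere = 11
spam = 6
should_have_caught = found_elsewhere - spam

miss_rate_percent = should_have_caught / total_ids * 100
print(total_missing, should_have_caught, round(miss_rate_percent, 3))
```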
I've noticed before that I sometimes miss posts made in the Wall Observer thread. I'm guessing it might have to do with the size of the thread.

I found the cause when checking the first missing Wall Observer post (56826406): It's missing from all versions of Recent Posts that I downloaded.
See: https://loyce.club/other/recent/recent.Tue%20Apr%2020%2013:35:13%20CEST%202021.html
This version has posts 56826404, 56826407 and 56826408. The posts in between are missing.
Now compare: https://loyce.club/other/recent/recent.Tue%20Apr%2020%2013:35:55%20CEST%202021.html
This version has posts 56826404 and 56826408. The posts in between are missing, which is why I couldn't scrape post 56826406.
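
The same comparison as a sketch, using only the IDs named above (the real pages list more posts). The function names are made up; snapshots are just sets of message IDs taken from each saved version of Recent Posts.

```python
from typing import Iterable, List, Set


def never_listed(snapshots: Iterable[Set[int]]) -> List[int]:
    """IDs inside the covered range that appear in none of the snapshots."""
    seen = set().union(*snapshots)
    return [i for i in range(min(seen), max(seen) + 1) if i not in seen]


def vanished(earlier: Set[int], later: Set[int]) -> List[int]:
    """IDs listed earlier that are gone later although the later page still spans them."""
    low, high = min(later), max(later)
    return sorted(i for i in earlier if i not in later and low <= i <= high)


first = {56826404, 56826407, 56826408}   # 13:35:13 version
second = {56826404, 56826408}            # 13:35:55 version

print(never_listed([first, second]))     # IDs that never showed up at all
print(vanished(first, second))           # IDs that were listed, then removed
```

With these two versions, 56826406 falls in the never-listed group, which matches why it couldn't be scraped.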

I'm guessing SMF can't really handle the volume of new posts being created (and deleted).

I'm also guessing TryNinja catches missing posts by downloading the next recent page. My plan is to implement that from another VPS for missing posts, but I need some time to do this.