Topic: Viewing unedited posts and deleted posts, view per post, per user or per topic - page 2. (Read 8900 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've reduced my scraping frequency for new posts (to reduce unnecessary server load). I created my posts archive when there were many more posts per day, and the recent page only shows the last 10 so I have to be quick.
If I start missing more posts, that's either because they were spam and MindlessElectron deleted them before I scraped them, or because there were many posts in a short interval. I'll keep an eye on this.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Full disclosure (I blame Cloudflare):
I'm missing all posts between posts 61426048 and 61427300. That's 1251 missing posts in 5 hours and 40 minutes.
For 8 hours and 24 minutes, my scraper wasn't working. Anything between posts 61513760 and 61515614 (1853 posts) is missing.
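The counts above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it counts message IDs strictly between the two endpoints, which is not necessarily the same as the number of publicly visible posts.

```python
def missing_between(last_before: int, first_after: int) -> int:
    """Count the message IDs strictly between two archived IDs."""
    return max(0, first_after - last_before - 1)


def ids_per_hour(missing: int, hours: int, minutes: int) -> float:
    """Average number of missing IDs per hour of downtime."""
    return missing / (hours + minutes / 60)


gap_a = missing_between(61426048, 61427300)
gap_b = missing_between(61513760, 61515614)

print(gap_a, round(ids_per_hour(gap_a, 5, 40)))  # 1251 221
print(gap_b, round(ids_per_hour(gap_b, 8, 24)))  # 1853 221
```

Both outages work out to roughly the same posting rate, which is a quick sanity check on the numbers.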
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I missed posts 60409954-60410008; my scraper couldn't catch them because of Cloudflare.
The nearest ones I have are 60409953 and 60410009. That leaves a 15-minute gap.
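Holes like this one could be found by scanning the archived IDs in order. A minimal sketch, assuming the archive can be reduced to a plain list of message IDs; find_gaps is a hypothetical helper, not part of the actual scraper, and a hole in the ID sequence may also be a post that was never publicly visible.

```python
from typing import Iterable, List, Tuple


def find_gaps(archived_ids: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Return (first_missing, last_missing, count) for each hole in the IDs."""
    ids = sorted(set(archived_ids))
    gaps = []
    # Compare each archived ID with its successor in sorted order.
    for prev, nxt in zip(ids, ids[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1, nxt - prev - 1))
    return gaps


# The two nearest archived posts mentioned above.
print(find_gaps([60409953, 60410009]))  # [(60409954, 60410008, 55)]
```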
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I've censored the following posts in my archive because of doxing (as requested here):
The first post has been removed from Bitcointalk; the other 3 posts have been edited by theymos. I didn't bother highlighting the changes, I just wiped them.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Quote
(Theymos said he'd ban IP addresses that stress the server out with more frequent requests so that's what I have in my scraper)
You'll get a warning page if you go too fast. I've also seen bans for bots that use multiple IPs so don't go there.

Yeah, I figured already that multiple IPs would be unwelcome since that basically circumvents the 1-second restriction.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Just for clarification, that time is assuming there's a 1-second sleep between requests, right?
Yes.

Quote
(Theymos said he'd ban IP addresses that stress the server out with more frequent requests so that's what I have in my scraper)
You'll get a warning page if you go too fast. I've also seen bans for bots that use multiple IPs so don't go there.
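The 1-second spacing discussed above could be enforced with a small generator like the one below. This is a sketch under assumptions: throttled and its parameters are made up for illustration, the actual fetch is left out, and the 1.0 default is simply the 1-second sleep mentioned in this thread, not a documented limit.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(items: Iterable[str],
              min_interval: float = 1.0,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield items no faster than one per min_interval seconds."""
    last = None
    for item in items:
        if last is not None:
            # Time already spent on the previous item counts toward the interval.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


# Usage: replace the print with the real request, made from a single IP.
for url in throttled(["page-1", "page-2"], min_interval=0.01):
    print("would fetch", url)
```

The clock and sleep parameters exist only so the pacing can be tested without actually waiting.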
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
The problem is we can't know which posts are edited, without checking all posts again. And downloading all posts takes about half a year.

Just for clarification, that time is assuming there's a 1-second sleep between requests, right? (Theymos said he'd ban IP addresses that stress the server out with more frequent requests so that's what I have in my scraper)
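For a sense of scale, the relation between request count and crawl time at a fixed delay can be worked out as below. A rough sketch only: the thread does not say how many requests a full download needs, so the code just shows how many 1-second requests fit in half a year and how long a given number of requests would take. Both function names are hypothetical.

```python
from datetime import timedelta


def requests_in(days: float, seconds_per_request: float = 1.0) -> int:
    """How many sequential requests fit in the given number of days."""
    return int(days * 86400 / seconds_per_request)


def crawl_duration(requests: int, seconds_per_request: float = 1.0) -> timedelta:
    """Total time for sequential requests with a fixed delay each."""
    return timedelta(seconds=requests * seconds_per_request)


print(requests_in(182.5))         # 15768000 requests in half a year
print(crawl_duration(1_000_000))  # time for an assumed one million requests
```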
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Take for example "public key addition" (https://ninjastic.space/search?content=public%20key%20addition): there is only 1 result on the page with the exact words in order.
I get 2,376 results on that link. If I add double quotes, I get 8 results.
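The difference between the two result counts is the usual one between matching the words anywhere and matching them as an exact phrase. The sketch below only illustrates that distinction in plain Python; it makes no claim about how ninjastic.space or the forum search is implemented, and the two sample posts are invented.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_terms(text: str, query: str) -> bool:
    """Loose match: every query word appears somewhere in the text."""
    words = set(tokens(text))
    return all(term in words for term in tokens(query))


def matches_phrase(text: str, query: str) -> bool:
    """Strict match: the query words appear consecutively and in order."""
    doc, phrase = tokens(text), tokens(query)
    n = len(phrase)
    return n > 0 and any(doc[i:i + n] == phrase
                         for i in range(len(doc) - n + 1))


posts = [
    "Public key addition is just point addition.",
    "The addition of a new key made the public list longer.",
]
query = "public key addition"
print([matches_all_terms(p, query) for p in posts])  # [True, True]
print([matches_phrase(p, query) for p in posts])     # [True, False]
```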

Quote
For this example I was expecting some of my posts about public key addition to appear.
If I search my archive on your posts, I get only one match for "public key addition":
https://loyce.club/archive/posts/5933/59330710.html
That's your post above. If I downloaded your post history, I might find it again if you edited those posts later.

The problem is we can't know which posts are edited, without checking all posts again. And downloading all posts takes about half a year.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
So that I can make a better Bitcointalk search engine. I can't stand the built-in one on Simple Machines Forum - the search is slow (sometimes results take over a minute to load - if they load at all) and the results are never specific enough.
That's exactly what I use it for (https://ninjastic.space/search). Maybe it doesn't suit your needs?

No, unfortunately. The search is sometimes not specific enough, just like the Bitcointalk search. Take for example "public key addition" (https://ninjastic.space/search?content=public%20key%20addition): there is only 1 result on the page with the exact words in order. For this example I was expecting some of my posts about public key addition to appear.
legendary
Activity: 2758
Merit: 6830
So that I can make a better Bitcointalk search engine. I can't stand the built-in one on Simple Machines Forum - the search is slow (sometimes results take over a minute to load - if they load at all) and the results are never specific enough.
That's exactly what I use it for (https://ninjastic.space/search). Maybe it doesn't suit your needs?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster but I have no idea when to make my script check for new posts.
Any specific reason why you need them on an ES cluster? That's something I already do to run the ninjastic.space search.

So that I can make a better Bitcointalk search engine. I can't stand the built-in one on Simple Machines Forum - the search is slow (sometimes results take over a minute to load - if they load at all) and the results are never specific enough.
legendary
Activity: 2758
Merit: 6830
I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster but I have no idea when to make my script check for new posts.
Any specific reason why you need them on an ES cluster? That's something I already do to run the ninjastic.space search.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
@LoyceV how does your bot know when a new message is posted on bitcointalk?
Check Recent Posts often enough.

Quote
I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster
Maybe @TryNinja can get you a copy of his posts archive. I could do it too, but mine is in 2 different formats so more work to figure out.
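Checking Recent Posts "often enough" comes down to making sure consecutive polls overlap. A minimal sketch of that bookkeeping, assuming the message IDs have already been parsed out of the fetched page (fetching and parsing are not shown, and process_poll is a hypothetical helper, not the actual bot):

```python
from typing import Iterable, List, Tuple


def process_poll(last_seen: int,
                 recent_ids: Iterable[int]) -> Tuple[List[int], bool, int]:
    """Return (new_ids, may_have_missed, new_last_seen) for one poll."""
    ids = sorted(set(recent_ids))
    new_ids = [i for i in ids if i > last_seen]
    # If nothing on the page was seen before, the polls no longer overlap,
    # so posts may have scrolled off the page between two polls.
    may_have_missed = bool(ids) and len(new_ids) == len(ids) and last_seen > 0
    new_last_seen = max([last_seen] + new_ids)
    return new_ids, may_have_missed, new_last_seen


print(process_poll(100, [98, 99, 100, 101, 102]))  # ([101, 102], False, 102)
print(process_poll(100, [105, 106, 107]))          # ([105, 106, 107], True, 107)
```

When may_have_missed comes back true, the polling interval was too long for the posting rate at that moment.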
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
@LoyceV how does your bot know when a new message is posted on bitcointalk? Last time I checked, you could only get a message ID via a link with the topic ID inside it, but it is non-trivial to find the topic ID.

I am trying to scrape posts from bitcointalk to load into my Elasticsearch cluster but I have no idea when to make my script check for new posts.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Full disclosure: I censored my own post in my archive, because I messed up quoting myself and didn't intend to share the address publicly.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Full disclosure: I temporarily censored post 57612715 in my archive. This was requested by the author (Royse777).
I don't like censoring data, and see no reason to permanently delete this. However, since it's part of an active scam investigation, it might help to keep it quiet for now. So I've removed it, and will restore the original archive in 90 days.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Theymos manually changed the ownership of 2 (deleted) posts:
Done.

Posts are ordered in threads by message ID. What we do in these cases is that we repurpose the ID of a deleted post which lies in the appropriate range.
My records still show the original post.

Post 39334982 created by ChainEX now shows as created by 1miau.
Post 57307359 created by skillscreating now shows as created by Pmalek.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I'm temporarily halting updates for the list per user and per topic, so I can merge the 50+ million older posts with the current version.
I've finished merging posts. Updates are working again!
Now all "users" and "topics" lists contain all posts, including the posts made long before I joined the forum.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I'm temporarily halting updates for the list per user and per topic, so I can merge the 50+ million older posts with the current version.
I don't know yet how long this is going to take.

Update: There are now some duplicates in my posts archive:
Code:
9118. Post 51903915 (scraped 2020-07-08_Wed_09.55h, might have been edited)
9119. Post 51903915 (by LoyceV) (scraped Sun Jul 21 20:37:55 CEST 2019)
9120. Post 51903915 (by LoyceV) (scraped Sun Jul 21 20:37:55 CEST 2019)
This is going to take a bit longer to fix.
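Duplicates like the ones listed above could be located by grouping the index lines by post ID. This is a sketch, assuming each index line contains "Post <ID>" as in the listing; find_duplicates is a hypothetical helper, and it only reports duplicates because the thread does not say which copy should be kept.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

POST_LINE = re.compile(r"Post (\d+)\b")


def find_duplicates(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Map each post ID that occurs more than once to its index lines."""
    seen: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        match = POST_LINE.search(line)
        if match:
            seen[int(match.group(1))].append(line.strip())
    return {pid: entries for pid, entries in seen.items() if len(entries) > 1}


index = [
    "9118. Post 51903915 (scraped 2020-07-08_Wed_09.55h, might have been edited)",
    "9119. Post 51903915 (by LoyceV) (scraped Sun Jul 21 20:37:55 CEST 2019)",
    "9120. Post 51903915 (by LoyceV) (scraped Sun Jul 21 20:37:55 CEST 2019)",
]
for post_id, entries in find_duplicates(index).items():
    print(post_id, len(entries))  # 51903915 3
```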

Update: The "per user" archives are working again. The "per topic" archives have more problems. I'll restart those from scratch this weekend.
Update: At the rate this is going, it's going to take at least a week to complete.
Update (June 9): it's going to take another week.