
Topic: Bitcointalk Search Project - page 2. (Read 774 times)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
September 28, 2024, 06:16:32 AM
#46
Scraping has resumed

(I am respecting Bitcointalk.org rate limits)
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
September 21, 2024, 03:09:28 PM
#45
Just allow us to run any SELECT query we want and you will have covered every potential feature!

No, seriously. Just load the database with all the posts and allow the user to execute any SELECT query they want, and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

 Grin
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
September 19, 2024, 02:26:09 AM
#44
Bump. I just got a new server for the scraper along with a bunch of HTTP proxies.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)

This bold part should solve my problem with endless Cloudflare captchas. Besides, the parser is bound to take breaks once I catch up with the database downloads, at which point it will be limited to checking for new posts and edits to old posts. I wish there were a mod log that recorded post-edit events.
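
For reference, the bolded rule - once per second on average, with modest bursts allowed - maps naturally onto a token bucket. A minimal Python sketch; the class, rate, and capacity values here are illustrative assumptions, not the project's actual code:

Code:
import time

class TokenBucket:
    """Allow `rate` requests/second on average, with bursts up to `capacity`."""

    def __init__(self, rate=1.0, capacity=5):
        self.rate = rate           # tokens replenished per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until one token is available, then spend it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=1.0, capacity=5)
# call bucket.acquire() before every request to bitcointalk.org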
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
August 06, 2024, 05:52:46 PM
#43
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

I use multiple parsers for backup - if one goes down for whatever reason, a second one can take over.  90% of the time, my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP.  I parse once every ten seconds to check for any new posts, and if there are any, I parse them.  My record locking system has a parse delay for many things to prevent it from hitting bct too often.  I don't even parse as a logged-in user.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
August 06, 2024, 10:14:09 AM
#42
My scraper was broken by Cloudflare after about 58K posts.
If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems when Cloudflare went into full DDoS protection mode.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 28, 2024, 01:28:13 AM
#41
I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

You are still thinking of ONE parser going out pretending to be another parser.  You are fighting against every fraud detection tool out there.  

Create a schedule table in your database.   Columns include jobid, lockid, lastjob and parsedelay.   When your parser grabs a job, it locks it in the table so the next parser will grab a different job.   It releases the lock when it finishes.   Your parser can call the first record in the schedule based on (lastjob+parsedelay) where lockid is free.

Edit:  Then go to one of the cloud providers and use a free service to create a second parser.
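
A minimal sketch of the schedule table Vod describes, done here with SQLite from Python for illustration. The column names (jobid, lockid, lastjob, parsedelay) follow his post; everything else is an assumption:

Code:
import sqlite3
import time

db = sqlite3.connect("scheduler.db", isolation_level=None)  # autocommit mode
db.execute("""CREATE TABLE IF NOT EXISTS schedule (
    jobid      INTEGER PRIMARY KEY,
    lockid     TEXT,             -- NULL means the job is free
    lastjob    REAL DEFAULT 0,   -- unix time the job last ran
    parsedelay REAL DEFAULT 10   -- minimum seconds between runs
)""")

def grab_job(worker_id):
    """Claim the first due, unlocked job; return its jobid, or None."""
    db.execute("BEGIN IMMEDIATE")  # write lock so two parsers can't race
    row = db.execute(
        "SELECT jobid FROM schedule"
        " WHERE lockid IS NULL AND lastjob + parsedelay <= ?"
        " ORDER BY lastjob LIMIT 1", (time.time(),)).fetchone()
    if row:
        db.execute("UPDATE schedule SET lockid = ? WHERE jobid = ?",
                   (worker_id, row[0]))
    db.execute("COMMIT")
    return row[0] if row else None

def release_job(jobid):
    """Release the lock and record when the job last ran."""
    db.execute("UPDATE schedule SET lockid = NULL, lastjob = ? WHERE jobid = ?",
               (time.time(), jobid))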
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 28, 2024, 01:22:18 AM
#40
I explained how to do this five posts ago.... Smiley    ^^^

Wow you're fast  Grin

You mean this?

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 28, 2024, 01:19:15 AM
#39
My scraper was broken by Cloudflare after about 58K posts. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I explained how to do this five posts ago.... Smiley    ^^^
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 28, 2024, 01:18:15 AM
#38
My scraper was broken by Cloudflare after about 58K posts. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I do, however, have LoyceV's archive (thanks Loyce), but I am not sure whether it covers posts before 2018.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 20, 2024, 02:42:12 PM
#37
Is there no option to ask Theymos to offer - in one way or another - the list of "last" edited posts (maybe after a date)?
I mean, the forum knows them; it only has to offer them somehow.

The only reason he couldn't do this would be if the indexing slowed down the site.   I can't imagine any native SMF code that searches based on that key.  :/
legendary
Activity: 3668
Merit: 6382
Looking for campaign manager? Contact icopress!
July 20, 2024, 02:10:16 PM
#36
Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that to the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

Is there no option to ask Theymos to offer - in one way or another - the list of "last" edited posts (maybe after a date)?
I mean, the forum knows them; it only has to offer them somehow.
legendary
Activity: 1624
Merit: 2594
Top Crypto Casino
July 20, 2024, 01:53:43 PM
#35
It's not perfect, obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

I also think that it won't be feasible, at least not with a single scraper. As Vod said, his post history exceeds 21,000 entries. Continuously monitoring all his posts for edits would be very resource-intensive and/or time-consuming. And what about other members with even more activity, like philipma1957, BADecker, JayJuanGee, franky1, and others? We're talking hundreds of thousands of posts that would need parsing every day...
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 20, 2024, 01:11:28 PM
#34
That's what I'm saying. I will have to manually go through all the user's posts (but hey, at least I'll have their message and topic IDs this time!), but I know that statistically I will only need to search a few users at a time, because editing is infrequent compared to posting.

Manual processes will cause this project to fail - too much data.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 20, 2024, 12:56:48 PM
#33
There is no way for you to track which post I may edit.  You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited.  The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.

That's what I'm saying. I will have to manually go through all the user's posts (but hey, at least I'll have their message and topic IDs this time!), but I know that statistically I will only need to search a few users at a time, because editing is infrequent compared to posting.

It's not perfect, obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

* When the initial download is finished
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 20, 2024, 12:45:13 PM
#32
This leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

There is no way for you to track which post I may edit.  You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited.  The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 20, 2024, 12:03:19 PM
#31
We have seen a lot of threads with "Reserved" posts meant to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

This is going to be a live search engine, so every post is going to be kept up to date, and removed if the original post is removed as well.

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that to the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

The number of new posts daily >>> The number of edited posts daily

*I do not currently track the "last edited" time because it is an unreliable indicator for determining whether a given post might be edited in the future.
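
A rough sketch of what that prioritization heuristic could look like in Python. The user-record fields and the one-day login filter are assumptions for illustration, not the project's actual data model:

Code:
import time

INACTIVITY_CUTOFF = 120 * 86400  # the "woke up from inactivity" threshold, in seconds

def users_to_scan_for_edits(users, now=None):
    """Return user IDs worth rescanning for edits/deletions, busiest posters first.

    `users` is assumed to be a list of dicts with keys: uid, banned,
    last_post_time, last_login_time, recent_posts_per_day (all hypothetical).
    """
    now = now or time.time()
    candidates = []
    for u in users:
        if u["banned"]:
            continue  # banned accounts are excluded, as described above
        if now - u["last_post_time"] > INACTIVITY_CUTOFF:
            continue  # dormant accounts are unlikely to edit anything
        if now - u["last_login_time"] > 86400:
            continue  # not seen in the last day, so no fresh edits to find
        candidates.append(u)
    # Frequent posters are statistically the most likely recent editors.
    candidates.sort(key=lambda u: u["recent_posts_per_day"], reverse=True)
    return [u["uid"] for u in candidates]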
legendary
Activity: 2758
Merit: 6830
July 20, 2024, 09:25:56 AM
#30
We have seen a lot of threads with "Reserved" posts meant to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
There is no easy solution for that. It's impossible to keep track of 55 million posts.
legendary
Activity: 3346
Merit: 3125
July 20, 2024, 08:32:42 AM
#29
I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process.  If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area. 

OK, then what would happen if a user edited his post after it was recorded in the search engine database? That way it will be impossible to search for new information in edited posts, and I feel like that will be a problem for this project.

We have seen a lot of threads with "Reserved" posts meant to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
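
For illustration, the front-page polling Vod suggested earlier could look roughly like this with requests and BeautifulSoup; the link selector is a placeholder that would need to be matched to the real "Newest Posts" markup:

Code:
import time
import requests
from bs4 import BeautifulSoup

seen = set()

while True:
    page = requests.get("https://bitcointalk.org/", timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")
    # Placeholder selector: the real "Newest Posts" block would need to be
    # located in the page markup and matched more precisely than this.
    for link in soup.select("a[href*='topic=']"):
        url = link["href"]
        if url not in seen:
            seen.add(url)
            print("new link to parse:", url)  # hand off to the parser here
    time.sleep(10)  # poll every ten seconds, per Vod's suggestion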
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 19, 2024, 09:45:52 AM
#28
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
Is that a compressed tarball?
Yes.

Quote
Because for my daily backups I usually tar my folders without compression to make it go many times faster.
I'm using pigz, which now only uses 1% of 1 CPU core. Reading from disk is the limitation. I thought I'd do you a favour by making one file instead of giving you half a million compressed files.

Quote
I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.
By all means: use my data for this Smiley Only a few posts are censored by me. Other than that, the file format is pretty much the same everywhere.
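
As an aside, the tar-through-pigz pipeline LoyceV describes can be driven from Python without writing an uncompressed intermediate file. A sketch, assuming pigz is on the PATH (the directory name and thread count are made up):

Code:
import subprocess
import tarfile

# Pipe a tar stream straight into pigz so compression runs in parallel
# and no uncompressed intermediate file touches the (spinning) disk.
with open("posts.tar.gz", "wb") as out:
    pigz = subprocess.Popen(["pigz", "-p", "8"], stdin=subprocess.PIPE, stdout=out)
    with tarfile.open(fileobj=pigz.stdin, mode="w|") as tar:
        tar.add("posts/")  # directory with the scraped post files
    pigz.stdin.close()
    pigz.wait()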
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 19, 2024, 07:12:29 AM
#27
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball? Because for my daily backups I usually tar my folders without compression to make it go many times faster.



I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.

It would also give me time to figure out the best setup for this.