Topic: Bitcointalk Search Project - page 2. (Read 1110 times)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 04:12:13 AM
#60
And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).

So maybe I'll introduce some sort of tagging system and just manually add tags to some posts (yes, posts, not threads) so users can at least find what they're looking for.
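
For illustration, this is roughly what manually tagging an already-indexed post could look like on the Elasticsearch side (the index name "posts", the field name "tags", the post ID and the tag values are all made-up placeholders, not the project's actual setup):
Code:
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch endpoint

def add_tags(post_id, tags):
    # Partial update: merge a "tags" array into the already-indexed document,
    # so name/URL searches can still find posts whose body now only says
    # "[banned mixer]".
    resp = requests.post(f"{ES}/posts/_update/{post_id}",
                         json={"doc": {"tags": tags}}, timeout=10)
    resp.raise_for_status()

# Hypothetical example: tag a censored mixer review post.
add_tags(12345678, ["examplemixer", "examplemixer.com", "mixer review"])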
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:56:12 AM
#59
See, that's the thing - there isn't actually a delay built into the scraper, but it appears that there is sometimes a long waiting time when I make any request to Bitcointalk.
When I click all, I have to click "Verify" from Cloudflare. That indeed means they're "tougher" than usual.

Quote
Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (if I ever do get rate limited, there is an exponential backoff in place); however, in this case it looks like I have tripped an anti-DDoS measure on the forum's web server.
I just tried this oneliner:
Code:
i=5503125; time while test $i -le 5503134; do wget "https://bitcointalk.org/index.php?topic=$i.0" -O $i.html; sleep 1; i=$((i+1)); done
It took 13 seconds on my whitelisted server, and 12 seconds on a server that isn't whitelisted. Both include 9 seconds "sleep".
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:44:56 AM
#58
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?

See, that's the thing - there isn't actually a delay built into the scraper, but it appears that there is sometimes a long waiting time when I make any request to Bitcointalk.

Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (if I ever do get rate limited, there is an exponential backoff in place); however, in this case it looks like I have tripped an anti-DDoS measure on the forum's web server.
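
For context, a minimal sketch of what exponential backoff around a page fetch typically looks like (illustrative Python, not the project's actual scraper; the retry count and delays are arbitrary):
Code:
import time
import requests

def fetch_with_backoff(url, max_retries=6, base_delay=2.0):
    # Retry with exponentially growing waits when Cloudflare rate limits
    # (error 1015 is delivered as an HTTP 429 response).
    delay = base_delay
    for _ in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        time.sleep(delay)
        delay *= 2        # double the wait after every rejection
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")

page = fetch_with_backoff("https://bitcointalk.org/index.php?topic=5503125.0")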
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:28:12 AM
#57
Due to rate limiting constraints, it can only navigate to about 2.5k pages per day
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?
At 70k threads per month, it's going to take you 6.5 years to scrape all 5.5 million threads. I remember it took me about 4 months scraping all the old posts, and that was only up to 2019. If you go from 10 to 1 second per request, it still takes 6 months.
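
For reference, the arithmetic behind that estimate is just a division of the observed rate (the 1-second figure below is only a linear scaling and comes out slightly above six months):
Code:
total_threads = 5_500_000        # roughly the highest topic number today
threads_per_month = 70_000       # observed rate at ~10 s per request

months = total_threads / threads_per_month
print(f"{months:.0f} months = {months / 12:.1f} years")   # ~79 months, ~6.5 years

# Linear scaling to 1 s per request (ten times faster):
print(f"{months / 10:.1f} months")                        # ~8 months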
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:11:29 AM
#56
Those numbers still don't add up. Less than one post per thread?

Sorry, I keep instinctively writing 'post' instead of 'thread'. Just %s/post/thread/g if you know what I mean Smiley

So basically, my scraper is built in terms of topics. It takes a range of topic numbers to scrape, and works through each of them one by one, continuously clicking on the Next Page button in the process until it's on the last page.

Like this, I can scrape 20 posts at once per page load. Though I really wish I could display more of them - that would make the process considerably faster.

About 1/3 or so of the topics I scraped don't exist (deleted, quarantined, or nuked for whatever reason), so the parser runs very fast in those cases. But for the ones that do exist, there are usually only 1-2 pages per topic, which is where the scraper spends the majority of its running time.

Occasionally I come across topics with 10+ or 50+ pages. But just having a lot of pages is not going to pull down the average number of scraped topics per day unless a topic has hundreds of pages. Due to rate limiting constraints, it can only navigate to about 2.5k pages per day, including the 'Not Found' pages.
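
A heavily stripped-down sketch of that topic-by-topic loop (illustrative only, not the actual scraper: the post extraction is a naive regex, the 'missing topic' check is approximate, and there is no backoff, storage, or UI):
Code:
import re
import time
import requests

POSTS_PER_PAGE = 20   # the forum shows 20 posts per page load

def scrape_topic(tid):
    posts, offset = [], 0
    while True:
        url = f"https://bitcointalk.org/index.php?topic={tid}.{offset}"
        html = requests.get(url, timeout=30).text
        if "board you are looking for" in html:     # rough 'missing topic' check
            return posts                            # deleted/quarantined/nuked
        posts += re.findall(r'<div class="post">(.*?)</div>', html, re.S)
        offset += POSTS_PER_PAGE
        if f"topic={tid}.{offset}" not in html:     # no link to a next page
            return posts
        time.sleep(1)                               # stay near 1 request/second

for tid in range(5503125, 5503135):                 # a range of topic numbers
    print(tid, len(scrape_topic(tid)), "posts scraped")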

Here is a typical scrape in progress (this is only the log file. I actually built a whole user interface around this):

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 02:46:49 AM
#55
My bad guys, I meant to write 100K threads Grin Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month.
Those numbers still don't add up. Less than one post per thread?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 02:28:01 AM
#54
I crossed 100K posts scraped a few days ago.
If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?

My bad guys, I meant to write 100K threads Grin Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month. So I better start processing that archive you gave me.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 02:18:05 AM
#53
I crossed 100K posts scraped a few days ago.
If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
October 17, 2024, 02:02:07 AM
#52
I crossed 100K posts scraped a few days ago.
Wouldn't it take years until you reach the 60 millionth post?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 01:46:08 AM
#51
I crossed 100K [s]posts[/s] threads scraped a few days ago.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 07, 2024, 05:06:38 AM
#50
I can't figure out how one would effectively search by image.
I'd say just search the text:
Code:
[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]
So a search for "Anita calculators" should pop up this post (Ninjastic can't find it), and so should "Anita841_5" or, better, "Anita841". Anyone searching for the word "images" is on their own, and finding the right posts when you search for "assets" is probably going to be a challenge.
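
One practical way to do that is to pull the [img] URLs out of the post body and index the filename pieces as extra searchable text, roughly like this (a sketch; how the tokens are actually indexed is left to the search backend):
Code:
import re

IMG_RE = re.compile(r"\[img\](.*?)\[/img\]", re.I)

def image_tokens(body):
    # Turn image URLs into searchable tokens: "Anita841_5.jpg" becomes
    # "Anita841_5" plus its pieces "Anita841" and "5".
    tokens = []
    for url in IMG_RE.findall(body):
        stem = url.rstrip("/").split("/")[-1].rsplit(".", 1)[0]
        tokens.append(stem)
        tokens += [t for t in re.split(r"[_\-.]", stem) if t]
    return tokens

body = "[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]"
print(image_tokens(body))   # ['Anita841_5', 'Anita841', '5']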
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 05, 2024, 11:31:04 PM
#49
Damn, I just realized that threads like this with only images on them are unscrapeable.

As in, the topic is indexed, but since there is no text, the message contents are all empty.

I can't figure out how one would effectively search by image.

On a side-note, I am approaching Wall Observer-sized threads. Let's see if my scraper can grok them. It can swallow threads with a thousand or so pages, but I've never tested with tens of thousands.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 01, 2024, 10:15:46 AM
#48
In all seriousness though, I'm not in favor of exposing SQL directly, not only because of this:

'Tis a good idea - but beware SQL injection!

But also because it would very quickly become a huge performance issue. Captchas will not protect the database much, as there are bot-solvers you can rent for sats per hour.

At any rate, the data was always going to go into an Elasticsearch cluster, not an SQL database. Searching inside paragraphs with plain SELECT queries is so inefficient that it's barely practical at this scale.
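
For comparison, a full-text query against such a cluster is a single request like the following (the index name "posts" and field name "content" are placeholders; this uses the standard _search endpoint):
Code:
import requests

query = {
    "query": {"match": {"content": "anita calculator"}},   # analyzed full-text match
    "size": 20,
}
resp = requests.post("http://localhost:9200/posts/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])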
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
September 29, 2024, 07:39:43 PM
#47
Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

 Grin

'Tis a good idea - but beware SQL injection!
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
September 28, 2024, 06:16:32 AM
#46
Scraping has resumed

(I am respecting Bitcointalk.org rate limits)
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
September 21, 2024, 03:09:28 PM
#45
Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

 Grin
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
September 19, 2024, 02:26:09 AM
#44
Bump. I just got a new server for the scraper along with a bunch of HTTP proxies.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)

This bold part should solve my problem with endless Cloudflare captchas. Besides, the parser is bound to take breaks once I catch up with the database downloads, and from then on it will be limited to checking for new posts and edits to old posts. I wish there were a mod log that recorded post-edit events.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
August 06, 2024, 05:52:46 PM
#43
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

I use multiple parsers for backup - if one goes down for whatever reason, a second one can take over. 90% of the time my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP. I parse once every ten seconds to check for any new posts, and if there are any, I parse them. My record locking system has a parse delay for many things to prevent it from hitting bct too often. I don't even parse as a logged-in user.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
August 06, 2024, 10:14:09 AM
#42
My scraper was broken by Cloudflare after about 58K posts or so.
If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems whenever Cloudflare goes into full DDoS protection mode.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 28, 2024, 01:28:13 AM
#41
I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

You are still thinking of ONE parser going out pretending to be another parser.  You are fighting against every fraud detection tool out there.  

Create a schedule table in your database. Columns include jobid, lockid, lastjob and parsedelay. When your parser grabs a job, it locks it in the table so the next parser will grab a different job. It releases the lock when it finishes. Your parser can pull the first due record in the schedule, based on (lastjob + parsedelay), where lockid is free.
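
A minimal sketch of that schedule table and the claim/release logic, using sqlite3 purely for illustration (column names follow the ones suggested above; a real multi-machine setup would use a shared database server instead):
Code:
import sqlite3, time, uuid

db = sqlite3.connect("schedule.db", isolation_level=None)   # autocommit
db.execute("""CREATE TABLE IF NOT EXISTS schedule (
    jobid      INTEGER PRIMARY KEY,
    lockid     TEXT,              -- NULL means the job is free
    lastjob    REAL DEFAULT 0,    -- unix time the job last ran
    parsedelay REAL DEFAULT 10    -- seconds to wait between runs of this job
)""")

def claim_job(worker_id):
    # Atomically lock the first free job whose delay has elapsed.
    cur = db.execute(
        """UPDATE schedule SET lockid = ?
           WHERE lockid IS NULL
             AND jobid = (SELECT jobid FROM schedule
                          WHERE lockid IS NULL AND lastjob + parsedelay <= ?
                          ORDER BY lastjob LIMIT 1)""",
        (worker_id, time.time()))
    if cur.rowcount == 0:
        return None
    return db.execute("SELECT jobid FROM schedule WHERE lockid = ?",
                      (worker_id,)).fetchone()[0]

def release_job(jobid):
    db.execute("UPDATE schedule SET lockid = NULL, lastjob = ? WHERE jobid = ?",
               (time.time(), jobid))

worker = str(uuid.uuid4())          # this parser instance's identity
jobid = claim_job(worker)
if jobid is not None:
    # ... parse whatever this job covers ...
    release_job(jobid)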

Edit:  Then go to one of the cloud providers and use a free service to create a second parser.