Bitcointalk Search Project - page 2.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).

So maybe I might introduce some sort of tagging system and just manually add tags to some posts (yes, posts not threads) so users can at least find what they are looking for properly.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on October 17, 2024, 03:44:56 AM

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.

When I click all, I have to click "Verify" from Cloudflare. That indeed means they're "tougher" than usual.

Quote

Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (If I ever do get rate limited, there is an exponential backoff in place), however in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.

I just tried this oneliner:

Code:

i=5503125; time while test $i -le 5503134; do wget "https://bitcointalk.org/index.php?topic=$i.0" -O $i.html; sleep 1; i=$((i+1)); done

It took 13 seconds on my whitelisted server, and 12 seconds on a server that isn't whitelisted. Both include 9 seconds "sleep".

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on October 17, 2024, 03:28:12 AM

It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.

Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (If I ever do get rate limited, there is an exponential backoff in place), however in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on October 17, 2024, 03:11:29 AM

Due to rate limiting constraints, it can only navigate to about 2.5k

It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?
At 70k threads per month, it's going to take you 6.5 years to scrape all 5.5 million threads. I remember it took me about 4 months scraping all the old posts, and that was only up to 2019. If you go from 10 to 1 second per request, it still takes 6 months.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on October 17, 2024, 02:46:49 AM

Those numbers still don't add up. Less than one post per thread?

Sorry, I keep instinctively writing 'post' instead of 'thread'. Just %s/post/thread/g if you know what I mean

So basically, my scraper is built in terms of topics. It takes a range of topic numbers to scrape, and works through each of them one by one, continuously clicking on the Next Page button in the process until it's on the last page.

Like this, I scan scrape 20 posts at once per page load. Though I really wish I could display more of them - that would make the process considerably faster.

About 1/3 or so of the topics I scraped don't exist, are deleted, quarantined, or nuked for whatever reason so the parser runs very fast in those cases. But for the ones that do exist, usually there are 1-2 pages only per topic, which is what the majority of running time the scraper is spent on.

Occasionally I come across 10+ or 50+ page topics. But just having a lot of pages is not going to pull the average number of scraped topics/day down unless there's hundreds of pages in the topic. Due to rate limiting constraints, it can only navigate to about 2.5k pages per day. Including the 'Not Found' pages as well.

Here is a typical scrape in progress (this is only the log file. I actually built a whole user interface around this):

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on October 17, 2024, 02:28:01 AM

My bad guys, I meant to write 100K threads. Grin

Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month.

Those numbers still don't add up. Less than one post per thread?

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on October 17, 2024, 02:18:05 AM

Quote from: NotATether on October 17, 2024, 01:46:08 AM

I crossed 100K posts scraped a few days ago.

If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?

My bad guys, I meant to write 100K threads. Grin

Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month. So I better start processing that archive you gave me.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on October 17, 2024, 01:46:08 AM

I crossed 100K posts scraped a few days ago.

If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?

BlackHatCoiner

legendary

Activity: 1512

Merit: 7340

Farewell, Leo

Quote from: NotATether on October 17, 2024, 01:46:08 AM

I crossed 100K posts scraped a few days ago.

Wouldn't it take years until you reach the 60 millionth posts?

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

I crossed 100K ~~posts~~ threads scraped a few days ago.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on October 05, 2024, 11:31:04 PM

I can't figure out how one would effectively make a search by an image.

Quote from: Phinnaeus Gage on May 20, 2012, 08:59:19 AM

I'd say just search the text:

Code:

[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]

So a search for "Anita calculators" should pop up this post (Ninjastic can't find it), but also "Anita841_5" or better "Anita841". Anyone searching for the word "images" is on his own, but finding the right posts when you search for "assets" is probably going to be a challenge.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Damn, I just realized that threads like this with only images on them are unscrapeable.

As in, the topic is indexed, but since there is no text, the message contents are all empty.

I can't figure out how one would effectively make a search by an image.

On a side-note, I am approaching Wall Observer-sized threads. Let's see if my scraper can grok them. It can swallow threads with a thousand or so pages, but I've never tested with tens of thousands.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

In all seriousness though, I'm not in favor of exposing SQL directly, not only because of this:

Quote from: Vod on September 29, 2024, 07:39:43 PM

'Tis a good idea - but beware SQL injection!

But also because it would be a huge performance issue very fast. Captchas will not protect the database much, as there are bot-solvers you can rent for sats per hour.

At any rate, the data was always going to be placed in an Elasticsearch cluster, not an SQL database. Searching inside paragraphs using SELECT is so difficult that it might not even be possible.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: BlackHatCoiner on September 21, 2024, 03:09:28 PM

Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:

CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

'Tis a good idea - but beware SQL injection!

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Scraping has resumed

(I am respecting Bitcointalk.org rate limits)

BlackHatCoiner

legendary

Activity: 1512

Merit: 7340

Farewell, Leo

Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:

CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Bump. I just got a new server for the scraper along with a bunch of HTTP proxies.

Quote from: LoyceV on August 06, 2024, 10:14:09 AM

Quote from: Vod on July 20, 2024, 01:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

Quote from: theymos on February 12, 2015, 05:03:56 PM

The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)

This bold part should solve my problem with endless Cloudflare captchas. Besides, the parser is bound to take breaks once I catch up with the database downloads and it is limited to checking for new posts and edits to old posts. I wish there was a mod log that contained edited post events.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: LoyceV on August 06, 2024, 10:14:09 AM

Quote from: Vod on July 20, 2024, 01:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

I use multiple parsers for backup - if one goes down for whatever reason a second one can take over. 90% of the time, my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP. I parse once every ten seconds to check for any new posts, and if any I parse them. My record locking system has a parse delay for many things to prevent it from hitting bct too often. I don't even parse as a logged in user.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on July 28, 2024, 01:18:15 AM

My scraper was broken by Cloudflare after about 58K posts or so.

If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems when Cloudflare goes in full DDoS protection mode.

Quote from: Vod on July 20, 2024, 01:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

Quote from: theymos on February 12, 2015, 05:03:56 PM

The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on July 28, 2024, 01:22:18 AM

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

You are still thinking of ONE parser going out pretending to be another parser. You are fighting against every fraud detection tool out there.

Create a schedule table in your database. Columns include jobid, lockid, lastjob and parsedelay. When your parser grabs a job, it locks it in the table so the next parser will grab a different job. It releases the lock when it finishes. Your parser can call the first record in the schedule based on (lastjob+parsedelay) where lockid is free.

Edit: Then go to one of the cloud providers and use a free service to create a second parser.

Topic: Bitcointalk Search Project - page 2. (Read 1110 times)