
Topic: Bitcointalk Search Project (Read 774 times)

Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
October 31, 2024, 03:35:04 AM
#66
Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

AWS will give you a free EC2 instance to run your parser on. Personally, I run two parsers on two instances (Europe and America) that access a central database controlling the frequency. Costs me about $5/month.

If you don't need to log in, you could use your AWS $300 credit to make thirty parsers, each hitting the forum once per second. Do that for a month and you can reduce that to the free tier to stay up to date.

Smiley
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 31, 2024, 03:13:40 AM
#65
See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.

I have no idea to be honest. But I don't want to try to diagnose this, as these kinds of performance issues tend to be very non-reproducible and frustrating to debug.

Well, that explains why it didn't work the first time I ran the search: it didn't show any results, but after I looked up the same terms in other browsers, Google decided to show the search result. And that's why I included the results from other search engines; maybe those are the right tools for this kind of search.

We could write a tool that calls those engines' APIs and runs the same search on all of them, returning the forum links. That would be a cool tool.

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.
legendary
Activity: 3346
Merit: 3125
October 21, 2024, 08:44:53 AM
#64
GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

Well, that explains why it didn't work the first time I ran the search: it didn't show any results, but after I looked up the same terms in other browsers, Google decided to show the search result. And that's why I included the results from other search engines; maybe those are the right tools for this kind of search.

We could write a tool that calls those engines' APIs and runs the same search on all of them, returning the forum links. That would be a cool tool.
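
A rough Python sketch of where such a tool could start. It only builds the search URL for each engine from the same query; it does not fetch or parse any results. The endpoints and query parameter names are the engines' ordinary web search pages, not official APIs, and are assumptions here, as are the function names.

```python
from urllib.parse import urlencode

# Base URL and query-string parameter for each engine's web search page.
# These are assumptions based on the public search pages, not official APIs.
ENGINES = {
    "google": ("https://www.google.com/search", "q"),
    "yahoo": ("https://search.yahoo.com/search", "p"),
    "duckduckgo": ("https://duckduckgo.com/", "q"),
    "yandex": ("https://yandex.com/search/", "text"),
    "bing": ("https://www.bing.com/search", "q"),
}


def build_query(title: str, author: str, site: str = "bitcointalk.org") -> str:
    """Compose the site:/intitle: query quoted in this thread."""
    return f'site:{site} intitle:"{title}" "{author}"'


def search_urls(query: str) -> dict[str, str]:
    """Return one URL-encoded search link per engine for the same query."""
    return {
        name: f"{base}?{urlencode({param: query})}"
        for name, (base, param) in ENGINES.items()
    }


if __name__ == "__main__":
    q = build_query("x330,000", "seoincorporation")
    for name, url in search_urls(q).items():
        print(f"{name:12} {url}")
```

Actually requesting these URLs from a script is likely to run into the anti-bot measures discussed elsewhere in the thread, so the sketch stops at generating links a human can click.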
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 09:31:43 AM
#63
GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.
legendary
Activity: 3346
Merit: 3125
October 17, 2024, 08:40:13 AM
#62
It has been months since you started this project, mate, and after reading the thread I ask myself about the approach... Do we really need to download the full forum to have a good search tool?

I don't think so, we can use the search engines online with the right commands to search for the right thread, let me show you how:

GOOGLE:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

YAHOO:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

DuckDuckGo

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Yandex

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Bing

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"


If we know how to use the search engines we should be able to find anything... Let me share some operators for your search:

Quote
Quotation Marks    Used to search for an exact phrase or sequence of words
Minus Sign (-)    Excludes specific words or phrases from search results
Asterisk (*)    Acts as a wildcard to represent any word or phrase in a search
Double Dots (..)    Used for number range searches
Site:    Restricts search results to a specific site or domain
Define:    Provides definitions of terms
Filetype:    Filters results by specific file type
Related:    Displays sites similar to the specified web page
Cache:    Shows the cached version of a web page
Link:    Finds pages that link to the specified URL
Inurl:    Searches for terms in the URL of web pages
Allinurl:    Searches for all terms in the URL of web pages
Intitle:    Searches for terms in the title of web pages
Allintitle:    Searches for all terms in the title of web pages
Intext:    Searches for terms in the body of web pages
Time:    Shows current time in various locations
Weather:    Shows weather conditions and forecasts for a location
Stocks:    Shows stock information
Info:    Displays some information that Google has about a web page
Book:    Find information about books
Phonebook:    Finds phone numbers
Movie:    Find information about movies
Area code:    Searches for the area code of a location
Currency:    Converts one currency to another
~    Used to include synonyms or similar terms in a search
AROUND(X)    Searches for words within X words of each other
City1 City2    Searches for pages containing both cities
Author:    Searches for content by a specific author
Source:    Finds news articles from a specific source
Map:    Shows maps related to the search query
Daterange:    Searches within a specific date range
Safesearch:    Filters out explicit content from search results
Music:    Find music information
Patent:    Searches for patents
Clinical trials:    Finds information on clinical trials

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 04:54:27 AM
#61
Suggestion: use my download (from years ago), and keep the posts that have been changed or removed. It would be nice if you could make it optional to search only the most recent version, or also older versions, of all posts.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 04:12:13 AM
#60
And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).

So maybe I might introduce some sort of tagging system and just manually add tags to some posts (yes, posts not threads) so users can at least find what they are looking for properly.
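
One possible shape for such a tagging system, as a minimal Python sketch. The class name, method names, and the in-memory storage are all hypothetical; a real version would presumably live next to the search index.

```python
from collections import defaultdict


class TagIndex:
    """Maps manually assigned tags to post IDs (posts, not threads)."""

    def __init__(self) -> None:
        self._posts_by_tag: dict[str, set[int]] = defaultdict(set)

    def tag(self, post_id: int, *tags: str) -> None:
        """Attach one or more tags to a post; tags are case-insensitive."""
        for name in tags:
            self._posts_by_tag[name.strip().lower()].add(post_id)

    def find(self, tag: str) -> set[int]:
        """Return a copy of the post IDs carrying the given tag."""
        # .get avoids creating empty entries for unknown tags.
        return set(self._posts_by_tag.get(tag.strip().lower(), set()))


if __name__ == "__main__":
    index = TagIndex()
    index.tag(101, "ExampleMixer", "mixer")  # hypothetical post ID and tags
    print(index.find("examplemixer"))
```

Searching by the original service name would then hit the tag even though the post text now reads "[banned mixer]".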
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:56:12 AM
#59
See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
When I click all, I have to click "Verify" from Cloudflare. That indeed means they're "tougher" than usual.

Quote
Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (If I ever do get rate limited, there is an exponential backoff in place), however in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
I just tried this oneliner:
Code:
i=5503125; time while test $i -le 5503134; do wget "https://bitcointalk.org/index.php?topic=$i.0" -O $i.html; sleep 1; i=$((i+1)); done
It took 13 seconds on my whitelisted server, and 12 seconds on a server that isn't whitelisted. Both include 10 seconds of "sleep" (one second after each of the 10 requests).
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:44:56 AM
#58
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.

Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (If I ever do get rate limited, there is an exponential backoff in place), however in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
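
For readers unfamiliar with the term, here is a minimal Python sketch of exponential backoff. It is not the scraper's actual code. The `fetch` callable is hypothetical and just returns an HTTP status code, and the sketch assumes the rate-limit response arrives as HTTP 429.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() until it returns a status other than 429.

    fetch is a hypothetical callable returning an HTTP status code.
    The wait doubles after every rate-limited attempt, capped at max_delay.
    """
    for attempt in range(max_retries + 1):
        status = fetch()
        if status != 429:
            return status
        if attempt == max_retries:
            break
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Up to 10% jitter keeps several scrapers from retrying in lockstep.
        sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("still rate limited after retries")
```

The `sleep` parameter exists only so the waits can be replaced in tests; in normal use the default `time.sleep` applies.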
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:28:12 AM
#57
Due to rate limiting constraints, it can only navigate to about 2.5k
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?
At 70k threads per month, it's going to take you 6.5 years to scrape all 5.5 million threads. I remember it took me about 4 months scraping all the old posts, and that was only up to 2019. If you go from 10 to 1 second per request, it still takes 6 months.
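
The estimate can be checked with a few lines of Python. The two input figures come from this post; the assumption that a 10x shorter delay gives a 10x higher thread rate is a simplification and ignores server response time.

```python
TOTAL_THREADS = 5_500_000   # figure quoted in the post
THREADS_PER_MONTH = 70_000  # current scrape rate quoted in the post

months = TOTAL_THREADS / THREADS_PER_MONTH
years = months / 12
print(f"{months:.1f} months, about {years:.1f} years")

# Naive assumption: a 10x shorter delay gives a 10x higher thread rate.
months_at_1s = months / 10
print(f"{months_at_1s:.1f} months at one request per second")
```

Under that naive scaling the result lands somewhat above the "6 months" ballpark, but in the same range.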
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:11:29 AM
#56
Those numbers still don't add up. Less than one post per thread?

Sorry, I keep instinctively writing 'post' instead of 'thread'. Just %s/post/thread/g if you know what I mean Smiley

So basically, my scraper is built in terms of topics. It takes a range of topic numbers to scrape, and works through each of them one by one, continuously clicking on the Next Page button in the process until it's on the last page.

Like this, I can scrape 20 posts at once per page load. Though I really wish I could display more of them - that would make the process considerably faster.

About 1/3 or so of the topics I scraped don't exist, are deleted, quarantined, or nuked for whatever reason, so the parser runs very fast in those cases. But the ones that do exist usually have only 1-2 pages per topic, and that is where the scraper spends the majority of its running time.

Occasionally I come across 10+ or 50+ page topics. But just having a lot of pages is not going to pull the average number of scraped topics/day down unless there's hundreds of pages in the topic. Due to rate limiting constraints, it can only navigate to about 2.5k pages per day. Including the 'Not Found' pages as well.
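
To make the page arithmetic concrete, here is a small Python sketch. It is not how the scraper described above works (that one follows the Next Page button); instead it assumes the number after the dot in the topic URL is a post offset that advances by 20 per page, and that the post count is known in advance. Both function names are hypothetical.

```python
from math import ceil

POSTS_PER_PAGE = 20  # posts per page load, as stated above


def topic_page_urls(topic_id: int, post_count: int) -> list[str]:
    """List page URLs for one topic, assuming the offset after the dot
    advances by POSTS_PER_PAGE (topic=ID.0, ID.20, ID.40, ...)."""
    pages = max(1, ceil(post_count / POSTS_PER_PAGE))
    return [
        f"https://bitcointalk.org/index.php?topic={topic_id}.{page * POSTS_PER_PAGE}"
        for page in range(pages)
    ]


def days_to_scrape(total_pages: int, pages_per_day: int = 2500) -> float:
    """Days needed at the roughly 2.5k page loads per day quoted above."""
    return total_pages / pages_per_day


if __name__ == "__main__":
    for url in topic_page_urls(5503125, 45):
        print(url)
    print(days_to_scrape(10_000))
```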

Here is a typical scrape in progress (this is only the log file. I actually built a whole user interface around this):

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 02:46:49 AM
#55
My bad guys, I meant to write 100K threads. Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month.
Those numbers still don't add up. Less than one post per thread?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 02:28:01 AM
#54
I crossed 100K posts scraped a few days ago.
If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?

My bad guys, I meant to write 100K threads. Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month. So I better start processing that archive you gave me.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 02:18:05 AM
#53
I crossed 100K posts scraped a few days ago.
If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
October 17, 2024, 02:02:07 AM
#52
I crossed 100K posts scraped a few days ago.
Wouldn't it take years until you reach the 60 millionth post?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 01:46:08 AM
#51
I crossed 100K threads scraped a few days ago.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 07, 2024, 05:06:38 AM
#50
I can't figure out how one would effectively make a search by an image.
I'd say just search the text:
Code:
[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]
So a search for "Anita calculators" should pop up this post (Ninjastic can't find it), but also "Anita841_5" or better "Anita841". Anyone searching for the word "images" is on his own, but finding the right posts when you search for "assets" is probably going to be a challenge.
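
A minimal Python sketch of that idea: pull the URL out of the `[img]` tag and split it into searchable terms. The function name and the splitting rules are assumptions for illustration, not part of any existing indexer.

```python
import re
from urllib.parse import urlparse


def image_search_terms(bbcode: str) -> set[str]:
    """Extract lowercase search terms from [img]...[/img] URLs."""
    terms: set[str] = set()
    for url in re.findall(r"\[img\](.*?)\[/img\]", bbcode, flags=re.IGNORECASE):
        parsed = urlparse(url)
        # Split host and path on anything that is not a letter, digit or underscore.
        for token in re.split(r"[^A-Za-z0-9_]+", parsed.netloc + parsed.path):
            if not token:
                continue
            terms.add(token.lower())
            # Also index the pieces around underscores and the letter/digit runs,
            # so "Anita841_5" is findable as "anita841" and "anita".
            for part in token.split("_"):
                terms.add(part.lower())
                terms.update(p.lower() for p in re.findall(r"[A-Za-z]+|\d+", part))
    terms.discard("")
    return terms


if __name__ == "__main__":
    post = "[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]"
    print(sorted(image_search_terms(post)))
```

Common path words such as "assets" and "images" end up in the term set too, which is exactly the noise problem mentioned above; a stop-word list would be one way to deal with it.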
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 05, 2024, 11:31:04 PM
#49
Damn, I just realized that threads like this with only images on them are unscrapeable.

As in, the topic is indexed, but since there is no text, the message contents are all empty.

I can't figure out how one would effectively make a search by an image.

On a side-note, I am approaching Wall Observer-sized threads. Let's see if my scraper can grok them. It can swallow threads with a thousand or so pages, but I've never tested with tens of thousands.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 01, 2024, 10:15:46 AM
#48
In all seriousness though, I'm not in favor of exposing SQL directly, not only because of this:

'Tis a good idea - but beware SQL injection!

But also because it would be a huge performance issue very fast. Captchas will not protect the database much, as there are bot-solvers you can rent for sats per hour.

At any rate, the data was always going to be placed in an Elasticsearch cluster, not an SQL database. Searching inside paragraphs using SELECT is so difficult that it might not even be possible.
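
For comparison, a phrase search in Elasticsearch is a short JSON body. The Python sketch below only builds that body as a dict and does not talk to a cluster. The field names "body" and "author" are hypothetical, since the actual index mapping is not described here.

```python
import json
from typing import Optional


def phrase_query(text: str, author: Optional[str] = None, size: int = 20) -> dict:
    """Build an Elasticsearch query DSL body for an exact-phrase search.

    The field names "body" and "author" are hypothetical.
    """
    must = [{"match_phrase": {"body": text}}]
    # An exact-value filter narrows results without affecting scoring.
    filters = [{"term": {"author": author}}] if author else []
    return {
        "size": size,
        "query": {"bool": {"must": must, "filter": filters}},
    }


if __name__ == "__main__":
    print(json.dumps(phrase_query("exponential backoff", author="example_user"), indent=2))
```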
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
September 29, 2024, 07:39:43 PM
#47
Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

 Grin

'Tis a good idea - but beware SQL injection!
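
If only the basic search UI touches the database, the usual defence is to bind user input as parameters rather than pasting it into the SQL string. A minimal Python sketch using the standard-library sqlite3 module; the table layout and sample rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical posts table and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts (author, body) VALUES (?, ?)",
    [("alice", "first post"), ("bob", "second post")],
)


def posts_by_author(connection: sqlite3.Connection, author: str) -> list[tuple]:
    """Look up posts with a bound parameter instead of string formatting."""
    cursor = connection.execute(
        "SELECT id, body FROM posts WHERE author = ? ORDER BY id", (author,)
    )
    return cursor.fetchall()


# A classic injection string is treated as a literal author name and matches nothing.
print(posts_by_author(conn, "alice"))
print(posts_by_author(conn, "' OR '1'='1"))
```

This protects the canned queries only; letting users type arbitrary SELECT statements is a different risk that bound parameters do not address.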