Bitcointalk Search Project | Bitcointalksearch.org

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Using Zendriver, we can scrape two or three pages per second, without any selenium-enforced delay.

I implemented an exponential backoff for when the forum server returns 503 errors, starting from one second and multiplying by two for each attempt.

With these changes, the scraper has become 10x faster, and continues to make the mission of organizing the forum's posts more feasible.

Once most of the posts are saved, the server can be queried at much longer intervals.

https://streamable.com/npfr6r

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

I managed to parallelize the scraper - with some quirks that I'm in the process of fixing. This new scraper is more reliable than the old one, and doesn't crash as often. It may even be faster too, but I haven't ran any benchmarks yet.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on December 20, 2024, 01:35:14 AM

Quote

I don't know if that's going to be fast or not

Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.

I like this option. No changes to my script required and only a small administrative addition to the server.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on December 20, 2024, 01:31:05 AM

I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.

Before my dedicated server disappeared, I used to mount it's directories on another server through sshfs. That worked fine and didn't disconnect as long as both servers were running.

Quote

I don't know if that's going to be fast or not

Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

I have a new VPS now, but I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.

Currently I don't exactly have that implemented yet but it wouldn't hurt to try. I don't know if that's going to be fast or not, bu it's just one file per second and both of the servers have gigabit lines.

The scraper hasn't been running for a while, so now might be a good time for me to try it.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: TryNinja on December 01, 2024, 09:27:58 AM

Of course broken quotes can't be easily treated, maybe with AI which is too out of my league.

I wrote one of the first PPC search engines after goto/google, (sold it to a US corp in 2000) and it was tough back then just to determine the actual keywords. This idea is a lot more difficult due to multiple sources in the same post and all the modern language sets. I'm giving up on building any kind of search engine on my data, other than the basic SQL queries. I'll use my data for something interesting and useful that has not been done yet.

TryNinja

legendary

Activity: 2758

Merit: 6830

Quote from: NotATether on December 01, 2024, 08:41:25 AM

Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.

A second run seems a waste of time and resources.

For sure there should be some untested cases, but I've been getting good results with this:

Code:

type PostContent = {
raw_content: string;
content: string;
quoted_users: string[];
quotes: string[];
};

function extractPostContent(html: string): PostContent {
const $ = load(html);

const result: PostContent = {
raw_content: html,
content: '',
quoted_users: [],
quotes: []
};

function extractTextContent(element: cheerio.Cheerio): string {
return element
.clone()
.children('br')
.each((_, el) => {
$(el).replaceWith(' ');
})
.end()
.children('.quoteheader')
.each((_, el) => {
if ($(el).children('a').length > 0) {
$(el.next).remove();
}
$(el).text(' ');
})
.end()
.text()
.trim();
}

function processQuote(element: cheerio.Cheerio) {
const quoteHeader = element.prev('.quoteheader');
if (quoteHeader.length) {
const userMatch = quoteHeader.text().match(/Quote from: (.+?) on/);
if (userMatch) {
result.quoted_users.push(userMatch[1]);
}
}

const quoteContent = extractTextContent(element);
if (quoteContent) {
result.quotes.push(quoteContent);
}

element.find('> .quote').each((_, nestedQuote) => {
processQuote($(nestedQuote));
});
}

$('.quote').each((_, quote) => {
if ($(quote).parent().hasClass('quote') || $(quote).prev('.quoteheader').children('a').length === 0) return;
processQuote($(quote));
});

result.content = extractTextContent($('body'));

$('.quoteheader').each((_, element) => {
if ($(element).children('a').length > 0) {
const elementText = $(element.next).text();
result.content = result.content.replace(elementText, '');
}
});

result.content = result.content.trim();
result.quoted_users = [...new Set(result.quoted_users)];

return result;
}

Of course broken quotes can't be easily treated, maybe with AI which is too out of my league.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: Vod on November 30, 2024, 01:15:54 AM

OP - have you developed a way to logically parse through each post and assign quotes to the proper person? Because it's open input, people can modify it to anything, so a well organized system is necessary. It's bugging the heck out of me, been working on it for two days now. :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share. Grin

Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

OP - have you developed a way to logically parse through each post and assign quotes to the proper person? Because it's open input, people can modify it to anything, so a well organized system is necessary. It's bugging the heck out of me, been working on it for two days now. :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share. Grin

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on November 24, 2024, 10:51:05 PM

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

https://aws.amazon.com/about-aws/whats-new/2023/10/new-amazon-ec2-bare-metal-instances/

Again, AWS gives $300 - $1300 to new accounts.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on November 24, 2024, 10:51:05 PM

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

I checked the 3 providers where I have accounts (RamNode, HostVDS and RackNerd), and none of them have Arch Linux as a default choice. If you don't mind me asking: why Arch Linux? Would uploading your own ISO be an option? I've never tried it, but RamNode supports it.

BlackHatCoiner

legendary

Activity: 1512

Merit: 7340

Farewell, Leo

Quote from: NotATether on November 24, 2024, 10:51:05 PM

Welp, it looks like my IP was rate-limited immediately after, according to logs. Time to spin up a new server

Do you mean that your IP is blocked from Cloudflare? If that's so, why don't you use a VPN?

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

My scraper managed to crawl half of the Wall Observer topic before finally being defeated by Cloudflare. It ingested the pages in a bit over a day.

Still, an impressive achievement, considering this is the longest topic on the the website, by a wide margin.

I'm going to add the capability of checking for new messages on a thread sooner or later.

Welp, it looks like my IP was rate-limited immediately after, according to logs. Time to spin up a new server

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on October 31, 2024, 03:13:40 AM

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

AWS will give you a free EC2 instance to run your parser on. Personally, I run two parsers on two instances (Europe and America) that access a central database controlling the frequency. Costs me about $5/month.

If you don't need to login, you could use your AWS $300 credit to make thirty parsers, each hitting the forum once per second. Do that for a month and you can reduce that to the free tier to stay up to date.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on October 17, 2024, 09:31:43 AM

Quote from: NotATether on October 17, 2024, 03:44:56 AM

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.

After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.

I have no idea to be honest. But I don't want to try to diagnose this, as debugging these kind of performance issues tend to be very non-reproducible and frustrating.

Quote from: seoincorporation on October 21, 2024, 08:44:53 AM

Well, that explains why it didn't work the first time that i made the search, it didn't show any result, but after looking the same terms in other browsers then google decided to show the search result. And that's why i put the result with other search engines, maybe those are the right tools for a search engine.

We could write a tool that calls those engines' API, and make the same search in all those engines giving back the forum links. That would be a cool tool.

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

seoincorporation

legendary

Activity: 3388

Merit: 3154

Quote from: LoyceV on October 17, 2024, 09:31:43 AM

Quote from: seoincorporation on October 17, 2024, 08:40:13 AM

GOOGLE:

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

Well, that explains why it didn't work the first time that i made the search, it didn't show any result, but after looking the same terms in other browsers then google decided to show the search result. And that's why i put the result with other search engines, maybe those are the right tools for a search engine.

We could write a tool that calls those engines' API, and make the same search in all those engines giving back the forum links. That would be a cool tool.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: seoincorporation on October 17, 2024, 08:40:13 AM

GOOGLE:

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

Quote from: NotATether on October 17, 2024, 03:44:56 AM

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.

After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.

seoincorporation

legendary

Activity: 3388

Merit: 3154

It has been months since you started this project mate, and after reading the threads i ask my self about the approach... Do we really need to download the full forum to have a good searching tool?

I don't think so, we can use the search engines online with the right commands to search for the right thread, let me show you how:

GOOGLE:

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

YAHOO:

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

DuckDuckGo

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Yandex

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Bing

Code:

site:bitcointalk.org intitle:"x330,000" "seoincorporation"

If we know how to use the search engines we should be able to find anything... Let me share some codes for your search:

Quote

Quotation Marks    Used to search for an exact phrase or sequence of words
Minus Sign (-)    Excludes specific words or phrases from search results
Asterisk (*)    Acts as a wildcard to represent any word or phrase in a search
Double Dots (..)    Used for number range searches
Site:    Restricts search results to a specific site or domain
Define:    Provides definitions of terms
Filetype:    Filters results by specific file type
Related:    Displays sites similar to the specified web page
Cache:    Shows the cached version of a web page
Link:    Finds pages that link to the specified URL
Inurl:    Searches for terms in the URL of web pages
Allinurl:    Searches for all terms in the URL of web pages
Intitle:    Searches for terms in the title of web pages
Allintitle:    Searches for all terms in the title of web pages
Intext:    Searches for terms in the body of web pages
Time:    Shows current time in various locations
Weather:    Shows weather conditions and forecasts for a location
Stocks:    Shows stock information
Info:    Displays some information that Google has about a web page
Book:    Find information about books
Phonebook:    Finds phone numbers
Movie:    Find information about movies
Area code:    Searches for the area code of a location
Currency:    Converts one currency to another
~    Used to include synonyms or similar terms in a search
AROUND(X)    Searches for words within X words of each other
City1 City2    Searches for pages containing both cities
Author:    Searches for content by a specific author
Source:    Finds news articles from a specific source
Map:    Shows maps related to the search query
Daterange:    Searches within a specific date range
Safesearch:    Filters out explicit content from search results
Music:    Find music information
Patent:    Searches for patents
Clinical trials:    Finds information on clinical trials

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Suggestion: use my download (from years ago), and keep the posts that have been changed or removed. It would be nice if you can make it optional to search only the most recent version, or also older version of all posts.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).

So maybe I might introduce some sort of tagging system and just manually add tags to some posts (yes, posts not threads) so users can at least find what they are looking for properly.

Topic: Bitcointalk Search Project (Read 1032 times)