
Topic: Bitcointalk Search Project (Read 974 times)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
December 20, 2024, 01:53:41 AM
#77
Quote
I don't know if that's going to be fast or not
Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.

I like this option. No changes to my script required and only a small administrative addition to the server.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
December 20, 2024, 01:35:14 AM
#76
I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.
Before my dedicated server disappeared, I used to mount its directories on another server through sshfs. That worked fine and didn't disconnect as long as both servers were running.

Quote
I don't know if that's going to be fast or not
Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
December 20, 2024, 01:31:05 AM
#75
I have a new VPS now, but I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.

Currently I don't have that implemented yet, but it wouldn't hurt to try. I don't know if that's going to be fast or not, but it's just one file per second and both servers have gigabit lines.

The scraper hasn't been running for a while, so now might be a good time for me to try it.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
December 01, 2024, 03:58:30 PM
#74
Of course, broken quotes can't be handled easily; maybe with AI, which is out of my league.

I wrote one of the first PPC search engines after goto/google (sold it to a US corp in 2000), and it was tough back then just to determine the actual keywords. This idea is a lot more difficult due to multiple sources in the same post and all the modern language sets. I'm giving up on building any kind of search engine on my data, other than the basic SQL queries. I'll use my data for something interesting and useful that has not been done yet. Smiley
legendary
Activity: 2758
Merit: 6830
December 01, 2024, 09:27:58 AM
#73
Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.
A second run seems like a waste of time and resources.

There are surely some untested cases, but I've been getting good results with this:

Code:
// Assumed imports: the function below uses cheerio's load(); the exact type import
// may differ depending on the cheerio version in use.
import * as cheerio from 'cheerio';
import { load } from 'cheerio';

type PostContent = {
  raw_content: string;
  content: string;
  quoted_users: string[];
  quotes: string[];
};

function extractPostContent(html: string): PostContent {
  const $ = load(html);

  const result: PostContent = {
    raw_content: html,
    content: '',
    quoted_users: [],
    quotes: []
  };

  // Returns the visible text of an element, turning <br> into spaces and stripping
  // quote headers together with the quote blocks they introduce.
  function extractTextContent(element: cheerio.Cheerio): string {
    return element
      .clone()
      .children('br')
      .each((_, el) => {
        $(el).replaceWith(' ');
      })
      .end()
      .children('.quoteheader')
      .each((_, el) => {
        if ($(el).children('a').length > 0) {
          // A linked header means the next sibling is the quoted block: drop it too.
          $(el.next).remove();
        }
        $(el).text(' ');
      })
      .end()
      .text()
      .trim();
  }

  // Records the quoted user (from the "Quote from: ... on" header) and the quote text,
  // then recurses into quotes nested inside this one.
  function processQuote(element: cheerio.Cheerio) {
    const quoteHeader = element.prev('.quoteheader');
    if (quoteHeader.length) {
      const userMatch = quoteHeader.text().match(/Quote from: (.+?) on/);
      if (userMatch) {
        result.quoted_users.push(userMatch[1]);
      }
    }

    const quoteContent = extractTextContent(element);
    if (quoteContent) {
      result.quotes.push(quoteContent);
    }

    element.find('> .quote').each((_, nestedQuote) => {
      processQuote($(nestedQuote));
    });
  }

  // Only process top-level quotes that have a linked quote header.
  $('.quote').each((_, quote) => {
    if ($(quote).parent().hasClass('quote') || $(quote).prev('.quoteheader').children('a').length === 0) return;
    processQuote($(quote));
  });

  // Post text with the quoted passages removed.
  result.content = extractTextContent($('body'));

  $('.quoteheader').each((_, element) => {
    if ($(element).children('a').length > 0) {
      const elementText = $(element.next).text();
      result.content = result.content.replace(elementText, '');
    }
  });

  result.content = result.content.trim();
  result.quoted_users = [...new Set(result.quoted_users)];

  return result;
}

Of course, broken quotes can't be handled easily; maybe with AI, which is out of my league.
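
For reference, a minimal sketch of how the function above could be driven (the topic URL, the div.post selector for SMF message bodies, and Node 18+ global fetch are assumptions, not part of the parser itself):

Code:
// Usage sketch only: fetch one topic page and run extractPostContent() on each message.
// Assumptions: Node 18+ (global fetch), cheerio installed, and that each message body
// on the page is a div with class "post" (SMF markup).
import { load } from 'cheerio';

async function example() {
  const res = await fetch('https://bitcointalk.org/index.php?topic=5503125.0');
  const $page = load(await res.text());

  $page('div.post').each((_, el) => {
    const post = extractPostContent($page.html($page(el)));
    console.log(post.quoted_users, post.quotes.length, post.content.slice(0, 80));
  });
}

example().catch(console.error);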
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
December 01, 2024, 08:41:25 AM
#72
OP - have you developed a way to logically parse through each post and assign quotes to the proper person? Because it's open input, people can modify it to anything, so a well-organized system is necessary. It's been bugging the heck out of me; I've been working on it for two days now. :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share.   Grin

Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
November 30, 2024, 01:15:54 AM
#71
OP - have you developed a way to logically parse through each post and assign quotes to the proper person? Because it's open input, people can modify it to anything, so a well-organized system is necessary. It's been bugging the heck out of me; I've been working on it for two days now. :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share.   Grin
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
November 26, 2024, 08:57:52 PM
#70
Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

https://aws.amazon.com/about-aws/whats-new/2023/10/new-amazon-ec2-bare-metal-instances/

Again, AWS gives $300 - $1300 to new accounts.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
November 25, 2024, 05:05:29 AM
#69
Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.
I checked the 3 providers where I have accounts (RamNode, HostVDS and RackNerd), and none of them have Arch Linux as a default choice. If you don't mind me asking: why Arch Linux? Would uploading your own ISO be an option? I've never tried it, but RamNode supports it.
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
November 25, 2024, 04:28:10 AM
#68
Welp, it looks like my IP was rate-limited immediately after, according to logs. Time to spin up a new server Smiley
Do you mean that your IP is blocked by Cloudflare? If so, why don't you use a VPN?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
November 24, 2024, 10:51:05 PM
#67
My scraper managed to crawl half of the Wall Observer topic before finally being defeated by Cloudflare. It ingested the pages in a bit over a day.

Still, an impressive achievement, considering this is the longest topic on the website, by a wide margin.

I'm going to add the capability of checking for new messages on a thread sooner or later.
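
A rough sketch of how that check could work, assuming the scraper already knows how many posts of a topic it has stored (the 20-posts-per-page constant and the div.post selector are assumptions):

Code:
// Sketch only: re-fetch the page containing the last ingested post of a topic and
// see whether new posts have appeared there. POSTS_PER_PAGE and 'div.post' are assumptions.
import { load } from 'cheerio';

const POSTS_PER_PAGE = 20; // forum default page size (assumed)

async function hasNewMessages(topicId: number, storedPostCount: number): Promise<boolean> {
  // Offset of the page that holds the last post we already have.
  const offset = Math.floor(Math.max(storedPostCount - 1, 0) / POSTS_PER_PAGE) * POSTS_PER_PAGE;
  const res = await fetch(`https://bitcointalk.org/index.php?topic=${topicId}.${offset}`);
  const $ = load(await res.text());

  const postsOnPage = $('div.post').length;     // messages visible on that page
  const alreadySeen = storedPostCount - offset; // how many of them are already stored
  return postsOnPage > alreadySeen;
}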



Welp, it looks like my IP was rate-limited immediately after, according to logs. Time to spin up a new server Smiley

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
October 31, 2024, 03:35:04 AM
#66
Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

AWS will give you a free EC2 instance to run your parser on.   Personally, I run two parsers on two instances (Europe and America) that access a central database controlling the frequency.  Costs me about $5/month.

If you don't need to login, you could use your AWS $300 credit to make thirty parsers, each hitting the forum once per second.  Do that for a month and you can reduce that to the free tier to stay up to date.

Smiley
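
To illustrate the central-database part: a rough sketch where each parser claims the next URL from a shared queue and paces itself (the table, columns and one-request-per-second pacing are made up for the example):

Code:
// Sketch: several scraper instances share one Postgres queue so no page is fetched twice
// and the overall request rate stays bounded. Table/column names are hypothetical.
import { Pool } from 'pg';

const pool = new Pool(); // connection settings from the usual PG* environment variables

async function claimNextUrl(): Promise<string | null> {
  // Atomically claim one pending URL so two instances never grab the same page.
  const { rows } = await pool.query(
    `UPDATE crawl_queue
        SET claimed_at = now()
      WHERE id = (SELECT id FROM crawl_queue
                   WHERE claimed_at IS NULL
                   ORDER BY id
                   FOR UPDATE SKIP LOCKED
                   LIMIT 1)
      RETURNING url`
  );
  return rows[0]?.url ?? null;
}

async function run() {
  for (let url = await claimNextUrl(); url; url = await claimNextUrl()) {
    await fetch(url); // fetch and store the page (storage omitted here)
    await new Promise((r) => setTimeout(r, 1000)); // roughly one request per second per instance
  }
}

run().catch(console.error);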
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 31, 2024, 03:13:40 AM
#65
See, that's the thing - there isn't actually a delay built into the scraper, but there sometimes seems to be a long waiting time when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.

I have no idea, to be honest. But I don't want to try to diagnose this, as these kinds of performance issues tend to be very hard to reproduce and frustrating to debug.

Well, that explains why it didn't work the first time I made the search: it didn't show any results, but after searching the same terms in other browsers, Google decided to show the result. And that's why I included the results from other search engines; maybe those are the right tools for a search engine.

We could write a tool that calls those engines' APIs, runs the same search in all of them, and gives back the forum links. That would be a cool tool.

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.
legendary
Activity: 3346
Merit: 3130
October 21, 2024, 08:44:53 AM
#64
GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

Well, that explains why it didn't work the first time I made the search: it didn't show any results, but after searching the same terms in other browsers, Google decided to show the result. And that's why I included the results from other search engines; maybe those are the right tools for a search engine.

We could write a tool that calls those engines' APIs, runs the same search in all of them, and gives back the forum links. That would be a cool tool.
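
A small sketch of that idea: build the same site-restricted query for every engine (these are the public search URLs; proper automation would need each engine's official API and keys, which this doesn't cover):

Code:
// Sketch: build the same `site:bitcointalk.org` query for several engines.
// These are the public web-search URLs; automated retrieval of results would need
// each engine's official API (keys, quotas), which is not shown here.
const engines: Record<string, string> = {
  google: 'https://www.google.com/search?q=',
  bing: 'https://www.bing.com/search?q=',
  duckduckgo: 'https://duckduckgo.com/?q=',
  yandex: 'https://yandex.com/search/?text=',
};

function buildQueries(terms: string): Record<string, string> {
  const query = `site:bitcointalk.org ${terms}`;
  return Object.fromEntries(
    Object.entries(engines).map(([name, base]) => [name, base + encodeURIComponent(query)])
  );
}

console.log(buildQueries('intitle:"x330,000" "seoincorporation"'));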
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 09:31:43 AM
#63
GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

See, that's the thing - there isn't actually a delay built into the scraper, but there sometimes seems to be a long waiting time when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.
legendary
Activity: 3346
Merit: 3130
October 17, 2024, 08:40:13 AM
#62
It has been months since you started this project, mate, and after reading the threads I ask myself about the approach... Do we really need to download the full forum to have a good search tool?

I don't think so. We can use the online search engines with the right commands to find the right thread; let me show you how:

GOOGLE:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

YAHOO:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

DuckDuckGo

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Yandex

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Bing

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"


If we know how to use the search engines, we should be able to find anything... Let me share some search operators you can use:

Quote
Quotation Marks    Used to search for an exact phrase or sequence of words
Minus Sign (-)    Excludes specific words or phrases from search results
Asterisk (*)    Acts as a wildcard to represent any word or phrase in a search
Double Dots (..)    Used for number range searches
Site:    Restricts search results to a specific site or domain
Define:    Provides definitions of terms
Filetype:    Filters results by specific file type
Related:    Displays sites similar to the specified web page
Cache:    Shows the cached version of a web page
Link:    Finds pages that link to the specified URL
Inurl:    Searches for terms in the URL of web pages
Allinurl:    Searches for all terms in the URL of web pages
Intitle:    Searches for terms in the title of web pages
Allintitle:    Searches for all terms in the title of web pages
Intext:    Searches for terms in the body of web pages
Time:    Shows current time in various locations
Weather:    Shows weather conditions and forecasts for a location
Stocks:    Shows stock information
Info:    Displays some information that Google has about a web page
Book:    Find information about books
Phonebook:    Finds phone numbers
Movie:    Find information about movies
Area code:    Searches for the area code of a location
Currency:    Converts one currency to another
~    Used to include synonyms or similar terms in a search
AROUND(X)    Searches for words within X words of each other
City1 City2    Searches for pages containing both cities
Author:    Searches for content by a specific author
Source:    Finds news articles from a specific source
Map:    Shows maps related to the search query
Daterange:    Searches within a specific date range
Safesearch:    Filters out explicit content from search results
Music:    Find music information
Patent:    Searches for patents
Clinical trials:    Finds information on clinical trials

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 04:54:27 AM
#61
Suggestion: use my download (from years ago), and keep the posts that have been changed or removed. It would be nice if you could make it optional to search only the most recent version, or also older versions of all posts.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 04:12:13 AM
#60
And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).

So I might introduce some sort of tagging system and just manually add tags to some posts (yes, posts, not threads) so users can at least find what they are looking for properly.
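
The metadata itself could stay very simple; a rough sketch of what a manually tagged post might look like (the type, field names and example values are hypothetical):

Code:
// Rough sketch: a hand-curated tag list attached to individual posts.
// The type, field names and example values are hypothetical.
type TaggedPost = {
  topic_id: number;
  post_id: number;
  tags: string[]; // e.g. the original mixer name, added by hand
};

const example: TaggedPost = {
  topic_id: 1234567,  // hypothetical topic
  post_id: 89012345,  // hypothetical post
  tags: ['banned-mixer', 'mixer-name-here'],
};

// Searching by tag is then a simple filter over stored posts.
function findByTag(posts: TaggedPost[], tag: string): TaggedPost[] {
  return posts.filter((p) => p.tags.includes(tag));
}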
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:56:12 AM
#59
See, that's the thing - there isn't actually a delay built into the scraper, but there sometimes seems to be a long waiting time when I make any request to Bitcointalk.
When I click "All", I have to click "Verify" from Cloudflare. That indeed means they're "tougher" than usual.

Quote
Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (if I ever do get rate-limited, there is an exponential backoff in place); however, in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
I just tried this oneliner:
Code:
i=5503125; time while test $i -le 5503134; do wget "https://bitcointalk.org/index.php?topic=$i.0" -O $i.html; sleep 1; i=$((i+1)); done
It took 13 seconds on my whitelisted server, and 12 seconds on a server that isn't whitelisted. Both include 9 seconds of "sleep".
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:44:56 AM
#58
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?

See, that's the thing - there isn't actually a delay built into the scraper, but there sometimes seems to be a long waiting time when I make any request to Bitcointalk.

Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (if I ever do get rate-limited, there is an exponential backoff in place); however, in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
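
For reference, the backoff is just the usual doubling pattern; a minimal sketch (the starting delay, cap, retry count and status handling are illustrative, not the scraper's exact values):

Code:
// Minimal exponential-backoff sketch (starting delay, cap, retry count and status
// handling are illustrative; the scraper's actual values may differ).
async function fetchWithBackoff(url: string, maxRetries = 6): Promise<Response> {
  let delayMs = 1_000;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    // Cloudflare rate-limit pages (error 1015) are typically served as HTTP 429.
    if (res.status !== 429 && res.status !== 503) return res;
    await new Promise((r) => setTimeout(r, delayMs));
    delayMs = Math.min(delayMs * 2, 60_000); // double the wait, capped at one minute
  }
  throw new Error(`Still rate-limited after ${maxRetries} retries: ${url}`);
}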