Topic: Bitcointalk Search Project (Read 974 times)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
December 20, 2024, 01:53:41 AM
#77
Quote
I don't know if that's going to be fast or not
Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.

I like this option. No changes to my script required and only a small administrative addition to the server.
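For reference, a minimal sketch of such a cronjob (the host alias and paths are hypothetical placeholders):

Code:
# Pull new result files from the scraper box every 5 minutes.
# 'scraper' is a hypothetical SSH host alias; adjust paths to your setup.
*/5 * * * * rsync -az --partial scraper:/var/scraper/results/ /srv/search/results/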
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
December 20, 2024, 01:35:14 AM
#76
I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.
Before my dedicated server disappeared, I used to mount its directories on another server through sshfs. That worked fine and didn't disconnect as long as both servers were running.
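For anyone trying the same, a minimal sketch of such a mount (the host and paths are hypothetical; the reconnect options help with dropped connections):

Code:
# Mount the old VPS's results directory locally over SSH.
sshfs -o reconnect,ServerAliveInterval=15 user@old-vps:/srv/results /mnt/old-vps

# Unmount when done:
fusermount -u /mnt/old-vps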

Quote
I don't know if that's going to be fast or not
Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
December 20, 2024, 01:31:05 AM
#75
I have a new VPS now, but I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.

Currently I don't have that implemented yet, but it wouldn't hurt to try. I don't know if that's going to be fast or not, but it's just one file per second and both of the servers have gigabit lines.

The scraper hasn't been running for a while, so now might be a good time for me to try it.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
December 01, 2024, 03:58:30 PM
#74
Of course broken quotes can't be easily treated, maybe with AI, which is out of my league.

I wrote one of the first PPC search engines after goto/google (sold it to a US corp in 2000), and it was tough back then just to determine the actual keywords.  This idea is a lot more difficult due to multiple sources in the same post and all the modern language sets.  I'm giving up on building any kind of search engine on my data, other than the basic SQL queries.  I'll use my data for something interesting and useful that has not been done yet.  Smiley
legendary
Activity: 2758
Merit: 6830
December 01, 2024, 09:27:58 AM
#73
Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.
A second run seems a waste of time and resources.

For sure there should be some untested cases, but I've been getting good results with this:

Code:
// The snippet assumes cheerio for HTML parsing; these imports were omitted above.
// (With modern cheerio typings, the element type takes a type argument, e.g. Cheerio<AnyNode>.)
import * as cheerio from 'cheerio';
import { load } from 'cheerio';

type PostContent = {
  raw_content: string;
  content: string;
  quoted_users: string[];
  quotes: string[];
};

function extractPostContent(html: string): PostContent {
  const $ = load(html);

  const result: PostContent = {
    raw_content: html,
    content: '',
    quoted_users: [],
    quotes: []
  };

  // Extract the visible text of an element: work on a clone, convert direct
  // <br> children to spaces, and blank out quote headers plus their quoted blocks.
  function extractTextContent(element: cheerio.Cheerio): string {
    return element
      .clone()
      .children('br')
      .each((_, el) => {
        $(el).replaceWith(' ');
      })
      .end()
      .children('.quoteheader')
      .each((_, el) => {
        if ($(el).children('a').length > 0) {
          $(el.next).remove();
        }
        $(el).text(' ');
      })
      .end()
      .text()
      .trim();
  }

  // Record the quote's author (from the "Quote from: X on ..." header) and its
  // text, then recurse into directly nested quotes.
  function processQuote(element: cheerio.Cheerio) {
    const quoteHeader = element.prev('.quoteheader');
    if (quoteHeader.length) {
      const userMatch = quoteHeader.text().match(/Quote from: (.+?) on/);
      if (userMatch) {
        result.quoted_users.push(userMatch[1]);
      }
    }

    const quoteContent = extractTextContent(element);
    if (quoteContent) {
      result.quotes.push(quoteContent);
    }

    element.find('> .quote').each((_, nestedQuote) => {
      processQuote($(nestedQuote));
    });
  }

  // Only process top-level quotes whose header links to a post; nested quotes
  // are reached through the recursion in processQuote.
  $('.quote').each((_, quote) => {
    if ($(quote).parent().hasClass('quote') || $(quote).prev('.quoteheader').children('a').length === 0) return;
    processQuote($(quote));
  });

  // Whatever remains in <body> after the quote stripping is the post's own text.
  result.content = extractTextContent($('body'));

  $('.quoteheader').each((_, element) => {
    if ($(element).children('a').length > 0) {
      const elementText = $(element.next).text();
      result.content = result.content.replace(elementText, '');
    }
  });

  result.content = result.content.trim();
  result.quoted_users = [...new Set(result.quoted_users)];

  return result;
}
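For illustration, a quick usage sketch (the sample HTML is hypothetical, mimicking the SMF quoteheader/quote markup that the selectors above target):

Code:
const sample =
  '<div class="quoteheader"><a href="#msg1">Quote from: LoyceV on Today</a></div>' +
  '<div class="quote">Original quoted text</div>' +
  'My reply to the quote.';

const post = extractPostContent(sample);
console.log(post.quoted_users); // [ 'LoyceV' ]
console.log(post.quotes);       // [ 'Original quoted text' ]
console.log(post.content);      // 'My reply to the quote.'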

Of course broken quotes can't be easily treated, maybe with AI, which is out of my league.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
December 01, 2024, 08:41:25 AM
#72
OP - have you developed a way to logically parse through each post and assign quotes to the proper person?   Because it's open input, people can modify it to anything, so a well organized system is necessary.    It's bugging the heck out of me, been working on it for two days now.  :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share.   Grin

Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
November 30, 2024, 01:15:54 AM
#71
OP - have you developed a way to logically parse through each post and assign quotes to the proper person?   Because it's open input, people can modify it to anything, so a well organized system is necessary.    It's bugging the heck out of me, been working on it for two days now.  :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share.   Grin
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
November 26, 2024, 08:57:52 PM
#70
Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

https://aws.amazon.com/about-aws/whats-new/2023/10/new-amazon-ec2-bare-metal-instances/

Again, AWS gives $300 - $1300 to new accounts.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
November 25, 2024, 05:05:29 AM
#69
Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.
I checked the 3 providers where I have accounts (RamNode, HostVDS and RackNerd), and none of them have Arch Linux as a default choice. If you don't mind me asking: why Arch Linux? Would uploading your own ISO be an option? I've never tried it, but RamNode supports it.
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
November 25, 2024, 04:28:10 AM
#68
Welp, it looks like my IP was rate-limited immediately after, according to logs. Time to spin up a new server Smiley
Do you mean that your IP is blocked by Cloudflare? If so, why don't you use a VPN?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
November 24, 2024, 10:51:05 PM
#67
My scraper managed to crawl half of the Wall Observer topic before finally being defeated by Cloudflare. It ingested the pages in a bit over a day.

Still, an impressive achievement, considering this is the longest topic on the website, by a wide margin.

I'm going to add the capability of checking for new messages on a thread sooner or later.



Welp, it looks like my IP was rate-limited immediately after, according to logs. Time to spin up a new server Smiley

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
October 31, 2024, 03:35:04 AM
#66
Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

AWS will give you a free EC2 instance to run your parser on.   Personally, I run two parsers on two instances (Europe and America) that access a central database controlling the frequency.  Costs me about $5/month.

If you don't need to log in, you could use your AWS $300 credit to make thirty parsers, each hitting the forum once per second.  Do that for a month and you can reduce that to the free tier to stay up to date.

Smiley
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 31, 2024, 03:13:40 AM
#65
See, that's the thing - there isn't actually a delay built into the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.

I have no idea to be honest. But I don't want to try to diagnose this, as debugging these kinds of performance issues tends to be very non-reproducible and frustrating.

Well, that explains why it didn't work the first time I made the search: it didn't show any result, but after searching the same terms in other browsers, Google decided to show the search result. That's why I included the results from other search engines; maybe those are the right tools for a search engine.

We could write a tool that calls those engines' APIs, runs the same search on all of them, and returns the forum links. That would be a cool tool.

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.
legendary
Activity: 3346
Merit: 3130
October 21, 2024, 08:44:53 AM
#64
GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

Well, that explains why it didn't work the first time I made the search: it didn't show any result, but after searching the same terms in other browsers, Google decided to show the search result. That's why I included the results from other search engines; maybe those are the right tools for a search engine.

We could write a tool that calls those engines' APIs, runs the same search on all of them, and returns the forum links. That would be a cool tool.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 09:31:43 AM
#63
GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

See, that's the thing - there isn't actually a delay built into the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.
legendary
Activity: 3346
Merit: 3130
October 17, 2024, 08:40:13 AM
#62
It has been months since you started this project, mate, and after reading the threads I ask myself about the approach... Do we really need to download the full forum to have a good search tool?

I don't think so. We can use the online search engines with the right commands to find the right thread; let me show you how:

GOOGLE:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

YAHOO:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

DuckDuckGo

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Yandex

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Bing

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"


If we know how to use the search engines, we should be able to find anything... Let me share some operators for your search:

Quote
Quotation Marks    Used to search for an exact phrase or sequence of words
Minus Sign (-)    Excludes specific words or phrases from search results
Asterisk (*)    Acts as a wildcard to represent any word or phrase in a search
Double Dots (..)    Used for number range searches
Site:    Restricts search results to a specific site or domain
Define:    Provides definitions of terms
Filetype:    Filters results by specific file type
Related:    Displays sites similar to the specified web page
Cache:    Shows the cached version of a web page
Link:    Finds pages that link to the specified URL
Inurl:    Searches for terms in the URL of web pages
Allinurl:    Searches for all terms in the URL of web pages
Intitle:    Searches for terms in the title of web pages
Allintitle:    Searches for all terms in the title of web pages
Intext:    Searches for terms in the body of web pages
Time:    Shows current time in various locations
Weather:    Shows weather conditions and forecasts for a location
Stocks:    Shows stock information
Info:    Displays some information that Google has about a web page
Book:    Find information about books
Phonebook:    Finds phone numbers
Movie:    Find information about movies
Area code:    Searches for the area code of a location
Currency:    Converts one currency to another
~    Used to include synonyms or similar terms in a search
AROUND(X)    Searches for words within X words of each other
City1 City2    Searches for pages containing both cities
Author:    Searches for content by a specific author
Source:    Finds news articles from a specific source
Map:    Shows maps related to the search query
Daterange:    Searches within a specific date range
Safesearch:    Filters out explicit content from search results
Music:    Find music information
Patent:    Searches for patents
Clinical trials:    Finds information on clinical trials

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 04:54:27 AM
#61
Suggestion: use my download (from years ago), and keep the posts that have been changed or removed. It would be nice if you can make it optional to search only the most recent version, or also older versions of all posts.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 04:12:13 AM
#60
And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).

So I might introduce some sort of tagging system and just manually add tags to some posts (yes, posts, not threads) so users can at least find what they are looking for properly.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:56:12 AM
#59
See, that's the thing - there isn't actually a delay built into the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
When I click all, I have to click "Verify" from Cloudflare. That indeed means they're "tougher" than usual.

Quote
Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (if I ever do get rate limited, there is an exponential backoff in place); however, in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
I just tried this oneliner:
Code:
i=5503125; time while test $i -le 5503134; do wget "https://bitcointalk.org/index.php?topic=$i.0" -O $i.html; sleep 1; i=$((i+1)); done
It took 13 seconds on my whitelisted server, and 12 seconds on a server that isn't whitelisted. Both include 9 seconds "sleep".
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:44:56 AM
#58
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?

See, that's the thing - there isn't actually a delay built into the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.

Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (if I ever do get rate limited, there is an exponential backoff in place); however, in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
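As a rough sketch, that kind of backoff around a polite fetch loop could look like this (a hypothetical helper, not the scraper's actual code; Node 18+ global fetch assumed):

Code:
// Hypothetical sketch of exponential backoff around a rate-limited fetch.
async function fetchWithBackoff(url: string, maxRetries = 5): Promise<string> {
  let delayMs = 1000; // baseline: one request per second
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.ok) return res.text();
    if (res.status !== 429 && res.status !== 503) {
      throw new Error(`Unexpected HTTP ${res.status} for ${url}`);
    }
    // Rate-limited (e.g. Cloudflare's 1015 page is served as 429):
    // wait, then double the delay before trying again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    delayMs *= 2;
  }
  throw new Error(`Still rate-limited after ${maxRetries} retries: ${url}`);
}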
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 03:28:12 AM
#57
Due to rate limiting constraints, it can only navigate to about 2.5k
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?
At 70k threads per month, it's going to take you 6.5 years to scrape all 5.5 million threads. I remember it took me about 4 months scraping all the old posts, and that was only up to 2019. If you go from 10 to 1 second per request, it still takes 6 months.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 03:11:29 AM
#56
Those numbers still don't add up. Less than one post per thread?

Sorry, I keep instinctively writing 'post' instead of 'thread'. Just %s/post/thread/g if you know what I mean Smiley

So basically, my scraper is built in terms of topics. It takes a range of topic numbers to scrape, and works through each of them one by one, continuously clicking on the Next Page button in the process until it's on the last page.

Like this, I can scrape 20 posts at once per page load. Though I really wish I could display more of them - that would make the process considerably faster.

About 1/3 or so of the topics I scraped don't exist (deleted, quarantined, or nuked for whatever reason), so the parser runs very fast in those cases. But the topics that do exist usually have only 1-2 pages, and that is where the scraper spends the majority of its running time.

Occasionally I come across 10+ or 50+ page topics. But just having a lot of pages is not going to pull down the average number of scraped topics per day unless there are hundreds of pages in the topic. Due to rate limiting constraints, it can only navigate to about 2.5k pages per day, including the 'Not Found' pages.

Here is a typical scrape in progress (this is only the log file. I actually built a whole user interface around this):

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 02:46:49 AM
#55
My bad guys, I meant to write 100K threads. Grin  Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month.
Those numbers still don't add up. Less than one post per thread?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 02:28:01 AM
#54
I crossed 100K posts scraped a few days ago.
If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?

My bad guys, I meant to write 100K threads. Grin  Bad typo.

But at the rate this is going, it's going to be scraping like 70k posts a month. So I better start processing that archive you gave me.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 17, 2024, 02:18:05 AM
#53
I crossed 100K posts scraped a few days ago.
If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
October 17, 2024, 02:02:07 AM
#52
I crossed 100K posts scraped a few days ago.
Wouldn't it take years until you reach the 60 millionth post?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 17, 2024, 01:46:08 AM
#51
I crossed 100K posts threads scraped a few days ago.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
October 07, 2024, 05:06:38 AM
#50
I can't figure out how one would effectively make a search by an image.
I'd say just search the text:
Code:
[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]
So a search for "Anita calculators" should pop up this post (Ninjastic can't find it), but also "Anita841_5" or better "Anita841". Anyone searching for the word "images" is on his own, but finding the right posts when you search for "assets" is probably going to be a challenge.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 05, 2024, 11:31:04 PM
#49
Damn, I just realized that threads like this with only images on them are unscrapeable.

As in, the topic is indexed, but since there is no text, the message contents are all empty.

I can't figure out how one would effectively make a search by an image.

On a side-note, I am approaching Wall Observer-sized threads. Let's see if my scraper can grok them. It can swallow threads with a thousand or so pages, but I've never tested with tens of thousands.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 01, 2024, 10:15:46 AM
#48
In all seriousness though, I'm not in favor of exposing SQL directly, not only because of this:

'Tis a good idea - but beware SQL injection!

But also because it would be a huge performance issue very fast. Captchas will not protect the database much, as there are bot-solvers you can rent for sats per hour.

At any rate, the data was always going to be placed in an Elasticsearch cluster, not an SQL database. Searching inside paragraphs using SELECT is so difficult that it might not even be possible.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
September 29, 2024, 07:39:43 PM
#47
Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

 Grin

'Tis a good idea - but beware SQL injection!
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
September 28, 2024, 06:16:32 AM
#46
Scraping has resumed

(I am respecting Bitcointalk.org rate limits)
legendary
Activity: 1512
Merit: 7340
Farewell, Leo
September 21, 2024, 03:09:28 PM
#45
Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

 Grin
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
September 19, 2024, 02:26:09 AM
#44
Bump. I just got a new server for the scraper along with a bunch of HTTP proxies.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)

This last part ("Somewhat higher burst accesses are OK") should solve my problem with endless Cloudflare captchas. Besides, the parser is bound to take breaks once I catch up with the database downloads and it is limited to checking for new posts and edits to old posts. I wish there were a mod log that contained edited post events.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
August 06, 2024, 05:52:46 PM
#43
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

I use multiple parsers for backup - if one goes down for whatever reason, a second one can take over.  90% of the time, my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP.  I parse once every ten seconds to check for any new posts, and if there are any, I parse them.  My record locking system has a parse delay for many things to prevent it from hitting bct too often.  I don't even parse as a logged-in user.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
August 06, 2024, 10:14:09 AM
#42
My scraper was broken by Cloudflare after about 58K posts or so.
If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems when Cloudflare goes in full DDoS protection mode.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 28, 2024, 01:28:13 AM
#41
I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

You are still thinking of ONE parser going out pretending to be another parser.  You are fighting against every fraud detection tool out there.  

Create a schedule table in your database.  Columns include jobid, lockid, lastjob and parsedelay.  When your parser grabs a job, it locks it in the table so the next parser will grab a different job.  It releases the lock when it finishes.  Your parser can take the first record in the schedule, ordered by (lastjob + parsedelay), where lockid is free.

Edit:  Then go to one of the cloud providers and use a free service to create a second parser.
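A minimal sketch of that schedule table (MySQL syntax assumed; the column names come from the post above, the locking details are an assumption):

Code:
CREATE TABLE schedule (
  jobid      INT PRIMARY KEY,
  lockid     VARCHAR(64) NULL,   -- NULL means the job is free
  lastjob    DATETIME NOT NULL,  -- when the job last ran
  parsedelay INT NOT NULL        -- minimum seconds between runs
);

-- A parser claims the most overdue free job, and clears lockid when done.
UPDATE schedule
SET lockid = 'parser-eu-1', lastjob = NOW()
WHERE lockid IS NULL
  AND DATE_ADD(lastjob, INTERVAL parsedelay SECOND) <= NOW()
ORDER BY lastjob
LIMIT 1;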
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 28, 2024, 01:22:18 AM
#40
I explained how to do this five posts ago.... Smiley    ^^^

Wow you're fast  Grin

You mean this?

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 28, 2024, 01:19:15 AM
#39
My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I explained how to do this five posts ago.... Smiley    ^^^
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 28, 2024, 01:18:15 AM
#38
My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I do however have LoyceV's archive (thanks Loyce), but I am not sure whether it covers posts before 2018.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 20, 2024, 02:42:12 PM
#37
Is there no option to ask Theymos to offer - in one way or another - the list of "last" edited posts (maybe after a date)?
I mean, the forum knows them; it only has to expose them somehow.

The only reason he couldn't do this would be if the indexing slowed down the site.  I can't imagine any native SMF code that searches based on that key.  :/
legendary
Activity: 3668
Merit: 6382
Looking for campaign manager? Contact icopress!
July 20, 2024, 02:10:16 PM
#36
Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with how frequently they posted on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users who might actually make an edit, out of the daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

Is there no option to ask Theymos to offer - in one way or another - the list of "last" edited posts (maybe after a date)?
I mean, the forum knows them; it only has to expose them somehow.
legendary
Activity: 1820
Merit: 2700
Crypto Swap Exchange
July 20, 2024, 01:53:43 PM
#35
It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

I also think that it won't be feasible, at least not with a single scraper. As Vod said, his post history exceeds 21,000 entries. Continuously monitoring all his posts for edits would be very resource-intensive and/or time-consuming. And what about other members with even more activity, like philipma1957, BADecker, JayJuanGee, franky1, and others? We're talking hundreds of thousands of posts that would need parsing every day...
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 20, 2024, 01:11:28 PM
#34
That's what I'm saying. I will have to manually go through all the user's posts (but hey, at least I'll have their message and topic IDs this time!), but I know that statistically, I will only need to check a few users at a time, because editing is infrequent compared to posting.

Manual processes will cause this project to fail - too much data.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 20, 2024, 12:56:48 PM
#33
There is no way for you to track which post I may edit.  You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited.  The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.

That's what I'm saying. I will have to manually go through all the user's posts (but hey, at least I'll have their message and topic IDs this time!), but I know that statistically, I will only need to check a few users at a time, because editing is infrequent compared to posting.

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

* When the initial download is finished
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 20, 2024, 12:45:13 PM
#32
This leaves a small subset of users who might actually make an edit, out of the daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

There is no way for you to track which post I may edit.  You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited.  The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 20, 2024, 12:03:19 PM
#31
We have seen a lot of threads with the word "Reserved", meant to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

This is going to be a live search engine, so every post is going to be kept up to date, and removed if the original post is removed.

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with how frequently they posted on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users who might actually make an edit, out of the daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

The number of new posts daily >>> The number of edited posts daily

*I do not currently track the "last edited" time because it is an unreliable indicator for determining whether a given post might be edited in the future.
legendary
Activity: 2758
Merit: 6830
July 20, 2024, 09:25:56 AM
#30
We have seen a lot of threads with the word "Reserved", meant to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
There is no easy solution for that. It's impossible to keep track of 55 million posts.
legendary
Activity: 3346
Merit: 3130
July 20, 2024, 08:32:42 AM
#29
I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process.  If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area. 

Ok, then what would happen if a user edited his post after it was recorded in the search engine database? It would be impossible to search for new information in edited posts, and I feel like that will be a problem for this project.

We have seen a lot of threads with the word "Reserved", meant to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 19, 2024, 09:45:52 AM
#28
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
Is that a compressed tarball?
Yes.

Quote
Because for my daily backups I usually tar my folders without compression to make it go many times faster.
I'm using pigz, which now only uses 1% of 1 CPU core. Reading from disk is the limitation. I thought I'd do you a favour by making one file instead of giving you half a million compressed files.

Quote
I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.
By all means: use my data for this Smiley Only a few posts are censored by me. Other than that, the file format is pretty much the same everywhere.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 19, 2024, 07:12:29 AM
#27
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball? Because for my daily backups I usually tar my folders without compression to make it go many times faster.



I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.

It would also give me time to figure out the best setup for this.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 19, 2024, 12:56:46 AM
#26
For now, I am scraping topics from the forum using my bot.
If it helps, I can give you a tar.gz copy of my data
Sure, you can send me a copy by PM.
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 18, 2024, 11:01:23 PM
#25
I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process.  If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area. 
legendary
Activity: 3346
Merit: 3130
July 18, 2024, 10:46:24 PM
#24
I'm curious about the method that you are using to get the data from each thread. There is a command on Linux called lynx; it is a web browser for the command line, and with it, you can get the text from a website or the source code:

Code:
lynx --dump "https://bitcointalk.org/index.php?topic=5503125.0"

Code:
lynx --source "https://bitcointalk.org/index.php?topic=5503125.0"

You could use some tools like cut and grep to get only the relevant data. Making the script would be the easy part; getting the data from 5.5 million threads will be the hard part, lol. And the fact that each thread can have multiple pages makes it a challenge.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 18, 2024, 02:59:36 PM
#23
That's a security risk though, since it would require me to also store my password in plain text.

Never store passwords in plain text.  Use a secrets manager, like AWS Secrets Manager, to supply your password at runtime.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 18, 2024, 12:51:20 PM
#22
If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

I could try checking the next sequence of 100 topic IDs or so when looking for new topics, since it's extremely unlikely that they were all deleted.
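A minimal sketch of that probing idea (hypothetical helpers; the SMF error-page text check is an assumption):

Code:
// Probe the next 100 topic IDs past the last known one, at 1 request/second.
async function topicExists(id: number): Promise<boolean> {
  const res = await fetch(`https://bitcointalk.org/index.php?topic=${id}.0`);
  if (!res.ok) return false;
  const body = await res.text();
  // SMF serves an error page for missing or nuked topics.
  return !body.includes('The topic or board you are looking for');
}

async function findNewTopics(lastKnownId: number, count = 100): Promise<number[]> {
  const found: number[] = [];
  for (let id = lastKnownId + 1; id <= lastKnownId + count; id++) {
    if (await topicExists(id)) found.push(id);
    await new Promise((resolve) => setTimeout(resolve, 1000)); // stay polite
  }
  return found;
}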
legendary
Activity: 2758
Merit: 6830
July 18, 2024, 12:34:56 PM
#21
The problem is knowing when to index new threads; without RSS, we have to "ping" each board to know if there is new content.
You don't have to do that.

If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

For my scraper, I did more of a workaround: checking the URL of the most recent posts to see if a post's thread has zero replies, which would imply the post is an OP.
legendary
Activity: 3346
Merit: 3130
July 18, 2024, 09:09:58 AM
#20
if you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in google (filetype:, site:,...).

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data, put it together in a database, and then your search engine makes calls to that database. But for this, you will have to keep that database updated live, or at least have a cron job that adds the new data every so often.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack on the current search engine.

If you have another way in mind, I would love to know how it works.

It would be better if there were a way to improve SMF's search function or query the relational database directly, but I don't know if Theymos would give anybody direct access to the database or allow anybody to completely rework SMF's source code.
I don't think he would, and for good reason of course... It would require absolute trust in the person building the search engine.

But you're right, it would be completely rebuilding bitcointalk's database, like several other members are doing as well (more or less)..

If anyone has the access to implement a change like this in the forum, it is our verified hacker PowerGlove (https://bitcointalksearch.org/user/powerglove-3486361), but the fact that the search function doesn't work well at all must be for a reason. Maybe the forum used to suffer some kind of attacks from that vector.

This project would be easy if RSS were still active on the forum, but sadly it has been removed:

https://bitcointalk.org/index.php?type=rss;action=.xml

Quote
action=.xml is disabled due to slowness. If you use this, write a post in Meta explaining your usage.

The problem is knowing when to index new threads; without RSS, we have to "ping" each board to know if there is new content.

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 18, 2024, 06:23:55 AM
#19
15,000 out of 1.7 million threads scraped so far, all topics being scraped in numerical order.

I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

You CAN index DT information by running your parser under your account.   See https://bitcointalk.org/captcha_code.php  (it no longer works for my account but yours is probably fine)

That's a security risk though, since it would require me to also store my password in plain text.

Even if I use one of the bots (BotATether or Jarvis), the added hassle of dealing with authentication would actually slow down post collection. Currently, for each thread I'm launching a new browser - this helps me stay within the rate limits.

I might have to do a separate scrape for users specifically, to get everybody's DT information without duplicating stuff. But I don't think that's going to be happening anytime soon.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 17, 2024, 03:41:48 PM
#18
so it will be a lot better than what I currently have on ninjastic.space.

I watch your work with interest - I love large datasets and you seem to know the presentation layer well!

And then there is LoyceV - he has all the information in text format and is a whiz with queries, but he cannot present the info like you do.  Loyce.club can be the fastest to find exactly what one is looking for, but I work with your website if I only have partial info.
legendary
Activity: 2758
Merit: 6830
July 17, 2024, 02:32:01 PM
#17
You must have some insane AI-ish parsing going.  Many posts have broken quote HTML.  Sad
I don't; in this case there isn't much to be done... If it's broken, it's broken.

But most posts don't have this problem, so it will be a lot better than what I currently have on ninjastic.space.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 17, 2024, 02:28:12 PM
#16
I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

You CAN index DT information by running your parser under your account.   See https://bitcointalk.org/captcha_code.php  (it no longer works for my account but yours is probably fine)

Even if there are nested quotes, they are treated individually and also indexed as their own.

You will be able to search only the content, only the quotes, quotes from X user, or both the content and quotes.

You must have some insane AI-ish parsing going.  Many posts have broken quote HTML.  Sad
legendary
Activity: 2758
Merit: 6830
July 17, 2024, 12:24:50 PM
#15
Personally I am against indexing quotes inside posts, since that runs the risk of corpus duplication, which gives the quoted parts more weight in the search results - meaning the posts that are quoted the most are also the most likely to be returned at the top (most likely in the form of some person's reply to it).
I've separated quotes from post content.

That way there is:

- post content
- quotes

There's also the issue of nested quotes, which are hell to deal with in a database, and even in Elasticsearch. They can lead to infinitely recursive schemas/JSON before you can even parse them completely.
Even if there are nested quotes, they are treated individually and also indexed as their own.

Take this post of mine for example. There is the content (everything that is NOT a quote of another user, like this text itself) and both quotes from author NotATether you can see above.

You will be able to search only the content, only the quotes, quotes from X user, or both the content and quotes.
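With the fields separated like that, an Elasticsearch query could look roughly like this (index and mapping are assumptions; quoted_users would need to be a keyword field for the term filter):

Code:
GET /posts/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "quotes": "segwit" } } ],
      "filter": [ { "term": { "quoted_users": "LoyceV" } } ]
    }
  }
}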
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 17, 2024, 12:04:35 PM
#14
Personally I am against indexing quotes inside posts, since that runs the risk of corpus duplication
In most cases, I agree. But if the quote comes from an external website, the content can still be relevant to the search query.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 17, 2024, 11:57:20 AM
#13
Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.
Soon™ ... Wink



Personally I am against indexing quotes inside posts, since that runs the risk of corpus duplication, which gives the quoted parts more weight in the search results - meaning the posts that are quoted the most are also the most likely to be returned at the top (most likely in the form of some person's reply to it).

There's also the issue of nested quotes, which are hell to deal with in a database, and even in Elasticsearch. They can lead to infinitely recursive schemas/JSON before you can even parse them completely.
legendary
Activity: 2758
Merit: 6830
July 17, 2024, 11:36:44 AM
#12
Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.
Soon™ ... Wink

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 17, 2024, 10:40:58 AM
#11
This thing is going to use Elasticsearch, so if I can figure out how to handle the multilingual text, it may support searching in different languages too.
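A common approach is per-language subfields, roughly like this (the index name and language set are assumptions; english, russian and spanish are built-in Elasticsearch analyzers):

Code:
PUT /posts
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "ru": { "type": "text", "analyzer": "russian" },
          "es": { "type": "text", "analyzer": "spanish" }
        }
      }
    }
  }
}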
legendary
Activity: 3612
Merit: 5297
https://merel.mobi => buy facemasks with BTC/LTC
July 17, 2024, 12:34:14 AM
#10
if you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in google (filetype:, site:,...).

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data, put it together in a database, and then your search engine makes calls to that database. But for this, you will have to keep that database updated live, or at least have a cron job that adds the new data every so often.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack on the current search engine.

If you have another way in mind, I would love to know how it works.

It would be better if there were a way to improve SMF's search function or query the relational database directly, but I don't know if Theymos would give anybody direct access to the database or allow anybody to completely rework SMF's source code.
I don't think he would, and for good reason of course... It would require absolute trust in the person building the search engine.

But you're right, it would be completely rebuilding bitcointalk's database, like several other members are doing as well (more or less)..

Whenever I see somebody building extensions, offsite tools, or proposing changes to SMF, I can't help but wonder how epochtalk is doing, and whether epochtalk would solve the problem without requiring browser plugins, offsite tools, scraping,... Don't get me wrong: the current forum software lacks several features, and I'm happy if somebody builds them (even if it's on a different domain, or requires me to install a browser plugin); I just wonder whether we'll ever switch to the new forum software.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 16, 2024, 01:16:47 PM
#9
it should not look inside quotes for keywords.

This is what I'm currently having issues with.  People break the BB quote code all the time in their posts.
legendary
Activity: 3346
Merit: 3130
July 16, 2024, 09:07:18 AM
#8
if you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in google (filetype:, site:,...).

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data, put it together in a database, and then your search engine makes calls to that database. But for this, you will have to keep that database updated live, or at least have a cron job that adds the new data every so often.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack on the current search engine.

If you have another way in mind, I would love to know how it works.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 16, 2024, 07:03:39 AM
#7
For now, I am scraping topics from the forum using my bot.
If it helps, I can give you a tar.gz copy of my data (note: some posts are missing). I shared it with Ninjastic years ago, and it saves you several months of scraping. Freshly scraping will get you a more recent edit though, and fewer deleted posts.

Sure, you can send me a copy by PM.

if you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in google (filetype:, site:,...).

It would be nice if I could make a query like `user:Theymos board:Bitcoin\Project_Development +wallet -knots taproot` and only see posts made by Theymos in the Project Development board that contained the word wallet, did not contain the word knots, and hopefully contained the word taproot.

I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

My bot can handle anonymous users too.
legendary
Activity: 3612
Merit: 5297
https://merel.mobi => buy facemasks with BTC/LTC
July 16, 2024, 05:49:26 AM
#6
if you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in google (filetype:, site:,...).

It would be nice if I could make a query like `user:Theymos board:Bitcoin\Project_Development +wallet -knots taproot` and only see posts made by Theymos in the Project Development board that contained the word wallet, did not contain the word knots, and hopefully contained the word taproot.
legendary
Activity: 2870
Merit: 7490
Crypto Swap Exchange
July 16, 2024, 05:19:21 AM
#5
Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.

Sorry for not being specific. I meant features such as the "Date Range (UTC)" filter, choosing one or more boards (optionally with their child boards), and sign support (+, -, | and "").
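For what it's worth, Elasticsearch's simple_query_string query supports exactly those +, -, | and "" operators out of the box (index and field names are assumptions):

Code:
GET /posts/_search
{
  "query": {
    "simple_query_string": {
      "query": "+wallet -knots \"merkle root\" | taproot",
      "fields": [ "content" ],
      "default_operator": "and"
    }
  }
}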
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 16, 2024, 04:27:46 AM
#4
For now, I am scraping topics from the forum using my bot.
If it helps, I can give you a tar.gz copy of my data (note: some posts are missing). I shared it with Ninjastic years ago, and it saves you several months of scraping. Freshly scraping will get you a more recent edit though, and fewer deleted posts.

1. Sort by relevancy.
This would be the one thing I'd like to see, but also no doubt the most difficult one. Ninjastic often gives me a list of hundreds of posts. A good search engine (like Google 10 years ago) would show what I want to see first.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 16, 2024, 04:26:57 AM
#3
List all the features you want in a search engine here.

How about the features which are already available on https://ninjastic.space/search? Aside from those, I would suggest these features:
1. Sort by relevancy.
2. Showing a message that the search keyword may contain a typo (such as suggesting "bitcoin" when someone enters "bitcon").

Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.

It needs to show only an excerpt with a link and title, like the forum search and Google do; it needs page numbers for browsing the results by page; and most importantly, it should not look inside quotes for keywords.
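Excerpts and page numbers map directly onto Elasticsearch highlighting and from/size pagination (a rough sketch; index and field names are assumptions):

Code:
GET /posts/_search
{
  "query": { "match": { "content": "bitcoin" } },
  "highlight": {
    "fields": { "content": { "fragment_size": 150, "number_of_fragments": 1 } }
  },
  "from": 0,
  "size": 20
}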
legendary
Activity: 2870
Merit: 7490
Crypto Swap Exchange
July 16, 2024, 03:46:29 AM
#2
List all the features you want in a search engine here.

How about the features which are already available on https://ninjastic.space/search? Aside from those, I would suggest these features:
1. Sort by relevancy.
2. Showing a message that the search keyword may contain a typo (such as suggesting "bitcoin" when someone enters "bitcon").
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 16, 2024, 03:01:35 AM
#1
I am trying to make a search engine for Bitcointalk posts, since Google and the built-in one are so bad.

List all the features you want in a search engine here.

For now, I am scraping topics from the forum using my bot. I made sure to identify the requests as coming from me in my program so that the admins know where this traffic is coming from.

It doesn't look like it's exceeding the threshold of one request per second, so that's good.

Private boards are not being scraped. The scraping is being done as a guest.