
Topic: Bitcointalk Search Project - page 3. (Read 774 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 19, 2024, 12:56:46 AM
#26
For now, I am scraping topics from the forum using my bot.
If it helps, I can give you a tar.gz copy of my data
Sure, you can send me a copy by PM.
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 18, 2024, 11:01:23 PM
#25
I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process.  If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area. 
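A minimal sketch of the "grab the links" step described above, assuming the fetched page contains topic links in the same `index.php?topic=<id>.<suffix>` form as the URLs quoted in this thread. The function names are made up for illustration, and the fetch itself plus the 10-second loop are left out:

```python
import re
from typing import List, Set

# Assumption: topic links look like index.php?topic=<id>.<offset> or
# index.php?topic=<id>.msg<number>, as in the URLs quoted in this thread.
TOPIC_LINK = re.compile(r"index\.php\?topic=(\d+)\.(?:msg)?\d+")


def extract_topic_ids(html: str) -> List[int]:
    """Return the unique topic IDs found in the page, in order of first appearance."""
    seen: Set[int] = set()
    ids: List[int] = []
    for match in TOPIC_LINK.finditer(html):
        topic_id = int(match.group(1))
        if topic_id not in seen:
            seen.add(topic_id)
            ids.append(topic_id)
    return ids


def new_topics(html: str, already_seen: Set[int]) -> List[int]:
    """Return topic IDs not seen on earlier polls and remember them for the next poll."""
    fresh = [t for t in extract_topic_ids(html) if t not in already_seen]
    already_seen.update(fresh)
    return fresh
```

A polling loop would fetch the front page, pass the HTML to `new_topics` together with a persistent set, and sleep for the chosen interval before repeating.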
legendary
Activity: 3346
Merit: 3125
July 18, 2024, 10:46:24 PM
#24
I'm curious about the method that you are using to get the data from each thread. There is a command on Linux called lynx, a web browser for the command line, and with it you can get the text from a website or the source code:

Code:
lynx --dump "https://bitcointalk.org/index.php?topic=5503125.0"

Code:
lynx --source "https://bitcointalk.org/index.php?topic=5503125.0"

You could use some tools like cut and grep to get only the relevant data. Making the script would be the easy part; getting the data from 5.5 million threads will be the hard part, lol. And the fact that each thread could have multiple pages makes it a challenge.
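For anyone who would rather stay in one script than pipe lynx through cut and grep, here is a rough Python sketch of the same idea using only the standard library. `dump_text` is a crude stand-in for `lynx --dump` (it keeps visible text only), and `page_urls` assumes that the number after the dot in the topic URL is a post offset that grows by 20 per page; that page size is an assumption, not something stated in this thread:

```python
from html.parser import HTMLParser
from typing import List


class TextDump(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def dump_text(html: str) -> str:
    """Return the visible text of an HTML page, one text fragment per line."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def page_urls(topic_id: int, page_count: int, posts_per_page: int = 20) -> List[str]:
    """Build one URL per page of a topic (posts_per_page is an assumed value)."""
    base = "https://bitcointalk.org/index.php?topic="
    return [f"{base}{topic_id}.{i * posts_per_page}" for i in range(page_count)]
```

The page count still has to come from somewhere, for example from the pagination links on the first page of each topic.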
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 18, 2024, 02:59:36 PM
#23
That's a security risk though, since it would require me to also store my password in plain text.

Never store passwords in plain text.  Use a secrets manager, such as AWS Secrets Manager, to supply your password at runtime.
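As a small illustration of "supply the password at runtime", the sketch below reads it from an environment variable (which a secrets manager or the launching shell would populate) and falls back to an interactive prompt. The variable name `FORUM_PASSWORD` and the function name are hypothetical, and no particular secrets-manager API is shown:

```python
import getpass
import os


def load_password(env_var: str = "FORUM_PASSWORD", prompt=getpass.getpass) -> str:
    """Return the password from the environment, or ask for it interactively.

    Nothing is written to disk; env_var is a hypothetical name chosen for this sketch.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    return prompt("Forum password: ")
```

The scraper would call `load_password()` once at startup and keep the result in memory only.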
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 18, 2024, 12:51:20 PM
#22
If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

I could try checking the next sequence of 100 posts or so, in order to check for new posts, since it's extremely unlikely that they were all deleted.
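The "+1 and look ahead" idea could be sketched roughly like this. The `exists` callback is a placeholder for whatever check the scraper performs (in practice one request per ID), and the 100-ID window is the figure mentioned above:

```python
from typing import Callable, List


def find_new_ids(
    last_known: int,
    exists: Callable[[int], bool],
    lookahead: int = 100,
) -> List[int]:
    """Probe IDs after last_known and return the ones that exist.

    Probing stops after `lookahead` consecutive misses, so a few deleted
    IDs in a row do not end the scan early.
    """
    found: List[int] = []
    misses = 0
    candidate = last_known + 1
    while misses < lookahead:
        if exists(candidate):
            found.append(candidate)
            misses = 0  # a hit resets the window
        else:
            misses += 1
        candidate += 1
    return found
```

For example, `find_new_ids(100, {101, 102, 150}.__contains__)` skips over the gap between 102 and 150 and returns all three IDs.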
legendary
Activity: 2758
Merit: 6830
July 18, 2024, 12:34:56 PM
#21
The problem is knowing when to index the new threads: without RSS we have to "ping" each board to know if there is new content.
You don't have to do that.

If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

For my scraper I used more of a workaround: checking the URL of each of the most recent posts to see whether its thread has zero replies, which would imply the post is an OP.
legendary
Activity: 3346
Merit: 3125
July 18, 2024, 09:09:58 AM
#20
If you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in Google (filetype:, site:,...).

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data and put it together in a database, and then your search engine makes calls to that database. But for this, you will have to live-update that database, or at least have a cron job to add the new data every x amount of time.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack to the current search engine.

If you have another way in mind, I would love to know how it works.

It would be better if there was a way to improve SMF's search function or query the relational database directly, but I don't know if Theymos would give anybody direct access to the database or allow anybody to completely rework SMF's source code.
I don't think he would, and for good reason, of course... It would require absolute trust in the person building the search engine.

But you're right, it would be completely rebuilding bitcointalk's database, like several other members are doing as well (more or less).

If anyone has access to implement a change like this in the forum, it is our verified hacker PowerGlove (https://bitcointalksearch.org/user/powerglove-3486361), but there must be a reason why the search function doesn't work well. Maybe the forum used to suffer some kind of attack through that vector.

This project would be easy if the RSS was still active on the forum, but sadly it has been removed:

https://bitcointalk.org/index.php?type=rss;action=.xml

Quote
action=.xml is disabled due to slowness. If you use this, write a post in Meta explaining your usage.

The problem is knowing when to index the new threads: without RSS we have to "ping" each board to know if there is new content.

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 18, 2024, 06:23:55 AM
#19
15,000 out of 1.7 million threads scraped so far, all topics being scraped in numerical order.

I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

You CAN index DT information by running your parser under your account.   See https://bitcointalk.org/captcha_code.php  (it no longer works for my account but yours is probably fine)

That's a security risk though, since it would require me to also store my password in plain text.

Even if I use one of the bots (BotATether or Jarvis), the added hassle of dealing with authentication will actually slow down post collection. Currently for each thread I'm launching a new browser - this helps me stay within the rate limits.

I might have to do a separate scrape for users specifically, to get everybody's DT information without duplicating stuff. But I don't think that's going to be happening anytime soon.
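On the rate-limit point above: this is not the approach described in the post (which launches a fresh browser per thread), just a generic way to space out requests from a single process. The interval value is left to the caller, since the forum's actual limit is not stated here:

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous wait() call, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps the scraper at or below one request per `min_interval` seconds.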
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 17, 2024, 03:41:48 PM
#18
so it will be a lot better than what I currently have on ninjastic.space.

I watch your work with interest - I love large datasets and you seem to know the presentation layer well!

And then there is LoyceV - he has all the information in text format and is a whiz with queries, but he cannot present the info like you do.  Loyce.club can be the fastest way to find exactly what one is looking for, but I work with your website if I only have partial info.
legendary
Activity: 2758
Merit: 6830
July 17, 2024, 02:32:01 PM
#17
You must have some insane AIish parsing going.   Many posts have broken quote html.  :(
I don't. In this case there isn't much to be done... If it's broken, it's broken.

But most posts don't have this problem, so it will be a lot better than what I currently have on ninjastic.space.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 17, 2024, 02:28:12 PM
#16
I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

You CAN index DT information by running your parser under your account.   See https://bitcointalk.org/captcha_code.php  (it no longer works for my account but yours is probably fine)

Even if there are nested quotes, they are treated individually and also indexed as their own.

You will be able to search only the content, only the quotes, quotes from X user, or both the content and quotes.

You must have some insane AIish parsing going.   Many posts have broken quote html.  :(
legendary
Activity: 2758
Merit: 6830
July 17, 2024, 12:24:50 PM
#15
Personally I am against indexing quotes inside posts, since that runs the risk of corpus duplication, which gives the quoted parts more weight in the search results - meaning the posts that are quoted the most are also the most likely to be returned at the top (most likely in the form of some person's reply to it).
I've separated quotes from post content.

That way there is:

- post content
- quotes

There's also the issue with nested quotes, which is hell to deal with using a database, and even with Elasticsearch. It can lead to infinitely recursive schema/JSON before you can even parse it completely.
Even if there are nested quotes, they are treated individually and also indexed as their own.

Take this post of mine for example. There is the content (everything that is NOT a quote of another user, like this text itself) and both quotes from author NotATether you can see above.

You will be able to search only the content, only the quotes, quotes from X user, or both the content and quotes.
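To make the content/quotes split concrete, here is one possible sketch. It assumes the post is available as BBCode-style text with `[quote author=...]` tags; the real scraper may well work on rendered HTML instead, so treat the tag format and the `split_post` name as illustrative. Nested quotes come out as separate flat records, and quotes that are never closed are closed at the end of the post:

```python
import re
from typing import Dict, List, Tuple

# Assumption: quotes are marked with [quote] or [quote author=Name ...] and [/quote].
TAG = re.compile(r"\[quote(?:\s+author=([^\]\s]+)[^\]]*)?\]|\[/quote\]", re.IGNORECASE)


def split_post(body: str) -> Tuple[str, List[Dict]]:
    """Split a post into its own content and a flat list of quotes.

    Each quote is a dict with "author", "text" (excluding any nested quote)
    and "depth" (1 = top-level quote).
    """
    content: List[str] = []
    quotes: List[Dict] = []
    stack: List[Dict] = []  # currently open quotes, innermost last
    pos = 0

    def close_innermost() -> None:
        q = stack.pop()
        quotes.append(
            {"author": q["author"], "text": "".join(q["parts"]).strip(), "depth": len(stack) + 1}
        )

    for m in TAG.finditer(body):
        # Text before this tag belongs to the innermost open quote, or to the post itself.
        (stack[-1]["parts"] if stack else content).append(body[pos:m.start()])
        pos = m.end()
        if m.group(0).lower() == "[/quote]":
            if stack:  # a stray closing tag is ignored
                close_innermost()
        else:
            stack.append({"author": m.group(1), "parts": []})

    (stack[-1]["parts"] if stack else content).append(body[pos:])
    while stack:  # broken markup: close whatever is still open
        close_innermost()
    return "".join(content).strip(), quotes
```

With the quotes flattened this way, "search only the content", "search only the quotes" and "quotes from X user" become simple filters over two separate fields.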
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
July 17, 2024, 12:04:35 PM
#14
Personally I am against indexing quotes inside posts, since that runs the risk of corpus duplication
In most cases, I agree. But if the quote comes from an external website, the content can still be relevant to the search query.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 17, 2024, 11:57:20 AM
#13
Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.
Soon™ ... ;)



Personally I am against indexing quotes inside posts, since that runs the risk of corpus duplication, which gives the quoted parts more weight in the search results - meaning the posts that are quoted the most are also the most likely to be returned at the top (most likely in the form of some person's reply to it).

There's also the issue with nested quotes, which is hell to deal with using a database, and even with Elasticsearch. It can lead to infinitely recursive schema/JSON before you can even parse it completely.
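One way to sidestep both concerns, sketched here as plain Python dicts in Elasticsearch's mapping and query format: keep the post's own text in one field and store quotes as a flat `nested` list, so the schema never recurses and a content-only search gives quoted text no extra weight. The field names (`content`, `quotes`, `depth` and so on) are invented for this example and are not taken from any existing index:

```python
import json

# Hypothetical index mapping: quotes are a flat list of nested objects,
# each with its own author and text, so no recursive schema is needed.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "post_id": {"type": "long"},
            "author": {"type": "keyword"},
            "content": {"type": "text"},
            "quotes": {
                "type": "nested",
                "properties": {
                    "author": {"type": "keyword"},
                    "text": {"type": "text"},
                    "depth": {"type": "integer"},
                },
            },
        }
    }
}


def content_only_query(words: str) -> dict:
    """Match only the post's own text, ignoring anything it quotes."""
    return {"query": {"match": {"content": words}}}


def quotes_from_query(author: str, words: str) -> dict:
    """Match text quoted from a given author, at any nesting depth."""
    return {
        "query": {
            "nested": {
                "path": "quotes",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"quotes.author": author}},
                            {"match": {"quotes.text": words}},
                        ]
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(quotes_from_query("NotATether", "corpus duplication"), indent=2))
```

The original nesting can still be reconstructed from the `depth` field if it is ever needed for display.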
legendary
Activity: 2758
Merit: 6830
July 17, 2024, 11:36:44 AM
#12
Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.
Soon™ ... ;)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 17, 2024, 10:40:58 AM
#11
This thing is going to use Elasticsearch, so if I can figure out how to handle the multi-lingual text, it may support searching in different languages too.
legendary
Activity: 3584
Merit: 5243
https://merel.mobi => buy facemasks with BTC/LTC
July 17, 2024, 12:34:14 AM
#10
If you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in Google (filetype:, site:,...).

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data and put it together in a database, and then your search engine makes calls to that database. But for this, you will have to live-update that database, or at least have a cron job to add the new data every x amount of time.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack to the current search engine.

If you have another way in mind, I would love to know how it works.

It would be better if there was a way to improve SMF's search function or query the relational database directly, but I don't know if Theymos would give anybody direct access to the database or allow anybody to completely rework SMF's source code.
I don't think he would, and for good reason, of course... It would require absolute trust in the person building the search engine.

But you're right, it would be completely rebuilding bitcointalk's database, like several other members are doing as well (more or less).

Whenever I see somebody building extensions or offsite tools, or proposing changes to SMF, I can't help but wonder how epochtalk is doing, and if epochtalk would solve the problem without requiring browser plugins, offsite tools, scraping,... Don't get me wrong: the current forum software lacks several features, and I'm happy if somebody builds them (even if it's on a different domain, or requires me to install a browser plugin). I just wonder whether we'll ever switch to the new forum software.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
July 16, 2024, 01:16:47 PM
#9
it should not be looking inside quotes for keywords.

This is what I'm currently having issues with.  People break the BB quote code all the time in their posts.
legendary
Activity: 3346
Merit: 3125
July 16, 2024, 09:07:18 AM
#8
If you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in Google (filetype:, site:,...).

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data and put it together in a database, and then your search engine makes calls to that database. But for this, you will have to live-update that database, or at least have a cron job to add the new data every x amount of time.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack to the current search engine.

If you have another way in mind, I would love to know how it works.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
July 16, 2024, 07:03:39 AM
#7
For now, I am scraping topics from the forum using my bot.
If it helps, I can give you a tar.gz copy of my data (note: some posts are missing). I shared it with Ninjastic years ago, and it saves you several months of scraping. Freshly scraping will get you more recent edits though, and fewer deleted posts.

Sure, you can send me a copy by PM.

If you are scraping and parsing anyway, it would be nice if your search engine was indexing the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in Google (filetype:, site:,...).

It would be nice if I could make a query like `user:Theymos board:Bitcoin\Project_Development +wallet -knots taproot` and I would only see posts made by Theymos in the Project Development board that contained the word wallet, did not contain the word knots and hopefully contained the word taproot.
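A query string in that style could be broken down along these lines. This is only a sketch of the syntax as described in the paragraph above (`key:value` filters, `+required`, `-excluded`, everything else optional); the `parse_query` name and the output keys are invented for the example:

```python
from typing import Dict


def parse_query(query: str) -> Dict:
    """Split a search string into field filters and required/excluded/optional words."""
    parsed: Dict = {"filters": {}, "must": [], "must_not": [], "should": []}
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            parsed["must"].append(token[1:])        # word has to be present
        elif token.startswith("-") and len(token) > 1:
            parsed["must_not"].append(token[1:])    # word must be absent
        elif ":" in token and not token.startswith(":") and not token.endswith(":"):
            key, value = token.split(":", 1)
            parsed["filters"][key.lower()] = value  # e.g. user:Theymos
        else:
            parsed["should"].append(token)          # nice to have, affects ranking only
    return parsed
```

The resulting dict maps fairly directly onto a boolean search: filters and `must` narrow the result set, `must_not` excludes, and `should` only influences ordering.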

I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

My bot can also handle anonymous users.