Bitcointalk Search Project - page 3.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: Vod on July 28, 2024, 01:19:15 AM

I explained how to do this five posts ago....

^^^

Wow, you're fast :D

You mean this?

Quote from: Vod on July 20, 2024, 01:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on July 28, 2024, 01:18:15 AM

My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I explained how to do this five posts ago....

^^^

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.
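One way the proxy integration could look, as a minimal sketch rather than the scraper's actual code: a small rotator that hands out proxies round-robin while enforcing one shared minimum interval between requests, so the overall rate limit stays the same no matter how many proxies are in the pool. The class name, the proxy addresses and the one-second interval are all assumptions made for illustration.

```python
import itertools
import time


class ProxyRotator:
    """Round-robin proxy picker that enforces one global rate limit.

    Hypothetical helper: the interval applies to the scraper as a whole,
    not to each proxy, so adding proxies does not raise the request rate.
    """

    def __init__(self, proxies, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._cycle = itertools.cycle(proxies)
        self._min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def next_proxy(self):
        """Block until the rate limit allows a request, then return a proxy."""
        now = self._clock()
        if self._last is not None:
            wait = self._min_interval - (now - self._last)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last = now
        return next(self._cycle)


if __name__ == "__main__":
    # Placeholder addresses from the documentation range, not real proxies.
    rotator = ProxyRotator(
        ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
        min_interval=1.0,  # assumed limit of one request per second
    )
    for _ in range(3):
        print(rotator.next_proxy())
```

The returned address can then be handed to whatever HTTP client the scraper already uses for the next request.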

I do, however, have LoyceV's archive (thanks, Loyce), but I am not sure whether it covers posts before 2018.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NeuroticFish on July 20, 2024, 02:10:16 PM

Is there no option to ask theymos to offer, in one way or another, the list of "last" edited posts (maybe after a given date)?
I mean, the forum knows them; it only has to expose them somehow.

The only reason he couldn't do this would be if the indexing slowed down the site. I can't imagine any native SMF code searches on that key. :/

NeuroticFish

legendary

Activity: 3668

Merit: 6382

Looking for campaign manager? Contact icopress!

Quote from: NotATether on July 20, 2024, 12:03:19 PM

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with how often they posted on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. That leaves a small subset of the daily active users who might actually make an edit, and their user IDs should then be prioritized when scanning for edits and deletions.

Is there no option to ask theymos to offer, in one way or another, the list of "last" edited posts (maybe after a given date)?
I mean, the forum knows them; it only has to expose them somehow.

FatFork

legendary

Activity: 1820

Merit: 2700

Crypto Swap Exchange

Quote from: NotATether on July 20, 2024, 12:56:48 PM

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

I also think that it won't be feasible, at least not with a single scraper. As Vod said, his post history exceeds 21,000 entries. Continuously monitoring all his posts for edits would be very resource-intensive and/or time-consuming. And what about other members with even more activity, like philipma1957, BADecker, JayJuanGee, franky1, and others? We're talking hundreds of thousands of posts that would need parsing every day...

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on July 20, 2024, 12:56:48 PM

That's what I'm saying. I will have to manually go through all of a user's posts (but hey, at least I'll have their message and topic IDs this time!), but statistically I will only need to search a few users at a time, because editing is infrequent compared to posting.

Manual processes will cause this project to fail - too much data.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: Vod on July 20, 2024, 12:45:13 PM

There is no way for you to track which post I may edit. You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited. The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.

That's what I'm saying. I will have to manually go through all of a user's posts (but hey, at least I'll have their message and topic IDs this time!), but statistically I will only need to search a few users at a time, because editing is infrequent compared to posting.

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

* When the initial download is finished

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on July 20, 2024, 12:03:19 PM

so this leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized to scan for edits and deletions.

There is no way for you to track which post I may edit. You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited. The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.
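The arithmetic behind that estimate can be checked with a short sketch. It assumes one post is fetched per request and a rate limit of one request per second; both figures are assumptions for illustration, as is the 55 million total, which is the forum-wide post count mentioned elsewhere in this thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def seconds_per_post(num_posts):
    """Seconds of the day available per post if every post is rescanned daily."""
    return SECONDS_PER_DAY / num_posts


def days_to_rescan(num_posts, requests_per_second=1.0):
    """Days needed to fetch every post once at the assumed rate limit."""
    return num_posts / (requests_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    # One user with about 22k posts: roughly four seconds available per post.
    print(round(seconds_per_post(22_000), 2))
    # Every post on the forum, one request each, one request per second.
    print(round(days_to_rescan(55_000_000), 1))
```

Under those assumptions a single pass over every post takes well over a year, which is why rescanning everything on a regular basis is not an option for a single scraper.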

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: seoincorporation on July 20, 2024, 08:32:42 AM

We have seen a lot of threads with the word "Reserved" that get edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

This is going to be a live search engine, so every post is going to be kept up to date, and removed if the original post is removed as well.

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with how often they posted on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. That leaves a small subset of the daily active users who might actually make an edit, and their user IDs should then be prioritized when scanning for edits and deletions.

The number of new posts daily >>> The number of edited posts daily

*I do not currently track the "last edited" time because it is an unreliable indicator for determining whether a given post might be edited in the future.
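A rough sketch of how such a filter could look, with several assumptions called out: the `UserActivity` record and its fields are invented for illustration, the 120-day cutoff is the guess from the post above, and ranking by recent post count is just one possible way to prioritize, not a settled design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UserActivity:
    """Hypothetical per-user record kept by the scraper."""
    user_id: int
    last_post: datetime
    banned: bool
    posts_last_30_days: int


def users_to_rescan(users, now, inactive_after=timedelta(days=120)):
    """Return user IDs worth rescanning for edits, most active first.

    Banned users and accounts with no post inside the `inactive_after`
    window are dropped; the rest are ordered by recent posting volume
    (an assumed proxy for how likely they are to edit something).
    """
    candidates = [
        u for u in users
        if not u.banned and now - u.last_post <= inactive_after
    ]
    candidates.sort(key=lambda u: u.posts_last_30_days, reverse=True)
    return [u.user_id for u in candidates]


if __name__ == "__main__":
    now = datetime(2024, 7, 20)
    sample = [
        UserActivity(1, now - timedelta(days=1), False, 50),
        UserActivity(2, now - timedelta(days=2), True, 90),    # banned
        UserActivity(3, now - timedelta(days=200), False, 0),  # inactive
        UserActivity(4, now - timedelta(days=10), False, 80),
    ]
    print(users_to_rescan(sample, now))
```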

TryNinja

legendary

Activity: 2758

Merit: 6830

Quote from: seoincorporation on July 20, 2024, 08:32:42 AM

We have seen a lot of threads with the word "Reserved" that get edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

There is no easy solution for that. It's impossible to keep track of 55 million posts.

seoincorporation

legendary

Activity: 3388

Merit: 3154

Quote from: Vod on July 18, 2024, 11:01:23 PM

Quote from: seoincorporation on July 18, 2024, 10:46:24 PM

I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process. If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area.

OK, then what would happen if a user edited their post after it was recorded in the search engine database? It would be impossible to search for the new information in edited posts, and I feel like that will be a problem for this project.

We have seen a lot of threads with the word "Reserved" that get edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on July 19, 2024, 07:12:29 AM

Quote from: LoyceV on July 19, 2024, 12:56:46 AM

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball?

Yes.

Quote

Because for my daily backups I usually tar my folders without compression to make it go many times faster.

I'm using pigz, which now only uses 1% of 1 CPU core. Reading from disk is the limitation. I thought I'd do you a favour by making one file instead of giving you half a million compressed files.
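For whoever ends up reading the archive, a single compressed tarball can be processed in one sequential pass instead of being unpacked into millions of files first. The sketch below is only an illustration: the function name is hypothetical, and nothing is assumed about the archive's real file names or layout. It relies on the stream mode (`r|gz`) of Python's standard `tarfile` module.

```python
import io
import tarfile


def iter_archive_files(fileobj):
    """Yield (name, bytes) for each regular file in a gzip-compressed tar.

    Stream mode reads the archive strictly front to back, so each member
    has to be consumed before moving on to the next one.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


if __name__ == "__main__":
    # Build a tiny in-memory archive to show the call; names are made up.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as out:
        for name, data in [("a.html", b"one"), ("b.html", b"two")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            out.addfile(info, io.BytesIO(data))
    buf.seek(0)
    for name, data in iter_archive_files(buf):
        print(name, len(data))
```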

Quote

I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.

By all means: use my data for this

Only a few posts are censored by me. Other than that, the file format is pretty much the same everywhere.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on July 19, 2024, 12:56:46 AM

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball? Because for my daily backups I usually tar my folders without compression to make it go many times faster.

I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.

It would also give me time to figure out the best setup for this.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: NotATether on July 16, 2024, 07:03:39 AM

Quote from: LoyceV on July 16, 2024, 04:27:46 AM

Quote from: NotATether on July 16, 2024, 03:01:35 AM

For now, I am scraping topics from the forum using my bot.

If it helps, I can give you a tar.gz copy of my data

Sure, you can send me a copy by PM.

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: seoincorporation on July 18, 2024, 10:46:24 PM

I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process. If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area.
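As a minimal sketch of the "grab the link" step, assuming only that topic links contain `index.php?topic=<id>` as in the URLs posted in this thread: the parser below collects topic IDs from every anchor on a page. The markup of the "Newest Posts" area is not described here, so the sample HTML is invented and the function simply scans all links.

```python
import re
from html.parser import HTMLParser

# Matches the topic ID in links such as index.php?topic=5503125.0
TOPIC_RE = re.compile(r"index\.php\?topic=(\d+)")


class TopicLinkParser(HTMLParser):
    """Collects unique topic IDs from <a href="..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.topic_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = TOPIC_RE.search(href)
        if match:
            topic_id = int(match.group(1))
            if topic_id not in self.topic_ids:
                self.topic_ids.append(topic_id)


def extract_topic_ids(html):
    """Return the topic IDs linked from an HTML page."""
    parser = TopicLinkParser()
    parser.feed(html)
    return parser.topic_ids


if __name__ == "__main__":
    sample = (
        '<a href="https://bitcointalk.org/index.php?topic=5503125.0">one</a>'
        '<a href="https://bitcointalk.org/index.php?board=1.0">board</a>'
    )
    print(extract_topic_ids(sample))
```

Run against the front page every 10 seconds, the IDs not seen before are the threads to fetch next.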

seoincorporation

legendary

Activity: 3388

Merit: 3154

I'm curious about the method that you are using to get the data from each thread. There is a command on Linux called lynx, a web browser for the command line, and with it you can get either the text of a web page or its source code:

Code:

lynx --dump "https://bitcointalk.org/index.php?topic=5503125.0"

Code:

lynx --source "https://bitcointalk.org/index.php?topic=5503125.0"

You could use tools like cut and grep to get only the relevant data. Making the script would be the easy part; getting the data from 5.5 million threads will be the hard part, lol. And the fact that each thread can have multiple pages makes it a challenge.
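The multiple-pages part can be handled by generating one URL per page. The sketch below assumes that the number after the dot in `topic=5503125.0` is the offset of the first post on the page and that a page holds 20 posts; both are assumptions to verify against the live site, and the function name is made up for illustration.

```python
def topic_page_urls(topic_id, reply_count, posts_per_page=20):
    """Return the URL of every page of a topic.

    Assumes offset-based paging: page 1 ends in ".0", page 2 in ".20",
    and so on. `reply_count` excludes the opening post.
    """
    total_posts = reply_count + 1
    pages = (total_posts + posts_per_page - 1) // posts_per_page  # ceiling
    return [
        f"https://bitcointalk.org/index.php?topic={topic_id}.{page * posts_per_page}"
        for page in range(pages)
    ]


if __name__ == "__main__":
    # A topic with 45 replies has 46 posts, so three pages under these assumptions.
    for url in topic_page_urls(5503125, 45):
        print(url)
```

Each URL can then be passed to lynx or any other fetcher, one page at a time.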

Vod

legendary

Activity: 3668

Merit: 3010

Licking my boob since 1970

Quote from: NotATether on July 18, 2024, 06:23:55 AM

That's a security risk though, since it would require me to also store my password in plain text.

Never store passwords in plain text. Use a secrets manager, like AWS Secrets Manager, to supply your password at runtime.
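A minimal sketch of supplying the password at runtime, using only the standard library: an environment variable stands in for the secrets manager (which would typically inject the value when the process starts), with an interactive prompt as a fallback. The variable name `FORUM_PASSWORD` and the function name are hypothetical.

```python
import getpass
import os


def load_password(env_var="FORUM_PASSWORD", prompt=getpass.getpass):
    """Return the password without ever reading it from a file on disk.

    Looks in the environment first (where a secrets manager or the shell
    can place it for this process only), then falls back to asking the
    user without echoing the input.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    return prompt("Forum password: ")


if __name__ == "__main__":
    password = load_password()
    # Never print or log the secret itself.
    print("password loaded:", bool(password))
```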

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: TryNinja on July 18, 2024, 12:34:56 PM

If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

I could try checking the next sequence of 100 posts or so, in order to check for new posts, since it's extremely unlikely that they were all deleted.
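Roughly, the look-ahead could work like the sketch below. It assumes IDs are handed out sequentially and takes the existence check as a callable, so no particular HTTP client or URL is implied; the function name and the default window of 100 are illustrative.

```python
def find_new_ids(last_known_id, exists, window=100):
    """Probe IDs after `last_known_id` and return the ones that exist.

    Probing stops only after `window` consecutive IDs are missing, so a
    few deleted items in a row do not end the scan early. Every hit moves
    the window forward.
    """
    found = []
    frontier = last_known_id          # highest ID known to exist
    candidate = last_known_id + 1
    while candidate <= frontier + window:
        if exists(candidate):
            found.append(candidate)
            frontier = candidate
        candidate += 1
    return found


if __name__ == "__main__":
    # Stand-in for a real check (for example, requesting the page and
    # seeing whether the forum returns the item or an error).
    live_ids = {101, 103, 150}
    print(find_new_ids(100, lambda i: i in live_ids))
```

The cost is `window` wasted probes at the end of every scan, so the window size is a trade-off between robustness to deletions and the number of requests spent on IDs that do not exist yet.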

TryNinja

legendary

Activity: 2758

Merit: 6830

Quote from: seoincorporation on July 18, 2024, 09:09:58 AM

The problem is knowing when to index the new threads: without RSS, we have to "ping" each board to know if there is new content.

You don't have to do that.

If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

For my scraper, I used more of a workaround: checking the URL of each of the most recent posts to see whether its thread has zero replies, which would imply the post is an OP.
