
Topic: Ninjastic.space - BitcoinTalk Post/Address archive + API - page 25. (Read 15134 times)

legendary
Activity: 2758
Merit: 6830
An option not to include the child boards of the selected board would make it better Smiley
Noted!



Some suggestions for finding addresses by author:
- When listing ETH addresses, it also displays ETH transactions (TXIDs), perhaps the results should be filtered by string length
Hmm.. I don't know. I could technically use a blockchain's API for that. Will see. Cheesy

- You should ignore the results that are within the quote tags (if possible)
It's on my TODO list.

- Address grouping should not be case sensitive
I didn't think of that. But I'll leave it that way for now. At least BTC addresses are case sensitive, so it shouldn't be an issue.
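
A minimal sketch of the length filter and the case-insensitive grouping suggested above, for ETH only. It assumes hex strings are matched with a leading "0x" (the bot's real pattern isn't shown in this thread) and the function names are made up. An ETH address has 40 hex digits after the prefix and a transaction hash has 64, so length alone separates them. ETH addresses are plain hex and the mixed case is only a checksum, so lowercasing before grouping should be safe there.

```python
import re
from collections import defaultdict

# Assumption: candidates are 0x-prefixed hex strings; the bot's real regex is unknown.
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")


def classify_hex(token: str) -> str:
    """Classify a 0x-prefixed hex string by the number of hex digits."""
    digits = len(token) - 2  # drop the "0x" prefix
    if digits == 40:
        return "eth_address"  # 20 bytes
    if digits == 64:
        return "eth_txid"  # 32 bytes
    return "unknown"


def group_eth_addresses(posts):
    """Map each lowercased ETH address to the IDs of the posts mentioning it.

    `posts` is an iterable of (post_id, text) pairs (hypothetical shape).
    """
    groups = defaultdict(set)
    for post_id, text in posts:
        for token in HEX_RE.findall(text):
            if classify_hex(token) == "eth_address":
                groups[token.lower()].add(post_id)
    return dict(groups)
```

BTC addresses in the older base58 formats are case sensitive, so they would be grouped as-is, as the reply above says.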



The field "Known Addresses" is a very inaccurate name imo. I saw about 10 addresses in my profile, and none of them were mine. Most of them were quotes or addresses that were being discussed.
The bot doesn't know that. "known" doesn't necessarily mean you own it. It's just the ones the bot has found in any of your (scraped) posts. There could be others, so those are the ones he knows about. Cheesy

I can change it to "Mentioned Addresses" to make that clear.
legendary
Activity: 2212
Merit: 5622
Non-custodial BTC Wallet
- Find addresses by author
What about finding every known address a user has posted (BTC and ETH only) by searching for their username or checking their user page? Now you can!
Smiley

Beautiful update.

I liked the design.

The field "Known Addresses" is a very inaccurate name imo. I saw about 10 addresses in my profile, and none of them were mine. Most of them were quotes or addresses that were being discussed.

Maybe you could use a different name, such as "possible addresses" or "mentioned addresses"
legendary
Activity: 1568
Merit: 2581
Top Crypto Casino
- Find addresses by author
What about finding every known address a user has posted (BTC and ETH only) by searching for their username or checking their user page? Now you can!


This is a pretty nice update! I haven't explored everything yet.

Some suggestions for finding addresses by author:
- When listing ETH addresses, it also displays ETH transactions (TXIDs), perhaps the results should be filtered by string length
- You should ignore the results that are within the quote tags (if possible)
- Address grouping should not be case sensitive

Otherwise, great job!
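
For the quote-tag suggestion, here is a rough sketch of one way to drop quoted text before looking for addresses. It assumes the post body is available as BBCode with [quote ...]...[/quote] tags. If the bot only stores rendered HTML, the same idea would apply to the quote containers instead. The function name is hypothetical.

```python
import re

# Assumption: quotes appear as BBCode [quote ...] ... [/quote] pairs.
# The pattern only matches a quote that contains no other opening quote tag,
# i.e. the innermost one.
INNERMOST_QUOTE = re.compile(
    r"\[quote[^\]]*\](?:(?!\[quote).)*?\[/quote\]",
    re.IGNORECASE | re.DOTALL,
)


def strip_quotes(body: str) -> str:
    """Remove quoted blocks, innermost first, so nested quotes are handled."""
    previous = None
    while previous != body:
        previous = body
        body = INNERMOST_QUOTE.sub("", body)
    return body
```

Running the address search on strip_quotes(body) instead of body would leave only what the author wrote themselves.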
legendary
Activity: 2758
Merit: 6830
Finally, the new update!

- Board filter
You can now search posts by board. It includes their child boards, so you can select "Wallet software" to search on "Armory", "Electrum", "Hardware Wallet", etc., or go straight to a child board to search only the "Electrum" or "Bounties (Altcoins)" board. Caveat: a very small number of posts have an unknown board. So, if you search by board, there is a small chance of missing a few posts.

- Archive updated
The archive went from ~42m posts without a title or exact date to just a couple thousand without that data! This means that you can now search by date range without getting a bunch of random posts that shouldn't be there (or missing most of them).

- New search backend
Searching is now a LOT better. It's faster, more accurate, returns more data and doesn't crash my server!

- User stats page
You can now visit the new page (http://ninjastic.space/user/TryNinja) to check some data about a user: most active boards, a graph of posts made in the last 7 and 30 days, and their known addresses! New data will come later (with your suggestions).

- Find addresses by author
What about finding every known address a user has posted (BTC and ETH only) by searching for their username or checking their user page? Now you can!

- Deleted post in the post edit history
The bot will detect if a post was deleted less than 5 minutes after it was made/scraped and will mark it as so in the "post edit history" card.

Tip: press CTRL + F5 on the website if you don't see the changes. Tell me what you think about it. Smiley
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Here are the timestamps:

https://bpip.org/timestamps_202009130840.zip (~160MB compressed, ~1.3GB uncompressed)

As I mentioned earlier, it's in UTC, 24h format. Here's a sample:

Code:
28,2009-11-22 18:04:28
29,2009-11-22 18:31:44
30,2009-11-22 18:32:00
31,2009-11-22 18:34:21
32,2009-11-22 18:35:15
33,2009-11-25 18:15:57
34,2009-11-25 18:17:23
36,2009-11-27 17:17:22
37,2009-11-27 17:27:09
38,2009-11-27 22:48:39
40,2009-12-09 05:34:46
41,2009-12-09 18:45:10
42,2009-12-09 19:25:31
43,2009-12-10 13:13:51
44,2009-12-10 14:00:17
45,2009-12-10 19:31:49
46,2009-12-10 20:49:02
47,2009-12-11 04:59:19
48,2009-12-11 17:20:29
49,2009-12-11 17:58:57
50,2009-12-11 19:27:55
51,2009-12-12 06:34:21
52,2009-12-12 13:08:17
53,2009-12-12 14:11:37
54,2009-12-12 17:52:44
55,2009-12-12 18:17:10
56,2009-12-12 18:23:59
57,2009-12-12 18:47:45
58,2009-12-12 20:46:14
59,2009-12-13 06:44:04
60,2009-12-13 06:46:30
61,2009-12-13 06:50:05
62,2009-12-13 16:51:25
63,2009-12-14 09:29:44
64,2009-12-14 13:09:48
65,2009-12-14 14:46:37
66,2009-12-14 15:01:39
67,2009-12-14 17:15:56
68,2009-12-15 05:21:09
69,2009-12-15 05:30:53
70,2009-12-15 20:37:32
71,2009-12-15 21:14:13
72,2009-12-16 15:49:23
73,2009-12-16 22:45:36
74,2009-12-17 11:36:36
75,2009-12-17 13:21:49
76,2009-12-17 13:23:43
77,2009-12-17 18:38:06
78,2009-12-18 15:11:53
79,2009-12-18 17:37:48
81,2009-12-30 01:40:50
82,2009-12-30 15:28:04
83,2010-01-01 18:09:58
84,2010-01-05 01:20:06
85,2010-01-05 20:00:46
86,2010-01-07 06:14:17
87,2010-01-11 16:13:20
88,2010-01-12 19:31:22
90,2010-01-13 04:13:37
91,2010-01-13 06:24:54
92,2010-01-13 07:45:57
93,2010-01-13 08:22:56
94,2010-01-13 17:12:16
95,2010-01-13 17:44:57
96,2010-01-13 19:08:19
97,2010-01-14 20:17:20
100,2010-01-15 09:42:18
101,2010-01-16 10:39:25
102,2010-01-16 14:27:15
103,2010-01-16 23:16:56
104,2010-01-16 23:22:55
105,2010-01-16 23:44:35
106,2010-01-16 23:45:22
107,2010-01-17 10:34:44
108,2010-01-17 22:55:31
109,2010-01-18 04:49:43
110,2010-01-18 11:06:35
111,2010-01-19 08:06:15
112,2010-01-20 20:07:15
113,2010-01-20 22:05:28
114,2010-01-21 17:30:39
115,2010-01-22 10:37:09
116,2010-01-24 02:48:37
117,2010-01-24 08:27:13
118,2010-01-24 09:52:48
119,2010-01-24 20:49:47
120,2010-01-24 20:52:59
121,2010-01-24 20:53:36
122,2010-01-24 22:48:30
123,2010-01-24 23:31:48
124,2010-01-25 02:42:13
125,2010-01-25 03:11:01
126,2010-01-25 04:06:03
127,2010-01-25 05:13:23
128,2010-01-25 05:28:39
129,2010-01-25 07:40:02
130,2010-01-25 16:30:50
131,2010-01-25 17:36:10
132,2010-01-25 19:25:29
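
A small sketch for loading rows in the format shown above, assuming the first column is the post ID and the second is the UTC timestamp. The function name is made up for illustration.

```python
import csv
import io
from datetime import datetime, timezone


def parse_timestamps(lines):
    """Yield (post_id, timezone-aware UTC datetime) for rows like '28,2009-11-22 18:04:28'."""
    for post_id, stamp in csv.reader(lines):
        naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # The file is stated to be in UTC, so attach that zone explicitly.
        yield int(post_id), naive.replace(tzinfo=timezone.utc)


# Two rows copied from the sample above.
sample = io.StringIO("28,2009-11-22 18:04:28\n29,2009-11-22 18:31:44\n")
rows = list(parse_timestamps(sample))
```

Attaching the zone explicitly avoids the values being read as local time later on.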
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
There is a much better way to scrape recent posts without hitting that page every few seconds.

I have a feeling Theymos may have to eventually restrict parsing if everyone keeps doing it, so I'm putting an alternative up on github.  Smiley
legendary
Activity: 1568
Merit: 2581
Top Crypto Casino
If you are looking for every post, you can do this:
[...]

How is this relevant for this thread? Or did I miss something?
copper member
Activity: 1610
Merit: 1899
Amazon Prime Member #7
If you are looking for every post, you can do this:
import time

total_posts = 5273824  # used below as the upper bound for topic IDs
for x in range(1, total_posts + 1):
    page = 0
    has_next = True
    while has_next:
        url = 'https://bitcointalk.org/index.php?topic={}.{}'.format(x, page)
        # go to url
        # if not available to you: break
        # scrape board information
        # scrape each post via loop
        # I believe there are two classes of posts - scrape both classes, you will insert posts into your DB out of order, but this is okay
        page += 20
        # there is a middletext td class
        # there is a prevnext span class
        next_page = 'https://bitcointalk.org/index.php?topic={}.{}'.format(x, page)
        # set has_next to True only if you can find a link equal to next_page
        has_next = False
        time.sleep(1)  # sleep for 1 second

in parallel to the above, and starting at the same time the above starts:
scrape the recent posts page, and add each post to your DB. Here you can scrape the board each thread is on, via adding it if it doesn't exist in your DB, and updating it if it does.

The above will capture every post that you have access to. The first loop will take quite some time, and a thread being moved to a different board while you are in the process of scraping all posts will not cause you to miss any posts.
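
The "add it if it doesn't exist, update it if it does" step for a thread's board could look roughly like this. It is a sketch using SQLite with a made-up table and column names, not a description of anyone's actual database, and the IDs in the demo are only placeholders.

```python
import sqlite3


def record_topic_board(conn, topic_id, board_id):
    """Insert the topic's board, or overwrite it if the topic is already stored."""
    conn.execute(
        "INSERT OR REPLACE INTO topic_board (topic_id, board_id) VALUES (?, ?)",
        (topic_id, board_id),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topic_board (topic_id INTEGER PRIMARY KEY, board_id INTEGER)"
)
record_topic_board(conn, 5273824, 24)  # first sighting: row is inserted
record_topic_board(conn, 5273824, 1)   # thread was moved: board is updated
```

Because the latest sighting wins, a thread moved while the long loop is still running simply ends up with its newest board.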
legendary
Activity: 2758
Merit: 6830
~
Thank you. I think I got it.

I went to https://bitcointalk.org/sitemap.php?t=b, grabbed every board url, scraped each one of them and linked them to their closest parent. There are only 248 boards, so it was pretty quick.

Code:
"board_id","name","parent_id"
1,   Bitcoin Discussion,
4,   Bitcoin Technical Support,
14,  Mining,
40,  Mining support,   14
42,  Mining software (miners),   14

If you want it: https://pastebin.com/raw/xhudKFZ8
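
With a listing like that, a board filter that includes child boards can be sketched as a walk over the parent links. This is only an illustration: the rows are the five shown above with the padding spaces removed, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# The five sample rows from the listing above, whitespace normalized.
SAMPLE = '''"board_id","name","parent_id"
1,Bitcoin Discussion,
4,Bitcoin Technical Support,
14,Mining,
40,Mining support,14
42,Mining software (miners),14
'''


def load_children(csv_text):
    """Build a parent_id -> [child board_id] map from the board listing."""
    children = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["parent_id"].strip():
            children[int(row["parent_id"])].append(int(row["board_id"]))
    return children


def board_with_descendants(board_id, children):
    """Return the board plus every board nested under it, at any depth."""
    found, stack = set(), [board_id]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found
```

Searching "Mining" would then mean filtering posts on the whole returned set, while searching a child board uses a set with just that one ID.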
legendary
Activity: 3654
Merit: 8909
https://bpip.org
In this case, you can send me the boards right away so I can figure out how to do that. Cheesy

Thanks!

Here you go:

https://bpip.org/boards_202009112042.zip (~100MB compressed, ~600MB uncompressed)

You may be able to get board details from here:

https://bitcointalk.org/index.php?action=search

It's a bit messy but it has all boards listed on one page. Otherwise you'd have to recursively scrape multiple pages starting from the front page.

Another option - if you are currently scraping recent posts including the full path (with board names and IDs) then you can extract the hierarchy from that data. It might not be complete though. Some boards that are rarely posted in might need to be added manually.
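
A rough sketch of that last option, assuming each scraped post carries its board path as an ordered list from the outermost board to the innermost one. Board names are used as keys here only to keep the example short. Using IDs would be safer if two boards share a name.

```python
def hierarchy_from_paths(paths):
    """Map each board name to its direct parent (None for the first board in a path)."""
    parent_of = {}
    for path in paths:
        parent = None
        for board in path:
            # Keep the first parent seen for a board; later paths should agree.
            parent_of.setdefault(board, parent)
            parent = board
    return parent_of


# Paths built from board names mentioned in this thread.
paths = [
    ["Other", "Meta"],
    ["Mining", "Mining support"],
    ["Mining", "Mining software (miners)"],
]
tree = hierarchy_from_paths(paths)
```

Boards that never show up in the scraped paths are missing from the result, which is the gap mentioned above.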
legendary
Activity: 2758
Merit: 6830
I have only the ID of the direct "parent" board (24 in your example). You would need to scrape the board hierarchy if you need the full path and the names of the boards.

Also that timestamp stuff will take me a day or two so if you want boards sooner - let me know, I can send it separately.
In this case, you can send me the boards right away so I can figure out how to do that. Cheesy

Thanks!
legendary
Activity: 3654
Merit: 8909
https://bpip.org
I do.
Would you be able to send them to me in this format (along with the post date)?

Code:
postid, date, boards
55179038, "2020-04-13 12:03:00", "{Other, Meta}"

I have only the ID of the direct "parent" board (24 in your example). You would need to scrape the board hierarchy if you need the full path and the names of the boards.

Also that timestamp stuff will take me a day or two so if you want boards sooner - let me know, I can send it separately.
legendary
Activity: 2758
Merit: 6830
I do.
Would you be able to send them to me in this format (along with the post date)?

Code:
postid, date, boards
55179038, "2020-04-13 12:03:00", "{Other, Meta}"
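
For reference, a row in that format can be read with the standard csv module. This is a sketch that assumes board names never contain a comma, and the function name is made up.

```python
import csv


def parse_row(line):
    """Split a row like: 55179038, "2020-04-13 12:03:00", "{Other, Meta}"."""
    # skipinitialspace lets the quoted fields start after the ", " separator.
    post_id, date, boards = next(csv.reader([line], skipinitialspace=True))
    path = [name.strip() for name in boards.strip("{}").split(",")]
    return int(post_id), date, path
```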

One of the things I had wanted to do with BPIP was break down posts per hour per section of the forum. People could see when the best time to post would be. Obviously this can be done easily using the infrastructure you have set up.
That sounds like an easy one. I will add it!
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Would you also have the boards of the posts? The old archive also doesn't contain them. Cheesy

I do.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
One of the things I had wanted to do with BPIP was break down posts per hour per section of the forum. People could see when the best time to post would be. Obviously this can be done easily using the infrastructure you have set up.
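
A minimal sketch of that breakdown, assuming each post is available as a (board name, UTC timestamp string) pair. The rows in the demo are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def posts_per_hour(posts):
    """Count posts per (board, hour of day); timestamps are assumed to be UTC."""
    counts = Counter()
    for board, stamp in posts:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour
        counts[(board, hour)] += 1
    return counts


# Made-up rows, for illustration only.
demo = [
    ("Meta", "2020-04-13 12:03:00"),
    ("Meta", "2020-04-13 12:40:10"),
    ("Mining", "2020-04-13 03:15:00"),
]
hourly = posts_per_hour(demo)
```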

legendary
Activity: 2758
Merit: 6830
Quick update on this: the timezone mess is messier than I thought so it will take some time to sort through it. Basically some posts got scraped with +0200 instead of UTC but I don't know which ones, so I'll probably need to scrape some "checkpoints" and find posts between them that are "time traveling" (e.g. created later than the next post).
Would you also have the boards of the posts? The old archive also doesn't contain them. Cheesy

I finished setting up the new database and already have some cool new features to announce. But I'm waiting for the timestamps and potentially the boards so I can index the data.

Here is a sneak peek of one of them (WIP): https://talkimg.com/images/2023/05/14/blobf707e32c89df6b5f.png
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Sure. I need to double-check a few things first. At one point I had some issues with timezones so I'll verify if I need to make any adjustments. It will all be in UTC once it's ready.

Quick update on this: the timezone mess is messier than I thought so it will take some time to sort through it. Basically some posts got scraped with +0200 instead of UTC but I don't know which ones, so I'll probably need to scrape some "checkpoints" and find posts between them that are "time traveling" (e.g. created later than the next post).
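
The "time traveling" check could be sketched like this, assuming post IDs increase in creation order and the posts are sorted by ID. The function name and the demo rows are made up.

```python
from datetime import datetime, timedelta


def find_time_travelers(posts):
    """Return IDs of posts stamped later than the post that follows them.

    `posts` is a list of (post_id, datetime) pairs sorted by post_id.
    """
    suspects = []
    for (post_id, stamp), (_, next_stamp) in zip(posts, posts[1:]):
        if stamp > next_stamp:
            suspects.append(post_id)
    return suspects


# Made-up rows: post 2 was stored with a +0200 wall-clock time instead of UTC.
base = datetime(2020, 4, 13, 12, 0, 0)
posts = [
    (1, base),
    (2, base + timedelta(hours=2, minutes=1)),
    (3, base + timedelta(minutes=2)),
    (4, base + timedelta(minutes=3)),
]
suspects = find_time_travelers(posts)
```

A run of consecutive shifted posts is consistent with itself, so only the last post of the run is flagged. That would be why known-good checkpoints are needed to find where each run starts.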
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
How about a blockchain search?

4 values - min,max number of bitcoins transferred.   Start,end date to search.
Blockchair has this data. It might take more disk space than all Bitcointalk posts.

I have several topics based on it already (but I don't do databases):
legendary
Activity: 1568
Merit: 2581
Top Crypto Casino
I could also maybe implement some kind of authentication in the future.
I think you should. Otherwise you are just asking for a DoS attack Wink

Lack of authentication doesn't mean there are no other DDoS mitigation measures implemented. Just saying... Wink

Btw, you messed the quotes up, that was TryNinja's quote.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
Let me know and I'll implement them if possible.

How about a blockchain search?

4 values - min,max number of bitcoins transferred.   Start,end date to search.
For example - I want to search for any transfers between 450-500 bitcoin between Sep 1 and Oct 31 2015.

If you integrate the crypto price in the search, I could also search for transfers of $40-$50 for example.
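
The four-value search described above amounts to two inclusive range checks per transfer. A sketch over an in-memory list follows. The record type, the field names and the sample rows are all made up, and a real implementation would query an indexed store and use integer satoshi amounts rather than floats.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Transfer:
    txid: str
    amount_btc: float
    day: date


def search_transfers(transfers, min_btc, max_btc, start, end):
    """Keep transfers whose amount and date both fall inside the inclusive ranges."""
    return [
        t
        for t in transfers
        if min_btc <= t.amount_btc <= max_btc and start <= t.day <= end
    ]


# Made-up records, for illustration only.
demo = [
    Transfer("tx-a", 475.0, date(2015, 9, 15)),
    Transfer("tx-b", 475.0, date(2015, 11, 2)),
    Transfer("tx-c", 12.5, date(2015, 10, 1)),
]
hits = search_transfers(demo, 450, 500, date(2015, 9, 1), date(2015, 10, 31))
```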

