
Topic: Ninjastic.space - BitcoinTalk Post/Address archive + API - page 28. (Read 16824 times)

legendary
Activity: 2758
Merit: 6830
Finally, the new update!

- Board filter
You can now search posts by board. The filter includes child boards, so you can select "Wallet software" to search "Armory", "Electrum", "Hardware Wallet", etc., or go straight to a child board to search only the "Electrum" or "Bounties (Altcoins)" board. Caveat: a very small number of posts have an unknown board, so if you search by board, there is a small chance of missing a few posts.

- Archive updated
The archive went from ~42m posts without a title or exact date to just a couple thousand without that data! This means that you can now search by date range without getting a bunch of random posts that shouldn't be there (or missing most of them).

- New search backend
Searching is now a LOT better. It's faster, more accurate, returns more data and doesn't crash my server!

- User stats page
You can now visit the new page (http://ninjastic.space/user/TryNinja) to check some data about a user: most active boards, a graph of posts made in the last 7 and 30 days, and their known addresses! New data will come later (with your suggestions).

- Find addresses by author
What about finding every known address a user has posted (BTC and ETH only) by searching for their username or checking their user page? Now you can!

- Deleted post in the post edit history
The bot will detect if a post was deleted less than 5 minutes after it was made/scraped and will mark it as such in the "post edit history" card.

Tip: press CTRL + F5 on the website if you don't see the changes. Tell me what you think about it. Smiley
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Here are the timestamps:

https://bpip.org/timestamps_202009130840.zip (~160MB compressed, ~1.3GB uncompressed)

As I mentioned earlier, it's in UTC, 24h format. Here's a sample:

Code:
28,2009-11-22 18:04:28
29,2009-11-22 18:31:44
30,2009-11-22 18:32:00
31,2009-11-22 18:34:21
32,2009-11-22 18:35:15
33,2009-11-25 18:15:57
34,2009-11-25 18:17:23
36,2009-11-27 17:17:22
37,2009-11-27 17:27:09
38,2009-11-27 22:48:39
40,2009-12-09 05:34:46
41,2009-12-09 18:45:10
42,2009-12-09 19:25:31
43,2009-12-10 13:13:51
44,2009-12-10 14:00:17
45,2009-12-10 19:31:49
46,2009-12-10 20:49:02
47,2009-12-11 04:59:19
48,2009-12-11 17:20:29
49,2009-12-11 17:58:57
50,2009-12-11 19:27:55
51,2009-12-12 06:34:21
52,2009-12-12 13:08:17
53,2009-12-12 14:11:37
54,2009-12-12 17:52:44
55,2009-12-12 18:17:10
56,2009-12-12 18:23:59
57,2009-12-12 18:47:45
58,2009-12-12 20:46:14
59,2009-12-13 06:44:04
60,2009-12-13 06:46:30
61,2009-12-13 06:50:05
62,2009-12-13 16:51:25
63,2009-12-14 09:29:44
64,2009-12-14 13:09:48
65,2009-12-14 14:46:37
66,2009-12-14 15:01:39
67,2009-12-14 17:15:56
68,2009-12-15 05:21:09
69,2009-12-15 05:30:53
70,2009-12-15 20:37:32
71,2009-12-15 21:14:13
72,2009-12-16 15:49:23
73,2009-12-16 22:45:36
74,2009-12-17 11:36:36
75,2009-12-17 13:21:49
76,2009-12-17 13:23:43
77,2009-12-17 18:38:06
78,2009-12-18 15:11:53
79,2009-12-18 17:37:48
81,2009-12-30 01:40:50
82,2009-12-30 15:28:04
83,2010-01-01 18:09:58
84,2010-01-05 01:20:06
85,2010-01-05 20:00:46
86,2010-01-07 06:14:17
87,2010-01-11 16:13:20
88,2010-01-12 19:31:22
90,2010-01-13 04:13:37
91,2010-01-13 06:24:54
92,2010-01-13 07:45:57
93,2010-01-13 08:22:56
94,2010-01-13 17:12:16
95,2010-01-13 17:44:57
96,2010-01-13 19:08:19
97,2010-01-14 20:17:20
100,2010-01-15 09:42:18
101,2010-01-16 10:39:25
102,2010-01-16 14:27:15
103,2010-01-16 23:16:56
104,2010-01-16 23:22:55
105,2010-01-16 23:44:35
106,2010-01-16 23:45:22
107,2010-01-17 10:34:44
108,2010-01-17 22:55:31
109,2010-01-18 04:49:43
110,2010-01-18 11:06:35
111,2010-01-19 08:06:15
112,2010-01-20 20:07:15
113,2010-01-20 22:05:28
114,2010-01-21 17:30:39
115,2010-01-22 10:37:09
116,2010-01-24 02:48:37
117,2010-01-24 08:27:13
118,2010-01-24 09:52:48
119,2010-01-24 20:49:47
120,2010-01-24 20:52:59
121,2010-01-24 20:53:36
122,2010-01-24 22:48:30
123,2010-01-24 23:31:48
124,2010-01-25 02:42:13
125,2010-01-25 03:11:01
126,2010-01-25 04:06:03
127,2010-01-25 05:13:23
128,2010-01-25 05:28:39
129,2010-01-25 07:40:02
130,2010-01-25 16:30:50
131,2010-01-25 17:36:10
132,2010-01-25 19:25:29
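
A minimal sketch of reading this file, assuming every line has the `post_id,timestamp` shape shown in the sample and that all timestamps are UTC as stated. The helper name `parse_timestamp_rows` is hypothetical and not part of the archive.

```python
import csv
import io
from datetime import datetime, timezone


def parse_timestamp_rows(text):
    """Yield (post_id, timezone-aware UTC datetime) from 'id,YYYY-MM-DD HH:MM:SS' lines."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        post_id = int(row[0])
        # The file has no offset, so attach UTC explicitly (assumption from the post above).
        posted_at = datetime.strptime(row[1], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        yield post_id, posted_at


# First three rows of the sample above.
sample = """28,2009-11-22 18:04:28
29,2009-11-22 18:31:44
30,2009-11-22 18:32:00
"""
rows = list(parse_timestamp_rows(sample))
print(rows[0][1].isoformat())  # 2009-11-22T18:04:28+00:00
```

For the full ~1.3GB file it would probably be better to pass an open file object to `csv.reader` and process rows one at a time instead of building a list.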
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
There is a much better way to scrape recent posts without hitting that page every few seconds.

I have a feeling Theymos may have to eventually restrict parsing if everyone keeps doing it, so I'm putting an alternative up on github.  Smiley
legendary
Activity: 1820
Merit: 2700
Crypto Swap Exchange
If you are looking for every post, you can do this:
[...]

How is this relevant for this thread? Or did I miss something?
copper member
Activity: 1666
Merit: 1901
Amazon Prime Member #7
If you are looking for every post, you can do this:
import time

total_posts = 5273824  # used as the upper bound for topic IDs
for x in range(1, total_posts + 1):
    page = 0
    while True:
        # go to 'https://bitcointalk.org/index.php?topic={}.{}'.format(x, page)
        # if not available to you: break
        # scrape board information
        # scrape each post via loop
        # I believe there are two classes of posts - scrape both classes; you will insert posts into your DB out of order, but this is okay
        page += 20
        # there is a middletext td class
        # there is a prevnext span class
        next_page = 'https://bitcointalk.org/index.php?topic={}.{}'.format(x, page)
        time.sleep(1)  # sleep for 1 second
        # if you can find a link equal to next_page, go to that page, else stop
        found_next_page = False  # placeholder: set this from the scraped page
        if not found_next_page:
            break

in parallel to the above, and starting at the same time the above starts:
scrape the recent posts page, and add each post to your DB. Here you can scrape the board each thread is on, adding it if it doesn't exist in your DB, and updating it if it does.

The above will capture every post that you have access to. The first loop will take quite some time, and a thread being moved to a different board while you are in the process of scraping all posts will not cause you to miss any posts.
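
The "add it if it doesn't exist, update it if it does" step could look something like this. It is only a sketch: the `topics` table, its columns, and the `record_topic_board` helper are hypothetical, and SQLite is used just to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema: one row per topic, holding the board it was last seen on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, board_id INTEGER NOT NULL)")


def record_topic_board(conn, topic_id, board_id):
    """Insert the topic if it is new, otherwise update its board (e.g. after a move)."""
    cur = conn.execute(
        "UPDATE topics SET board_id = ? WHERE topic_id = ?", (board_id, topic_id)
    )
    if cur.rowcount == 0:  # nothing updated, so the topic is not in the DB yet
        conn.execute(
            "INSERT INTO topics (topic_id, board_id) VALUES (?, ?)", (topic_id, board_id)
        )
    conn.commit()


# A made-up topic first seen on board 1, then seen again on board 14 after a move.
record_topic_board(conn, 123, 1)
record_topic_board(conn, 123, 14)
print(conn.execute("SELECT topic_id, board_id FROM topics").fetchall())  # [(123, 14)]
```

Because both scrapers write through the same function, whichever one sees the thread last decides the stored board.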
legendary
Activity: 2758
Merit: 6830
~
Thank you. I think I got it.

I went to https://bitcointalk.org/sitemap.php?t=b, grabbed every board url, scraped each one of them and linked them to their closest parent. There are only 248 boards, so it was pretty quick.

Code:
"board_id","name","parent_id"
1,   Bitcoin Discussion,
4,   Bitcoin Technical Support,
14,  Mining,
40,  Mining support,   14
42,  Mining software (miners),   14

If you want it: https://pastebin.com/raw/xhudKFZ8
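
With the parent links in place, a board filter that also covers child boards could be resolved roughly like this. This is a sketch, not the site's actual code: the rows are the five from the sample above, and `board_with_descendants` is a hypothetical helper.

```python
from collections import defaultdict

# (board_id, name, parent_id) rows from the sample above; None means no parent.
BOARDS = [
    (1, "Bitcoin Discussion", None),
    (4, "Bitcoin Technical Support", None),
    (14, "Mining", None),
    (40, "Mining support", 14),
    (42, "Mining software (miners)", 14),
]


def board_with_descendants(boards, root_id):
    """Return root_id plus the IDs of all boards below it, at any depth."""
    children = defaultdict(list)
    for board_id, _name, parent_id in boards:
        if parent_id is not None:
            children[parent_id].append(board_id)
    result, stack = [], [root_id]
    while stack:  # iterative depth-first walk down the tree
        current = stack.pop()
        result.append(current)
        stack.extend(children[current])
    return sorted(result)


print(board_with_descendants(BOARDS, 14))  # [14, 40, 42]
print(board_with_descendants(BOARDS, 40))  # [40]
```

Searching on "Mining" would then filter on all three IDs, while picking "Mining support" directly filters on just that board.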
legendary
Activity: 3654
Merit: 8909
https://bpip.org
In this case, you can send me the boards right away so I can figure out how to do that. Cheesy

Thanks!

Here you go:

https://bpip.org/boards_202009112042.zip (~100MB compressed, ~600MB uncompressed)

You may be able to get board details from here:

https://bitcointalk.org/index.php?action=search

It's a bit messy but it has all boards listed on one page. Otherwise you'd have to recursively scrape multiple pages starting from the front page.

Another option - if you are currently scraping recent posts including the full path (with board names and IDs) then you can extract the hierarchy from that data. It might not be complete though. Some boards that are rarely posted in might need to be added manually.
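
A rough sketch of that last option, assuming each scraped post carries its path as an ordered list of `(board_id, name)` pairs from the top-level board down. The input shape and the `hierarchy_from_paths` name are assumptions for illustration only.

```python
def hierarchy_from_paths(paths):
    """Map board_id -> (name, parent_id) from paths of (board_id, name) pairs."""
    boards = {}
    for path in paths:
        parent_id = None
        for board_id, name in path:
            # Keep the first parent seen for each board; later paths only add new boards.
            boards.setdefault(board_id, (name, parent_id))
            parent_id = board_id
    return boards


# Paths as they might be scraped from three posts (IDs and names from the board list in this thread).
paths = [
    [(14, "Mining"), (40, "Mining support")],
    [(14, "Mining"), (42, "Mining software (miners)")],
    [(1, "Bitcoin Discussion")],
]
boards = hierarchy_from_paths(paths)
print(boards[40])  # ('Mining support', 14)
```

Boards that never show up in a scraped path stay missing from the result, which is the incompleteness mentioned above.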
legendary
Activity: 2758
Merit: 6830
I have only the ID of the direct "parent" board (24 in your example). You would need to scrape the board hierarchy if you need the full path and the names of the boards.

Also that timestamp stuff will take me a day or two so if you want boards sooner - let me know, I can send it separately.
In this case, you can send me the boards right away so I can figure out how to do that. Cheesy

Thanks!
legendary
Activity: 3654
Merit: 8909
https://bpip.org
I do.
Would you be able to send them to me in this format (along with the post date)?

Code:
postid, date, boards
55179038, "2020-04-13 12:03:00", "{Other, Meta}"

I have only the ID of the direct "parent" board (24 in your example). You would need to scrape the board hierarchy if you need the full path and the names of the boards.

Also that timestamp stuff will take me a day or two so if you want boards sooner - let me know, I can send it separately.
legendary
Activity: 2758
Merit: 6830
I do.
Would you be able to send them to me in this format (along with the post date)?

Code:
postid, date, boards
55179038, "2020-04-13 12:03:00", "{Other, Meta}"

One of the things I had wanted to do with BPIP was break down posts per hour per section of the forum. People could see when the best time to post would be. Obviously this can be done easily using the infrastructure you have set up.
That sounds like an easy one. I will add it!
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Would you also have the boards of the posts? The old archive also doesn't contain them. Cheesy

I do.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
One of the things I had wanted to do with BPIP was break down posts per hour per section of the forum. People could see when the best time to post would be. Obviously this can be done easily using the infrastructure you have set up.
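
As a sketch of what that breakdown could look like, assuming each post is available as a `(board, date)` pair with the date in the `YYYY-MM-DD HH:MM:SS` format used elsewhere in this thread. `posts_per_hour_by_board` is a hypothetical name, and apart from the first row the sample rows are made up.

```python
from collections import Counter
from datetime import datetime


def posts_per_hour_by_board(posts):
    """Count posts per (board, hour of day) from (board, 'YYYY-MM-DD HH:MM:SS') pairs."""
    counts = Counter()
    for board, date_text in posts:
        hour = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S").hour
        counts[(board, hour)] += 1
    return counts


posts = [
    ("Meta", "2020-04-13 12:03:00"),
    ("Meta", "2020-04-14 12:45:10"),  # made-up row
    ("Mining", "2020-04-13 23:59:59"),  # made-up row
]
counts = posts_per_hour_by_board(posts)
print(counts[("Meta", 12)])  # 2
```

The hour is whatever timezone the stored dates are in, so the result is only meaningful once all timestamps are in UTC.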

legendary
Activity: 2758
Merit: 6830
Quick update on this: the timezone mess is messier than I thought so it will take some time to sort through it. Basically some posts got scraped with +0200 instead of UTC but I don't know which ones, so I'll probably need to scrape some "checkpoints" and find posts between them that are "time traveling" (e.g. created later than the next post).
Would you also have the boards of the posts? The old archive also doesn't contain them. Cheesy

I finished setting up the new database and already have some cool new features to announce. But I'm waiting for the timestamps and potentially the boards so I can index the data.

Here is a sneak peek of one of them (WIP): https://talkimg.com/images/2023/05/14/blobf707e32c89df6b5f.png
legendary
Activity: 3654
Merit: 8909
https://bpip.org
Sure. I need to double-check a few things first. At one point I had some issues with timezones so I'll verify if I need to make any adjustments. It will all be in UTC once it's ready.

Quick update on this: the timezone mess is messier than I thought so it will take some time to sort through it. Basically some posts got scraped with +0200 instead of UTC but I don't know which ones, so I'll probably need to scrape some "checkpoints" and find posts between them that are "time traveling" (e.g. created later than the next post).
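
The "time traveling" check could be sketched like this, assuming post IDs increase in creation order so that timestamps sorted by ID should never go backwards. `find_time_travelers` is a hypothetical helper, and the +2h shift on post 29 in the sample is made up to show the effect.

```python
from datetime import datetime


def find_time_travelers(rows):
    """Return IDs of posts whose timestamp is later than that of the next post by ID."""
    ordered = sorted(rows)  # (post_id, datetime) tuples, ordered by post_id
    suspects = []
    for (post_id, posted_at), (_next_id, next_at) in zip(ordered, ordered[1:]):
        if posted_at > next_at:
            suspects.append(post_id)
    return suspects


rows = [
    (28, datetime(2009, 11, 22, 18, 4, 28)),
    (29, datetime(2009, 11, 22, 20, 31, 44)),  # 18:31:44 UTC recorded as +0200 (made up)
    (30, datetime(2009, 11, 22, 18, 32, 0)),
]
print(find_time_travelers(rows))  # [29]
```

This only flags the last post of a shifted run, since the posts before it are still in order relative to each other. Finding where the run starts would need the known-good "checkpoints" mentioned above.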
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
How about a blockchain search?

4 values - min,max number of bitcoins transferred.   Start,end date to search.
Blockchair has this data. It might take more disk space than all Bitcointalk posts.

I have several topics based on it already (but I don't do databases):
legendary
Activity: 1820
Merit: 2700
Crypto Swap Exchange
I could also maybe implement some kind of authentication in the future.
I think you should. Otherwise you are just asking for a DoS attack Wink

Lack of authentication doesn't mean there are no other DDoS mitigation measures implemented. Just saying... Wink

Btw, you messed the quotes up, that was TryNinja's quote.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
Let me know and I'll implement them if possible.

How about a blockchain search?

4 values - min,max number of bitcoins transferred.   Start,end date to search.
For example - I want to search for any transfers between 450-500 bitcoin between Sep 1 and Oct 31 2015.

If you integrate the crypto price in the search, I could also search for transfers of $40-$50 for example.
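
To make the four values concrete, here is a sketch of the filter over records that are already loaded, assuming each transfer is a `(date, amount_btc)` pair. The data and the `search_transfers` name are made up, and a real blockchain search would need an indexed database rather than a list.

```python
from datetime import date


def search_transfers(transfers, min_btc, max_btc, start, end):
    """Filter (date, amount_btc) records by amount range and date range, both inclusive."""
    return [
        (day, amount)
        for day, amount in transfers
        if min_btc <= amount <= max_btc and start <= day <= end
    ]


# Made-up transfers for illustration.
transfers = [
    (date(2015, 8, 30), 480),   # right amount, too early
    (date(2015, 9, 15), 475),   # matches
    (date(2015, 10, 31), 500),  # matches (bounds are inclusive)
    (date(2015, 10, 1), 12),    # in range by date, too small
]
matches = search_transfers(transfers, 450, 500, date(2015, 9, 1), date(2015, 10, 31))
print(len(matches))  # 2
```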


sr. member
Activity: 840
Merit: 375
It's a simple HTTP request:

https://api.ninjastic.space/posts/55141939

Status code 200 means the post exists and you can parse JSON from the response. 404 means not found, etc.
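
A minimal sketch of such a request using only the Python standard library. The URL pattern and the 200/404 meanings come from the post above; the helper names are hypothetical, and nothing is assumed about the fields inside the returned JSON.

```python
import json
import urllib.error
import urllib.request

API_URL = "https://api.ninjastic.space/posts/{}"


def interpret_response(status, body):
    """Return the parsed JSON for a 200 response, None for 404, and raise otherwise."""
    if status == 200:
        return json.loads(body)
    if status == 404:
        return None  # the post was not found
    raise RuntimeError("unexpected status code: {}".format(status))


def fetch_post(post_id, timeout=10):
    """Request one post from the API. This performs a real network call."""
    try:
        with urllib.request.urlopen(API_URL.format(post_id), timeout=timeout) as response:
            return interpret_response(response.status, response.read())
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses, so 404 is handled here.
        return interpret_response(error.code, error.read())
```

Calling `fetch_post(55141939)` would request the URL shown above. Keeping the status handling in `interpret_response` makes it possible to test that part without touching the server.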

Oh if it's just a simple HTTP request then I am familiar with that  Grin I thought it was some kind of special interface with mandatory authorization via an api key....

You can use them as you wish for now. But I would appreciate it if you consulted me before making many requests or implementing it in any kind of project. This way we can optimize things to keep the server working without too much workload.

I'm still working on the bot right now. If I deem it useful enough to release publicly one day, I will definitely let you know beforehand so we can optimize it.

I could also maybe implement some kind of authentication in the future.
I think you should. Otherwise you are just asking for a DoS attack Wink
legendary
Activity: 2758
Merit: 6830
Nice job.  I have some ideas that could help in scam busting.   I left you merit, but even better than that, I've left you my trust.
Thanks, Vod. Smiley

Let me know and I'll implement them if possible.

Any chance you can recover other posts from the Internet Archive or one of the bitcointalk clone sites?     Some scammers have deleted hundreds of posts of illegal activities.
It's technically possible, but I'm not sure how hard that would be. I'm prioritizing scraping all the live posts that are missing from the database. When everything is working and most features are done, I may think about doing that.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
The new ninjastic.space website is out! Remade from scratch.

Nice job.  I have some ideas that could help in scam busting.   I left you merit, but even better than that, I've left you my trust.

- The post archive is still incomplete as many posts from this year are missing. It has, however, a lot more posts than its previous version: 42,785,512 posts! Mostly from the previous years. (thanks to @LoyceV for his oldposts archive).

Any chance you can recover other posts from the Internet Archive or one of the bitcointalk clone sites?     Some scammers have deleted hundreds of posts of illegal activities.