Topic: List of all Bitcoin addresses ever used - currently UNavailable on temp location - page 8. (Read 4068 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
@LoyceV how large is the uncompressed addresses.txt.gz?
It gets around 50% larger, Bitcoin addresses don't compress very well.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
@LoyceV how large is the uncompressed addresses.txt.gz? It is at least 200GB and counting, and it's still extracting legacy addresses. I'm worried I may run out of disk space before it's all extracted. I have a 1TB quota. If you also know how big the uncompressed unique_addresses.txt.gz is, that would be useful to know.
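One way to answer the size question without using any disk space is to stream the archive through wc instead of extracting it. A minimal sketch on a tiny stand-in archive (sample_addresses.txt.gz is a made-up name; the real file would be addresses.txt.gz). Note that `gzip -l` reads the size from the archive trailer, which only stores it modulo 4 GiB, so it may not be reliable for a file this large.

```shell
# Build a small stand-in archive for illustration only.
printf '%s\n' \
  1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa \
  12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX \
  1HLoD9E4SDFFPDiYfNYnkBLQ85Y51J3Zb1 | gzip > sample_addresses.txt.gz

# Decompress to stdout and count bytes and lines; nothing is written to disk.
uncompressed_bytes=$(gunzip -c sample_addresses.txt.gz | wc -c | tr -d ' ')
line_count=$(gunzip -c sample_addresses.txt.gz | wc -l | tr -d ' ')
echo "uncompressed: $uncompressed_bytes bytes, $line_count lines"
```

On the real file this still has to decompress everything once, so it takes a while, but it needs no extra disk space.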
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
That's strange because all AWS servers have an SSD configured as the boot disk.
I guess it wasn't clear that alladdresses.loyce.club:20319 doesn't run at AWS. It uses HDD.

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Quote
-S will tell your machine to use at most 65% CPU
I think you mean RAM, not CPU. This VM has only 256 MB, so I'll let "sort" figure it out on its own.

That is correct, the argument to -S is the amount of memory for sort(1) to use for its main buffer (manpage source). With a percentage it should calculate the amount of memory to reserve. But I think even a 256MB buffer is too small for the size of the dataset you're sorting, it will hit the disk too much.

Quote
-T puts temporary files in a directory (here named "tmp") and not in RAM; if you have an SSD, the speed isn't too shabby
That's default behaviour :) It doesn't have an SSD though, and I'm using "cputool" to keep server load low. I'm okay without daily updates on this, I wouldn't want users to download this large file on a daily basis anyway.

Quote
I have sorted huge lists (>80 GB) on budget laptops using these two arguments. Worth a shot! If you want better hosting, PM me.
Since last year, I'm using an AWS server donated by suchmoon for loyce.club. However, since AWS charges $0.15/GB, I'm not comfortable hosting very large files on suchmoon's server.
When I tested sorting data on AWS, it started throttling disk IO after a while, which made it very slow. I've also tested a pay-by-the-hour-VPS, and obviously it was a lot faster.

That's strange because all AWS servers have an SSD configured as the boot disk. If you are sorting in a VM, then all that sorting is done in a virtual hard disk, so not only are you moving memory into temporary host SSD space, it's being moved inside a virtual disk file inside said SSD and that puts extra strain on your hypervisor's emulated disk controller.

So, it's emulating all the disk controller calls that read and write data from the disk, update the disk cache and do its other jobs while sort(1) moves data between its memory buffer in RAM and the hard disk (which is actually a file on your host). And it's doing that for the entire 31GB of addresses, and the algorithm sort uses needs O(n log(n)) space, which I calculate to be 310GB for your data. All this while running emulated disk writes and reads. On top of that there are the hardware-accelerated reads and writes that the host does for the VM to its disk file. That explains the poor performance while sorting.

You'll have better disk performance if you sort outside of a VM.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Code:
cat unsorted.txt | sort -u -S 65% -T tmp > sorted.txt
I'm already using "sort", which uses /tmp by default.

I'll try "sort -u" though, it might need less temporary storage than "sort | uniq". The next update is scheduled for tomorrow, I'll see how it performs.
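To compare the two approaches before the next real update, they can be run side by side on a small file. This is only a sketch: the file names and the sort_tmp directory are made up, and whether "sort -u" really needs less temporary storage on the full dataset is something to measure, not something this shows.

```shell
# Tiny stand-in for the real input (file names here are illustrative).
printf '%s\n' addrC addrA addrB addrA addrC > dup_sample.txt
mkdir -p sort_tmp

# Two steps: sort writes every line, duplicates included, then uniq drops them.
LC_ALL=C sort -T sort_tmp dup_sample.txt | uniq > via_uniq.txt

# One step: sort drops duplicates itself.
LC_ALL=C sort -u -T sort_tmp dup_sample.txt > via_sort_u.txt

# Both approaches should give identical output.
cmp via_uniq.txt via_sort_u.txt && echo "identical"
rm -r sort_tmp
```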

Quote
-S will tell your machine to use at most 65% CPU
I think you mean RAM, not CPU. This VM has only 256 MB, so I'll let "sort" figure it out on its own.

Quote
-T puts temporary files in a directory (here named "tmp") and not in RAM; if you have an SSD, the speed isn't too shabby
That's default behaviour :) It doesn't have an SSD though, and I'm using "cputool" to keep server load low. I'm okay without daily updates on this, I wouldn't want users to download this large file on a daily basis anyway.

Quote
I have sorted huge lists (>80 GB) on budget laptops using these two arguments. Worth a shot! If you want better hosting, PM me.
Since last year, I'm using an AWS server donated by suchmoon for loyce.club. However, since AWS charges $0.15/GB, I'm not comfortable hosting very large files on suchmoon's server.
When I tested sorting data on AWS, it started throttling disk IO after a while, which made it very slow. I've also tested a pay-by-the-hour-VPS, and obviously it was a lot faster.

There's one thing on my wish list though: a method to show only unique addresses in order of appearance (without sorting them). It can be done with awk '!a[$0]++', but this requires a lot of memory and doesn't use temporary files.
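One possible way to get unique addresses in order of appearance without holding everything in memory is to number the lines and let "sort" (which spills to temporary files) do the heavy lifting. This is a sketch under assumptions, not a tested solution for the full dataset: the file names are made up, and it needs two full sorts, so it trades RAM for disk IO.

```shell
# Tiny stand-in input: addrC and addrA each appear twice.
printf '%s\n' addrC addrA addrC addrB addrA > appearance_sample.txt
tab=$(printf '\t')

# 1. Prefix each line with its line number.
# 2. Sort by address, then by line number, so the first appearance leads each group.
# 3. Keep only the first line of each address group.
# 4. Sort by line number to restore the original order, then drop the number.
awk '{ print NR "\t" $0 }' appearance_sample.txt \
  | LC_ALL=C sort -t "$tab" -k2,2 -k1,1n \
  | awk -F '\t' 'NR == 1 || $2 != prev { print } { prev = $2 }' \
  | LC_ALL=C sort -t "$tab" -k1,1n \
  | cut -f2 > unique_in_order.txt

cat unique_in_order.txt
```

For the sample this prints addrC, addrA, addrB: each address once, in the order it first appeared.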
copper member
Activity: 193
Merit: 255
Click "+Merit" top-right corner

Updates
Sorting a list that doesn't fit in the server's RAM is very slow. Therefore I only update unique_addresses.txt.gz twice a month (on the 6th and 21st). Check the file date here to see how old it is. If an update fails, please post here.
In between updates, I create daily updates: alladdresses.loyce.club:20319/daily_updates/. These txt-files contain unique addresses (for that day) in order of appearance.
Due to limitations in disk space, I don't do automatic updates for addresses.txt.gz. It's complete until blockchair_bitcoin_outputs_20200719.tsv.gz.



This is a wonderful initiative! A comment: Sorting a very large list with little RAM is not necessarily a problem! Try:


Code:
mkdir tmp
cat unsorted.txt | sort -u -S 65% -T tmp > sorted.txt
rm -r tmp

-S will tell your machine to use at most 65% CPU; this is some sort of optimum, according to my experience
-T puts temporary files in a directory (here named "tmp") and not in RAM; if you have an SSD, the speed isn't too shabby

I have sorted huge lists (>80 GB) on budget laptops using these two arguments. Worth a shot! If you want better hosting, PM me.
legendary
Activity: 2758
Merit: 6830
Great, saves me the trouble :)
Can I request a CSV of all the results? That makes it so much easier to use all data than getting them per address through your site.
Just something with (at least) "address,userID,msgID" would be great for further analysis.
Of course. Once in the database, it's pretty easy to export them to the format I want.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
This is planned for my post archive. I had done that but only with ETH addresses and the 15m posts you sent me + the new scraped one.
Great, saves me the trouble :)
Can I request a CSV of all the results? That makes it so much easier to use all data than getting them per address through your site.
Just something with (at least) "address,userID,msgID" would be great for further analysis.

I'm still in the planning stage, deciding what to do first, and with all the data you've already scraped, I could do less scraping myself and instead make an API that just looks up your data.
I can get you a copy of all archived posts like I gave TryNinja if it helps. It beats scraping the forum again, although I didn't keep track of board names per topic.
legendary
Activity: 2758
Merit: 6830
I could run this code on 53 million archived posts, but the main problem will be excluding quotes. That's annoying and slow to do, and if I don't exclude them, it will completely mess up the data. On the other hand, quotes may still contain information that was deleted by the user who posted it.
Even without quotes, users still post Bitcoin addresses that aren't theirs, for instance when providing evidence on a scammer.
This is planned for my post archive. I had done that but only with ETH addresses and the 15m posts you sent me + the new scraped one.

I plan to scan all old posts + new ones for ETH and BTC addresses after everything is working fine (new bot + full database with the whole post archive).
hero member
Activity: 2184
Merit: 891
Can you also scrape all the Bitcoin addresses used here in the forum and the users that use them?
I actually can :D I found this regexp on Stack Overflow:
Code:
egrep --regexp="^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$" filename
With some slight changes it stops matching parts of Eth-addresses:
Code:
egrep -w --regexp="[13][a-km-zA-HJ-NP-Z1-9]{25,34}" *
I could run this code on 53 million archived posts, but the main problem will be excluding quotes. That's annoying and slow to do, and if I don't exclude them, it will completely mess up the data. On the other hand, quotes may still contain information that was deleted by the user who posted it.
Even without quotes, users still post Bitcoin addresses that aren't theirs, for instance when providing evidence on a scammer.

I think it would be possible if and only if you scraped the following boards:
  • Services
  • Bounties
  • Marketplace in general (both BTC and Alt)
  • And Marketplaces of all local boards if applicable/available

With that, posts that cite addresses as evidence of a scam wouldn't be a problem. And yes, it would be hard, especially if threads/posts were deleted. But that needn't be a problem as long as the list simply serves as a reference of which users have used or mentioned which addresses throughout their post history.

Quote
I think it would help labeling the users and alt accounts throughout the entire forum, and would make it easier to detect which accounts are linked to each other
A smart user would simply use different addresses. An even smarter user would use different wallets, so they don't create a blockchain trail when they make a payment.
As a quick test, 51 out of 9999 posts contain at least one Bitcoin address (starting with 1 or 3, ignoring Bech32).
For now I won't continue this search. If I ever do, I'll move this discussion to Reputation.

I'm looking forward to making it happen. Have I already mentioned my project to make an app (a BPIP ripoff)? Such data would be helpful for it. I'm still in the planning stage, deciding what to do first, and with all the data you've already scraped, I could do less scraping myself and instead make an API that just looks up your data.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Can you also scrape all the Bitcoin addresses used here in the forum and the users that use them?
I actually can :D I found this regexp on Stack Overflow:
Code:
egrep --regexp="^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$" filename
With some slight changes it stops matching parts of Eth-addresses:
Code:
egrep -w --regexp="[13][a-km-zA-HJ-NP-Z1-9]{25,34}" *
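
The effect of -w can be checked on a few made-up lines. A small sketch: the sample posts and the hex string on line 2 are invented for illustration, and "grep -E" is used as the current spelling of egrep. Adding -o prints only the matched address instead of the whole line.

```shell
# Four made-up "posts"; line 2 holds an invented 40-character hex string.
cat > sample_posts.txt <<'EOF'
Send tips to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa please.
My ETH address is 0xde3a1b2c4d5e6f7a8b9c1d2e3f4a5b6c7d8e9f1a
Multisig: 3GFfFQAFgXKiA1qqUK6rqBpEpG4vZDos6t
No address in this post.
EOF

pattern='[13][a-km-zA-HJ-NP-Z1-9]{25,34}'

# Without -w, the pattern also matches inside the hex string on line 2.
lines_without_w=$(grep -E -c -e "$pattern" sample_posts.txt)

# With -w, a match must be a whole word, so the hex fragment is rejected.
lines_with_w=$(grep -E -w -c -e "$pattern" sample_posts.txt)
echo "matching lines without -w: $lines_without_w, with -w: $lines_with_w"

# -o prints only the matched addresses instead of the whole line.
grep -E -w -o -e "$pattern" sample_posts.txt > found_addresses.txt
cat found_addresses.txt
```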

I could run this code on 53 million archived posts, but the main problem will be excluding quotes. That's annoying and slow to do, and if I don't exclude them, it will completely mess up the data. On the other hand, quotes may still contain information that was deleted by the user who posted it.
Even without quotes, users still post Bitcoin addresses that aren't theirs, for instance when providing evidence on a scammer.

Quote
I think it would help labeling the users and alt accounts throughout the entire forum, and would make it easier to detect which accounts are linked to each other
A smart user would simply use different addresses. An even smarter user would use different wallets, so they don't create a blockchain trail when they make a payment.

As a quick test, 51 out of 9999 posts contain at least one Bitcoin address (starting with 1 or 3, ignoring Bech32).

For now I won't continue this search. If I ever do, I'll move this discussion to Reputation.
hero member
Activity: 2184
Merit: 891
~

Can you also scrape all the Bitcoin addresses used here in the forum and the users that use them? Yes, some users would have used the same wallet because they are just alts of someone (it takes a lot of investigation just to prove that). And I think it would help labeling the users and alt accounts throughout the entire forum, and would make it easier to detect which accounts are linked to each other and which are disobeying campaign rules and even forum rules (enrolling many accounts in a single bounty or sig campaign).
sr. member
Activity: 443
Merit: 350
Very interesting statistics, thank you!

-snip-
Addresses with most receiving transactions
This is the Top 100, the number in front of the address shows how many transactions it has received:
-snip-
 326839 d-d0d953f2e7043342540a1407243e49fe
...
 289070 d-0e9deef32abfc454392d21725f9defef
...
 262539 d-73fd8c31c9fc1d084f44b301bb7adb6a
...
 224217 d-752ed0099932a96fbc0a854a4d3a300f
...
 219174 s-e3b0c44298fc1c149afbf4c8996fb924
-snip-

Can you please clarify, what is the type of these d- and s- addresses?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Some interesting (?) statistics (updated until blockchair_bitcoin_outputs_20200719.tsv.gz)
Total address count: 1,484,589,749
1... address count: 1,039,899,708
3... address count: 343,485,961
bc1q... address count: 55,006,904
...-... (with a "dash") address count: 46,197,161

Unique address count: 693,180,830
1... address count: 470,943,308
3... address count: 167,941,821
bc1q... address count: 39,137,878
...-... (with a "dash") weird address count: 15,157,808
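
Counts like these can be produced in a single pass with awk. A minimal sketch on a five-line sample; the file names and the "other" bucket are assumptions for illustration, and the rule "any remaining line containing a dash" is a guess at how the dash category was defined.

```shell
# Small sample taken from addresses that appear in this topic.
printf '%s\n' \
  1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa \
  3GFfFQAFgXKiA1qqUK6rqBpEpG4vZDos6t \
  bc1qnsupj8eqya02nm8v6tmk93zslu2e2z8chlmcej \
  d-d0d953f2e7043342540a1407243e49fe \
  1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp > prefix_sample.txt

# Classify each line by its prefix; the first matching rule wins.
awk '
  /^1/    { count["1"]++;    next }
  /^3/    { count["3"]++;    next }
  /^bc1q/ { count["bc1q"]++; next }
  /-/     { count["dash"]++; next }
          { count["other"]++ }
  END {
    printf "total %d\n", NR
    printf "1... %d\n", count["1"]
    printf "3... %d\n", count["3"]
    printf "bc1q... %d\n", count["bc1q"]
    printf "dash %d\n", count["dash"]
    printf "other %d\n", count["other"]
  }
' prefix_sample.txt > prefix_counts.txt

cat prefix_counts.txt
```

Running it on the full list gives the total counts; running it on the deduplicated list gives the unique counts.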

Addresses with most receiving transactions
This is the Top 100, the number in front of the address shows how many transactions it has received:
Code:
4467608 1HckjUpRGcrrRAtFaaCAUaGjsPx9oYmLaZ
1900428 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4
1601193 1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp
1527471 1FoWyxwPXuj4C6abqwhjDWdz6D4PZgYRjA
1204787 1LuckyR1fFHEsXYyx5QK4UFzv3PEAepPMK
1105406 1dice97ECuByXAvqXpaYzSaQuPVvrtmz6
1021575 3CD1QW6fjgTwKq3Pj97nty28WZAVkziNom
1009836 1G47mSr3oANXMafVrR8UC4pzV7FEAzo3r9
 929737 3JXRVxhrk2o9f4w3cQchBLwUeegJBj6BEp
 872274 1J37CY8hcdUXQ1KfBhMCsUVafa8XjDsdCn
 859422 3422VtS7UtCvXYxoXMVp6eZupR252z85oC
 841967 168o1kqNquEJeR9vosUB5fw4eAwcVAgh8P
 832807 1P9RQEr2XeE3PEb44ZE35sfZRRW1JHU8qx
 782811 1VayNert3x1KzbpzMGt2qdqrAThiRovi8
 689574 37Tm3Qz8Zw2VJrheUUhArDAoq58S6YrS3g
 676674 1DUb2YYbQA1jjaNYzVXLZ7ZioEhLXtbUru
 663458 bc1qwqdg6squsna38e46795at95yu9atm8azzmyvckulcc7kytlcckxswvvzej
 631610 17kb7c9ndg7ioSuzMWEHWECdEVUegNkcGc
 595853 1dice9wcMu5hLF4g81u8nioL5mmSHTApw
 580565 1Po1oWkD2LmodfkBYiAktwh76vkF93LKnh
 573787 1LAnF8h3qMGx3TSwNUHVneBZUEpwE4gu3D
 520889 1NDyJtNTjmwk5xPNhjgAMu4HDHigtobu1s
 505956 13vHWR3iLsHeYwT42RnuKYNBoVPrKKZgRv
 448252 1Fi9J5TeaWPHdU5cTJ4e9jr3V58SrWtUuT
 437634 1dice7fUkz5h4z2wPc1wLMPWgB5mDwKDx
 406471 1MPxhNkSzeTNTHSZAibMaS8HS1esmUL1ne
 395663 1dice7W2AicHosf5EL3GFDUVga7TgtPFn
 394249 1LuckyY9fRzcJre7aou7ZhWVXktxjjBb9S
 389038 1D5bPm1YAdn9WvAAixht7PbACU3TtkqtJJ
 376310 17A16QmavnUfCW11DAApiJxp7ARnxN5pGX
 364311 3HNSiAq7wFDaPsYDcUxNSRMD78qVcYKicw
 363898 3MfN5to5K5be2RupWE8rjJHQ6V9L8ypWeh
 357641 3HRZjedwF2AJejNTtgznWnas4E6froNP5r
 354691 1LuckyG4tMMZf64j6ea7JhCz7sDpk6vdcS
 346986 366Dgw4pi3rnvu5zizVWZF6nijWxZWc6RA
 341430 1dice6YgEVBf88erBFra9BHf6ZMoyvG88
 326839 d-d0d953f2e7043342540a1407243e49fe
 325099 38jMiiZs2C5n5MPkyc5pSA7wwW6H4p6hPa
 293567 38ENmTr2AD1avJrmmi9iM7PfS6nZVmuMKf
 289070 d-0e9deef32abfc454392d21725f9defef
 285507 1N52wHoVR79PMDishab2XmRHsbekCdGquK
 282321 3PUuiYu5cFMsagkffArrKZzQFtWdHttU3x
 280691 367f4YWz1VCFaqBqwbTrzwi2b1h2U3w1AF
 280107 1FoxBitjXcBeZUS4eDzPZ7b124q3N7QJK7
 262539 d-73fd8c31c9fc1d084f44b301bb7adb6a
 262317 1Fi57hAqyYYwaQVdA7a9qSKfiukBbt31G3
 253795 1K2SXgApmo9uZoyahvsbSanpVWbzZWVVMF
 252344 1dice5wwEZT2u6ESAdUGG6MHgCpbQqZiy
 251282 3JnFBLxDCutY3bZEZsPTkHAaUA1bxmEMX2
 250862 1diceDCd27Cc22HV3qPNZKwGnZ8QwhLTc
 247797 352zT3Ts9piSDhZpBsDoZMvdtDmJioQNBo
 246472 12JYmnfYU2ghzjwUAspzJsSnmJtK9bZPYR
 243955 1x6YnuBVeeE65dQRZztRWgUPwyBjHCA5g
 240428 3A4U175prUGEn3B1gUDkz32u8fnF9Nx3Ly
 232303 357d4rAjQhDPaWhZrBAFY7aizVPkNSq2DH
 230290 18rdKmjrg1EawxgiVT3ikLExj6GWS2MNCk
 229128 3JjPf13Rd8g6WAyvg8yiPnrsdjJt1NP4FC
 226837 1HWqsgnSd12Gv8SpoUMi1Cj8hp79BTSpW7
 226259 1changemCPo732F6oYUyhbyGtFcNVjprq
 224451 138o15eFWEEPv2ayKW2CZCgVvv5ZaZvomP
 224217 d-752ed0099932a96fbc0a854a4d3a300f
 219697 bc1qnsupj8eqya02nm8v6tmk93zslu2e2z8chlmcej
 219174 s-e3b0c44298fc1c149afbf4c8996fb924
 215870 1Kr6QSydW9bFQG1mXiPNNu6WpJGmUa9i1g
 215691 37p9pUugydmoLpQyFLLqGAgjWmUFERa1Pq
 215520 19iVyH1qUxgywY8LJSbpV4VavjZmyuEyxV
 212059 1dice7EYzJag7SxkdKXLr8Jn14WUb3Cf1
 209001 1F89hmmrtonJfAQNAqDmeDadcw7AsZcvXG
 207701 1NDpZ2wyFekVezssSXv2tmQgmxcoHMUJ7u
 207697 1Bd5wrFxHYRkk4UCFttcPNMYzqJnQKfXUE
 207524 15fXdTyFL1p53qQ8NkrjBqPUbPWvWmZ3G9
 207499 14719bzrTyMvEPcr7ouv9R8utncL9fKJyf
 207424 18uvwkMJsg9cxFEd1QDFgQpoeXWmmSnqSs
 207385 1J4yuJFqozxLWTvnExR4Xxe9W4B89kaukY
 207376 1Bqm5MDo82m1FTxV3qYNUUEKnESPRhk9jd
 207256 1HVpyjYEPwQhvRQ3dL8tGe9kiydti616sX
 207228 17NKcZNXqAbxWsTwB1UJHjc9mQG3yjGALA
 207218 1HjDauL2kth6KJUz5vX198Nvp1xN1hgYRb
 207187 13h1DP2Boo9TAsenphroACxhNy7pGxDYXd
 207138 1MSzmVTBaaSpKDARK3VGvP8v7aCtwZ9zbw
 207053 1GoK6fv4tZKXFiWL9NuHiwcwsi8JAFiwGK
 207006 13HFqPr9Ceh2aBvcjxNdUycHuFG7PReGH4
 206834 1L4EThM6x3Rd2PjNbs1U136FpMq4Gmo3fJ
 206826 14ChPPM8rPYJeHnw6kMVUDnNNKx1KnjYW4
 206808 1AdN2my8NxvGcisPGYeQTAKdWJuUzNkQxG
 206760 1DpsR91YmHUDTtiuH1pPCuG3RqAkmg6YKB
 206707 1PeohaRGaTF8cSzDqP1yYfzDah66xiriEQ
 206664 1JmcV7G3r8k7ev2EkS84MmsvxGyhiRGP84
 206572 1HZHBnH2FbHNWieMxAh4xBPfgfuxW15UPt
 206469 18czPiA9PcCs7rFTBZnhvNAWuh1pEZRpGJ
 206346 12Cf6nCcRtKERh9cQm3Z29c9MWvQuFSxvT
 206344 1MPerpQzTABa1K2eXQxsQTDSZtDQHWf6vk
 206247 1dice1e6pdhLzzWQq7yMidf6j8eAg7pkY
 206243 18XSLnBZ8ydMUkaifU6sQBMJzmm7JvDeUp
 205690 bc1quq29mutxkgxmjfdr7ayj3zd9ad0ld5mrhh89l2
 203334 3QQB6AWxaga6wTs6Xwq8FYppgrGinGu15f
 201993 3M92sq9ssFaNbEwF47uteVKJsbw125juS7
 199135 1AScRhqdXMrJyxNmjEapMZi1PLFsqmLquG
 196271 18p9Ftp3m4435tdpZTvoBsm3yjUgkvTF2b
 193271 33fDiKKhr2F2uRv2jJzdKT3ECuK3wzCq5d
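
A list in this format (count in front of the address) is what a "sort | uniq -c" pipeline produces. A sketch of how such a Top N could be made, assuming one address per line for each received output; the sample file name is made up and this is not necessarily how the list above was generated.

```shell
# Stand-in for the full list of outputs: one address per received output.
printf '%s\n' addrA addrB addrA addrC addrA addrB > outputs_sample.txt

# Group identical addresses, count them, sort by count (highest first), keep the top 2.
LC_ALL=C sort outputs_sample.txt \
  | uniq -c \
  | sort -rn \
  | head -n 2 > top_addresses.txt

cat top_addresses.txt
```

For the real Top 100, "head -n 100" would replace "head -n 2".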
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Background
To follow up on List of all Bitcoin addresses with a balance and this post, I made a list of all Bitcoin addresses that have ever been used.

The data
See alladdresses.loyce.club (new location)
I now have the resources (RAM, CPU power and disk space) and code to show unique addresses in their original order. Each address is only shown once. I have 2 large files:

1. All Bitcoin addresses ever used, in chronological order, without duplicates.
Sample: addresses_in_order_of_first_appearance.txt.gz: (Warning: 18 GB):
Code:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX
1HLoD9E4SDFFPDiYfNYnkBLQ85Y51J3Zb1
.......
3GFfFQAFgXKiA1qqUK6rqBpEpG4vZDos6t
3Mbtv47gZ2eN6Fy7owpgHHwSLYHS42P56P
38JyF2RQknBUMETyRT2yGndDJFYSp6hJNg

2. All Bitcoin addresses ever used, sorted by address, without duplicates.
Sample: addresses_sorted.txt.gz: (Warning: 16 GB):
Code:
1111111111111111111114oLvT2
111111111111111111112BEH2ro
111111111111111111112xT3273
.......
s-ffd80dee5966fb23c1a483b28f6bfcbc
s-fff5d0faa9628c188e97661f0e185fce
s-ffff291613d413b4ac128df96a462294

Updates
Sorting a list that doesn't fit in the server's RAM is slow. Therefore I only update both large files (addresses_sorted.txt.gz and addresses_in_order_of_first_appearance.txt.gz) twice a month (on the 6th and 21st, updates take more than a day). Check the file date here to see how old it is. If an update fails, please post here.
In between updates, I create daily updates: alladdresses.loyce.club/daily_updates/. These txt-files contain unique addresses (for that day) in order of appearance.
I won't keep older snapshots.
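
If you keep a local copy of the sorted list, a daily update can be merged into it without sorting the whole list again. A minimal sketch with stand-in files (all file names here are hypothetical, and it assumes the big list is sorted in the C locale):

```shell
printf '%s\n' addrA addrC addrE > master_sorted.txt   # stand-in for addresses_sorted.txt
printf '%s\n' addrD addrA addrB > daily_sample.txt    # stand-in for one daily update file

# The daily file is in order of appearance, so sort it first (it is small).
LC_ALL=C sort -u daily_sample.txt > daily_sorted.txt

# -m merges inputs that are already sorted; -u drops addresses present in both.
LC_ALL=C sort -m -u master_sorted.txt daily_sorted.txt > master_updated.txt

cat master_updated.txt
```

The merge reads each file once from start to end, so it avoids the large temporary files of a full sort.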

Bandwidth
This server is allowed 5 TB bandwidth per month. If it runs out, it probably gets suspended :P I haven't tried that yet.

Credits
Blockchair Database Dumps has a staggering amount of data, easily accessible (at 10 kB/s (or recently 100 kB/s)) with daily updates. All data presented in this topic comes from Blockchair.

No spam please.
Self-moderated against spam. Discussion and questions are welcome.

Q&A
Can you please clarify, what is the type of these d- and s- addresses?
This is how Blockchair.com shows OP_RETURN. From the main page the search field doesn't show them, but you can replace a Bitcoin address in the URL to find them: https://blockchair.com/bitcoin/address/d-d0d953f2e7043342540a1407243e49fe.

Tips and tricks
Some suggestions for Linux/VPS users:
Code:
wget http://alladdresses.loyce.club/addresses_sorted.txt.gz -O - | gunzip > addresses_sorted.txt
This doesn't save the .gz but extracts it while downloading.

Code:
comm -12 <(sort list.txt) addresses_sorted.txt
This outputs all Bitcoin addresses from "list.txt" that have ever been funded.

Code:
comm -12 <(sort list.txt) addresses_sorted.txt > output.txt
This does the same, but writes to output.txt instead of console.
This search is fast, even with millions of addresses in list.txt; it's mainly limited by how fast your computer can read from disk.
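
Note that comm expects both inputs to be sorted in the same order. A small sketch with stand-in files, assuming the big list is sorted in the C locale (all file names and the short fake addresses are made up for illustration); it also shows "-23" to list the addresses that were not found.

```shell
printf '%s\n' 1AAA 1BBB 3CCC bc1qddd > funded_sorted_sample.txt   # stand-in for addresses_sorted.txt
printf '%s\n' 3CCC 1ZZZ 1AAA > my_list.txt                        # stand-in for list.txt

# Sort your own list with the same locale as the big file.
LC_ALL=C sort my_list.txt > my_list_sorted.txt

# -12 keeps only lines present in both files: addresses that have been funded.
LC_ALL=C comm -12 my_list_sorted.txt funded_sorted_sample.txt > funded_hits.txt

# -23 keeps lines only in the first file: addresses that were never funded.
LC_ALL=C comm -23 my_list_sorted.txt funded_sorted_sample.txt > never_funded.txt

cat funded_hits.txt
```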



Related topics
Bitcoin block data available in CSV format
List of all Bitcoin addresses with a balance
List of all Bitcoin addresses ever used
[~500 GB] Bitcoin block data: inputs, outputs and transactions