
Topic: List of all Bitcoin addresses ever used - currently UNavailable on temp location

newbie
Activity: 6
Merit: 15
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.

I found a bit of time to write this. Testing it now..

Just to check with you, I was sorta wrong here:
Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.

1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
....
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's n + k instead of n * k, right?

Since we're sorting as strings it would actually be:
n = 1 10 11 12 13 14 15 16 19 20 5
k = 18 3 6

The whole list would then become:
all = 1 10 11 12 13 14 15 16 18 19 20 3 5 6

Correct?
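A quick way to sanity-check this on the command line (throwaway sketch; GNU sort with plain lexicographic string order assumed):
Code:
printf '%s\n' 1 5 10 11 12 13 14 15 16 19 20 | sort > n.txt
printf '%s\n' 3 6 18 | sort > k.txt
# sort -m merges already-sorted input without re-sorting, so this is the n + k pass:
sort -m n.txt k.txt
# Output (string order): 1 10 11 12 13 14 15 16 18 19 20 3 5 6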
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
we host high-end enterprise infrastructure, so just in case you need some space or mirrors, you're welcome if ever in need.
For now, I'm covered for bandwidth, thanks Smiley

So the real solution (fell asleep while studying the dataset  Cheesy) is to take the transaction_hex field and pass it as the argument to the "decoderawtransaction" RPC call. It'll return JSON where the signature script is located at ["vin"][N]["scriptSig"]["hex"] for each input index N; the compressed public key is then the last 33 bytes of that hex.
And this just went over my head Tongue

Quote
You'll obviously need a bitcoind for that and it's possible to configure access from another machine if you're resource-constrained on the box you're parsing this address data on.

I think bitcoind should be able to handle the load, especially if it's running locally.
Although I'd like to be able to extract all data myself from Bitcoin Core (so I don't need to rely on Blockchair anymore), it also makes it much more complicated. So for now, I'll pass on this.
And I don't want to add more local data processing to what I'm doing already. If anything, I want to move more to a VPS.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Is it possible to link the public key for every bitcoin address in your database?
If I can get the data I can add it. I'm no expert on this; can I use anything from inputs (maybe spending_signature_hex?) to get this data?

I looked for compressed keys at the end of the spending_signature_hex values and I found that a lot of them don't have public keys at the end. Makes me think they are signatures of transactions, not scripts.

So the real solution (fell asleep while studying the dataset  Cheesy) is to take the transaction_hex field and pass it as the argument to the "decoderawtransaction" RPC call. It'll return JSON where the signature script is located at ["vin"][N]["scriptSig"]["hex"] for each input index N; the compressed public key is then the last 33 bytes of that hex.

You'll obviously need a bitcoind for that and it's possible to configure access from another machine if you're resource-constrained on the box you're parsing this address data on.

I think bitcoind should be able to handle the load, especially if it's running locally.
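Something along these lines should work as a rough extraction sketch (assuming bitcoin-cli and jq are on hand, that $RAWTX holds the transaction_hex value, and that the input is a plain non-segwit P2PKH spend):
Code:
# Decode the raw transaction and print each input's scriptSig hex:
bitcoin-cli decoderawtransaction "$RAWTX" | jq -r '.vin[].scriptSig.hex' |
while read -r sig; do
  # For a typical P2PKH spend, the public key is the last data push,
  # i.e. the last 33 bytes (66 hex characters) if it's compressed:
  echo "${sig: -66}"
done
A compressed key starts with 02 or 03; segwit spends have an empty scriptSig, so those would need different handling.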
jr. member
Activity: 36
Merit: 3
Just want to say what a great job you did. We use your data to build graphs and do some fun stuff (we download twice each month so as not to be too demanding on your bandwidth).

We were building a pubkey list too, but in the end it wasn't worth the effort on our part (there wasn't much fun you could really do with it).

For a living we host high-end enterprise infrastructure, so just in case you need some space or mirrors, you're welcome if ever in need.

Thanks  Wink
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Is it possible to link the public key for every bitcoin address in your database?
If I can get the data I can add it. I'm no expert on this; can I use anything from inputs (maybe spending_signature_hex?) to get this data?
sr. member
Activity: 443
Merit: 350
Hi! Is it possible to link the public key for every bitcoin address in your database? (of course only for those where the public key was exposed).
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
You can have full control of an account minus billing.  I can pay the bill and accept crypto.
It's really not worth it for this project. I prefer to pay a low amount once a year, and once it reaches its data limit, it just shuts down until the next month starts.

You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be, I've never tried pigz myself.
Parallel compression is only useful when server load isn't a restriction. For now, I'll stick with standard gzip.
newbie
Activity: 6
Merit: 15
Really curious how that test works out. I do hope it does a little bit more than just merge the files and not sort them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.
That's a significant improvement. You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be, I've never tried pigz myself.
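If you want to give it a go, it should be a drop-in replacement for gzip in your pipeline (untested sketch on my side; -p caps the number of threads, which should keep the server load in check):
Code:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | pigz -p 2 > test2.txt.gz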

Quote
I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However a bloomfilter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
That's going over my head, and probably far too complicated for something this simple.
Honestly, the bloomfilter was a silly suggestion. It will probably not be a big improvement (if any) compared to your current code.

I use Blockchair's daily outputs to update this, not the daily list of addresses.
See: http://blockdata.loyce.club/alladdresses/daily_updates/ for old daily files.
Thanks! Hoping to do some experimenting soon (if I have the time...)
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
The offer is good, but AWS wants my credit card, which I don't want to link to this. I only use hosting that accepts crypto.

Hello!  (waving)   You can have full control of an account minus billing.  I can pay the bill and accept crypto.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Next 9.999 TB / Month   $0.09 per GB (About $500 a year)
Today's counter is at 50 GB (in 15 days), which brings me to $1000 per year if I had to pay $0.09 per GB. At the current rate, I'll hit this VPS's data limit by the end of the month, and so far traffic keeps going up. My current limit is 1 TB/month, and for $0.00067 per GB I can double that.

Quote
Also, you can get a server with 256GB of RAM and 32 processors for $1.50 per hour.  You attach your storage to the VPS, run your queries for however long it takes, then terminate the instance and move your storage back to your existing lower powered system.

Now that you have an established site and use case for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested.  I'm in training to be certified as a cloud consultant.
The offer is good, but AWS wants my credit card, which I don't want to link to this. I only use hosting that accepts crypto.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
I don't want to be demanding, and AWS charges (my sponsor) $0.09 per GB. That's okay for HTML, but not for large files. My List of all Bitcoin addresses with a balance alone transferred 450 GB since the start of this year. That would be $1000 per year on AWS, while it costs only a fraction elsewhere. I love how reliable AWS is, it just always works, but that's not necessary for my blockdata.

Full price storage for AWS:
First 50 TB / Month   $0.023 per GB
Next 450 TB / Month   $0.022 per GB
Over 500 TB / Month   $0.021 per GB
You can then reduce these costs by up to 72% if you commit to a certain spend.

Data transfer out of AWS:
Up to 1 GB / Month      $0.00 per GB
Next 9.999 TB / Month   $0.09 per GB (About $500 a year)

Consider that your data is all alone on your VPS too.  If you were on AWS, you could transfer your data to other AWS clients (like me) for $0.01 per GB.  Smiley

Also, you can get a server with 256GB of RAM and 32 processors for $1.50 per hour.  You attach your storage to the VPS, run your queries for however long it takes, then terminate the instance and move your storage back to your existing lower powered system.

Now that you have an established site and use case for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested.  I'm in training to be certified as a cloud consultant.

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
This post is the result of some trial & error. I also noticed blockdata.loyce.club/ gets terribly slow once in a while, which made RamNode useless for this data.

Really curious how that test works out. I do hope it does a little bit more than just merge the files and not sort them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.
I'll try:
Code:
Old code:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 | gzip > newchronological.txt.gz
real    194m24.456s

New:
time comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) > newaddresses.txt
real    8m4.045s
time cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 > all_daily_addresses_chronological_order.txt
real    1m14.593s
cat all_daily_addresses_chronological_order.txt newaddresses.txt | nl -nln | sort -k2 -S80% > test.txt
real    0m36.948s

I discovered uniq -f1 on stackexchange:
Code:
cat test.txt | uniq -df1 | sort -nk1 -S80% | cut -f2 > test2.txt
real    0m7.721s

Code:
Combined:
time cat <(cat <(cat ../daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c ../addresses_sorted.txt.gz) <(sort -u ../daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2 > newaddresses_chronological.txt
real    9m45.163s
Even more combined:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) <(cat <(cat <(cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2) > new.alladdresses_chronological.txt
real    19m34.926s
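To unpack those combined one-liners, here are the same steps written out separately (just an annotated restatement of the commands above; daily_chrono.txt and new.txt are illustrative intermediate file names, not ones I actually use):
Code:
# 1. All addresses from the daily files, deduplicated, in chronological order:
cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 > daily_chrono.txt

# 2. Daily addresses that don't appear in the sorted master list yet:
comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) > new.txt

# 3. Keep only the addresses present in both files (the new ones), in chronological order:
cat daily_chrono.txt new.txt | nl -nln | sort -k2 -S80% | uniq -df1 | sort -nk1 -S80% | cut -f2 > newaddresses_chronological.txt

# 4. Append them to the existing chronological list ("even more combined" version):
cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) newaddresses_chronological.txt > new.alladdresses_chronological.txt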
This can significantly improve performance, especially if I keep uncompressed files for faster access. But something is wrong: I get 3 different output files from the 3 different methods.

split -l 50000000 sorted.txt (it will split into files starting with the name xaa, then xab, ...)
I don't see any benefit in splitting files for processing.



I have a server on RAID0 with 882MB/s read and 191MB/s write, so copying this stuff to a different place on the same disk will take about 40 seconds or so for a 30GB dataset.
Dedicated? Cheesy That's the dream Shocked But even then, sorting data means reading and writing the same data several times.

You are on AWS, right?   Why not have your sponsor upgrade your instance to a higher class for a few hours?  That's the beauty of on-demand processing. Smiley
I don't want to be demanding, and AWS charges (my sponsor) $0.09 per GB. That's okay for HTML, but not for large files. My List of all Bitcoin addresses with a balance alone transferred 450 GB since the start of this year. That would be $1000 per year on AWS, while it costs only a fraction elsewhere. I love how reliable AWS is, it just always works, but that's not necessary for my blockdata.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets, it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM... it's not that expensive to use it a couple hours per month only.

You are on AWS, right?   Why not have your sponsor upgrade your instance to a higher class for a few hours?  That's the beauty of on-demand processing. Smiley

EC2?   Those are dedicated resources, not shared.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
(I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
It is a good offer Smiley Currently, disk space isn't the problem. I am looking for a webhost that allows me to abuse the disk for a few hours continuously once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do this.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. This looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it a couple hours per month only.

I have a server on RAID0 with 882MB/s read and 191MB/s write, so copying this stuff to a different place on the same disk will take about 40 seconds or so for a 30GB dataset.

AWS VPSes run on shared hardware, so that's probably why you're getting throttled. There are dedicated servers on AWS you can get where you're in total control over the hardware and they don't throttle you and stuff. But I'm glad the RamNode account worked out for you. Let me know if you need help writing automation stuff.
member
Activity: 310
Merit: 34
You were discussing how to sort, remove duplicates, and make the list available both raw and sorted.
My system is an i3-6100 with 16GB DDR4 RAM, and on it I manage to sort and deduplicate the raw 19GB file within 1 hour; the daily data is just a few minutes of work.
Let me explain.
Simply do:
sort raw.txt >> sorted.txt
split -l 50000000 sorted.txt (it will split into files starting with the name xaa, then xab, ...)
Next, remove the duplicates with perl, which is fast and can load a file of roughly 3GB; we make it even faster by selecting 50 million lines per chunk:
perl -ne'print unless $_{$_}++' xaa > part1.txt
2nd file:
perl -ne'print unless $_{$_}++' xab > part2.txt
In the end you'll have completed all files within 1 hour.

Now combine all files:
cat part*.txt >> full-sorted.txt
or, to keep the sorted order, list them explicitly (part1.txt ... part10.txt):
cat part1.txt part2.txt part3.txt >> full-sorted.txt

Stage 2
For the second group, you can continue onward from 21 Dec 2020: take all daily update files, then combine, sort and remove duplicates.
Name the result new-group.txt.

The command is:
join new-group.txt full-sorted.txt >> filter.txt

Here filter.txt is what the 2 files (new-group.txt and full-sorted.txt) have in common.
Now remove filter.txt from new-group.txt to get only the genuinely new addresses:

awk 'FNR==NR{ a[$1]; next } !($1 in a)' filter.txt new-group.txt >> pure-new-addresses.txt

Stage 3
If you still need everything in one file:

Combine pure-new-addresses.txt and full-sorted.txt:
cat pure-new-addresses.txt full-sorted.txt >> pre-full-sorted.txt
sort pre-full-sorted.txt >> new-full-addresses

It's recommended to keep one file as it was last created on 21 Dec 2020 and start a second file from there onward; then perform only stage 2, and you will have only the new addresses that don't appear in the first 19GB file.

I hope I've explained all the points and that this helps you and the community. For any further info, just ask; I'm happy to provide whatever info I have.
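Side note: since new-group.txt and full-sorted.txt are both sorted, stage 2 could also be done in a single pass with comm, the same tool used elsewhere in this thread (a sketch, not a replacement for the steps above):
Code:
# Lines that appear only in new-group.txt, i.e. addresses not yet in the 19GB list:
comm -23 new-group.txt full-sorted.txt > pure-new-addresses.txt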
newbie
Activity: 6
Merit: 15
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
       -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

Really curious how that test works out. I do hope it does a little bit more than just merge the files and not sort them.

I do see that for the other list it might be a bit more difficult...

It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets, it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.

I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However a bloomfilter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
Some quick math:
1GB: 1 in 13 false positives
2GB: 1 in ~170
3GB: 1 in ~2,200
4GB: 1 in ~28,000
5GB: 1 in ~365,000
6GB: 1 in ~4,700,000
7GB: 1 in ~61,000,000
8GB: 1 in ~800,000,000


Of course this would require some hashing overhead, but that should still be far cheaper than looping over your 1.5 billion addresses. Unfortunately you'd still have to double check any positives, because they might be false.
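These numbers are consistent with the standard Bloom filter approximation p ≈ (1 - e^(-kn/m))^k with the near-optimal k ≈ (m/n)·ln 2 hash functions, for n = 1.5 billion addresses and m bits of filter (taking 1 GB as 10^9 bytes). A quick awk sketch that roughly reproduces the list above:
Code:
awk 'BEGIN {
  n = 1.5e9                        # number of addresses
  for (gb = 1; gb <= 8; gb++) {
    m = gb * 8e9                   # filter size in bits (1 GB = 10^9 bytes)
    k = int(m / n * log(2) + 0.5)  # near-optimal number of hash functions
    p = (1 - exp(-k * n / m))^k    # false-positive probability
    printf "%dGB: 1 in %.0f\n", gb, 1/p
  }
}'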

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.

By the way, I just checked out (but not downloaded) the daily file on blockchair. It's close to 1GB (compressed), but you mentioned 20MB for new addresses on numerous occasions. I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files? I should be able to throw some code together that makes this work, fairly quickly.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
We can read n from disk line by line and compare it to the current position in k.
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
       -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

However, the bigger problem remains: updating 1.5 billion unique addresses in chronological order. Those lists are unsorted, so for example:
Existing long list with 12 years of data:
Code:
5
3
7
2
9
New daily list:
Code:
4
3
The end result should be:
Code:
5
3
7
2
9
4
It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets, it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.
I ended up with nl | sort -uk2 | sort -nk1 | cut -f2.
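A toy run of that pipeline on the example above shows it keeps the first occurrence of each address in its original position (throwaway sketch, made-up file names):
Code:
printf '%s\n' 5 3 7 2 9 > existing.txt   # 12 years of data, chronological
printf '%s\n' 4 3 > daily.txt            # new daily list

# nl numbers every line, sort -uk2 keeps the first copy of each address,
# sort -nk1 restores the original order, cut -f2 drops the line numbers again:
cat existing.txt daily.txt | nl | sort -uk2 | sort -nk1 | cut -f2
# Output: 5 3 7 2 9 4 (one address per line)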
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.



I don't remember if I offered you this before but I can host this data for you if it's not too big
You did (more or less):
If network bandwidth is a problem I'm able to host this on my hardware if you like.
So I guess you missed my reply too:
I'm more in need for more disk space for sorting this data, but I haven't decided yet where to host it.

(I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
It is a good offer Smiley Currently, disk space isn't the problem. I am looking for a webhost that allows me to abuse the disk for a few hours continuously once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do this.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. This looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it a couple hours per month only.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.

I don't remember if I offered you this before but I can host this data for you if it's not too big (I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
newbie
Activity: 6
Merit: 15
We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem, reading 30 quadrillion bytes from RAM still takes much longer than my current system.

(...)

I might be utterly mistaken, but hear me out:

Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.

1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
....
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's n + k instead of n * k, right?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem, reading 30 quadrillion bytes from RAM still takes much longer than my current system.

I may be able to improve on the sorted list by merging lists, and I may be able to improve on everything by keeping big temp files instead of only compressed files (but as always I need some time to do this).

Quote
Have you considered releasing the big files as torrents with a webseed? This will allow downloaders to still download from your server and then (hopefully) continue to seed for a while; taking some strain of your server.
No, until now download bandwidth isn't a problem. Only a few people have been crazy enough to download these files. If this ever goes viral it would be a great solution though.