
Topic: List of all Bitcoin addresses ever used - currently UNavailable on temp location - page 6. (Read 4161 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Next 9.999 TB / Month   $0.09 per GB (About $500 a year)
Today's counter is at 50 GB (in 15 days), which puts me at $1000 per year if I had to pay $0.09 per GB. At the current rate, I'll hit this VPS's data limit by the end of the month, and so far traffic keeps going up. My current limit is 1 TB/month, and for $0.00067 per GB I can double that.

Quote
Also, you can get a server with 256GB of RAM and 32 processors for $1.50 per hour.  You attach your storage to the VPS, run your queries for however long it takes, then terminate the instance and move your storage back to your existing lower powered system.

Now that you have an established site and use case for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested.  I'm in training to be certified as a cloud consultant.
The offer is good, but AWS wants my credit card, which I don't want to link to this. I only use hosting that accepts crypto.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
I don't want to be demanding, and AWS charges (my sponsor) $0.09 per GB. That's okay for HTML, but not for large files. My List of all Bitcoin addresses with a balance alone has transferred 450 GB since the start of this year. That would be $1000 per year on AWS, while it costs only a fraction of that elsewhere. I love how reliable AWS is, it just always works, but that's not necessary for my blockdata.

Full price storage for AWS:
First 50 TB / Month   $0.023 per GB
Next 450 TB / Month   $0.022 per GB
Over 500 TB / Month   $0.021 per GB
You can then reduce these costs by up to 72% if you commit to a certain spend.

Data transfer out of AWS:
Up to 1 GB / Month      $0.00 per GB
Next 9.999 TB / Month   $0.09 per GB (About $500 a year)

Consider that your data is all alone on your VPS too.  If you were on AWS, you could transfer your data to other AWS clients (like me) for $0.01 per GB.  Smiley

Also, you can get a server with 256GB of RAM and 32 processors for $1.50 per hour.  You attach your storage to the VPS, run your queries for however long it takes, then terminate the instance and move your storage back to your existing lower powered system.

Now that you have an established site and use case for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested.  I'm in training to be certified as a cloud consultant.

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
This post is the result of some trial & error. I also noticed blockdata.loyce.club/ gets terribly slow once in a while, which made RamNode useless for this data.

Really curious how that test works out. I do hope it does a little bit more than just merge the files without sorting them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.
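For anyone who wants to see what sort -mu does, here's a tiny toy example (made-up values instead of addresses):
Code:
printf '1\n3\n5\n' > a.txt      # already sorted
printf '2\n3\n9\n' > b.txt      # already sorted
sort -mu a.txt b.txt            # merge without re-sorting, drop duplicates
# prints 1 2 3 5 9, one value per line (the duplicate 3 collapsed)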

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.
I'll try:
Code:
Old code:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 | gzip > newchronological.txt.gz
real    194m24.456s

New:
time comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) > newaddresses.txt
real    8m4.045s
time cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 > all_daily_addresses_chronological_order.txt
real    1m14.593s
cat all_daily_addresses_chronological_order.txt newaddresses.txt | nl -nln | sort -k2 -S80% > test.txt
real    0m36.948s

I discovered uniq -f1 on stackexchange:
Code:
cat test.txt | uniq -df1 | sort -nk1 -S80% | cut -f2 > test2.txt
real    0m7.721s
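A tiny toy example of what uniq -df1 does here (made-up tab-separated data, shaped like nl output): it skips field 1 (the line number) when comparing, and -d prints only entries whose address occurs more than once, keeping the first line of each group:
Code:
printf '1\tb\n2\ta\n3\tb\n' | sort -k2 | uniq -df1
# prints only the "1<TAB>b" line: b occurs twice (ignoring the number), and the first line of that group is kept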

Code:
Combined:
time cat <(cat <(cat ../daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c ../addresses_sorted.txt.gz) <(sort -u ../daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2 > newaddresses_chronological.txt
real    9m45.163s
Even more combined:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) <(cat <(cat <(cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2) > new.alladdresses_chronological.txt
real    19m34.926s
This can significantly improve performance, especially if I keep uncompressed files for faster access. But something's wrong: I have 3 different output files from 3 different methods.

split -l 50000000 sorted.txt (it will split into files named xaa, then xab, ...)
I don't see any benefit in splitting files for processing.



I have a server on RAID0 with 882MB/s read and 191MB/s write, so copying this stuff to a different place on the same disk will take about 40 seconds or so for a 30GB dataset.
Dedicated? Cheesy That's the dream Shocked But even then, sorting data means reading and writing the same data several times.

You are on AWS, right?   Why not have your sponsor upgrade your instance to a higher class for a few hours?  That's the beauty of on-demand processing. Smiley
I don't want to be demanding, and AWS charges (my sponsor) $0.09 per GB. That's okay for HTML, but not for large files. My List of all Bitcoin addresses with a balance alone has transferred 450 GB since the start of this year. That would be $1000 per year on AWS, while it costs only a fraction of that elsewhere. I love how reliable AWS is, it just always works, but that's not necessary for my blockdata.
Vod
legendary
Activity: 3668
Merit: 3010
Licking my boob since 1970
It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets; it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM... it's not that expensive to use it for only a couple of hours per month.

You are on AWS, right?   Why not have your sponsor upgrade your instance to a higher class for a few hours?  That's the beauty of on-demand processing. Smiley

EC2?   Those are dedicated resources, not shared.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
(I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
It is a good offer Smiley Currently, disk space isn't the problem. I am looking for a webhost that allows me to abuse the disk for a few hours continuously once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do this.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. This looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it for only a couple of hours per month.

I have a server on RAID0 with 882MB/s read and 191MB/s write, so copying this stuff to a different place on the same disk will take about 40 seconds or so for a 30GB dataset.

AWS VPSes run on shared hardware, so that's probably why you're getting throttled. There are dedicated servers on AWS where you're in total control of the hardware and they don't throttle you. But I'm glad the RamNode account worked out for you. Let me know if you need help writing automation stuff.
member
Activity: 348
Merit: 34
You were discussing sorting, removing duplicates, and making the list available both raw and sorted.
My system is an i3-6100 with 16 GB of DDR4 RAM, and on it I sort and deduplicate the raw 19 GB file within 1 hour; the daily data is just a few minutes of work.
Let me explain.
Simply do:
sort raw.txt >> sorted.txt
split -l 50000000 sorted.txt (it will split into files named xaa, then xab, ...)
Next, remove duplicates with perl, which is fast and can load a file of roughly 3 GB; we make it even faster by limiting each chunk to 50 million lines:
perl -ne'print unless $_{$_}++' xaa > part1.txt
2nd file:
perl -ne'print unless $_{$_}++' xab > part2.txt
In the end you'll have completed all files within 1 hour.

Now combine all files:
cat part*.txt >> full-sorted.txt
or select them explicitly in sorted order (part1.txt ... part10.txt):
cat part1.txt part2.txt part3.txt >> full-sorted.txt

Stage 2
For the 2nd group you can continue onward from 21 Dec 2020: take all the daily update files, combine them, sort them and remove duplicates.
Name the result new-group.txt.

The command is:
join new-group.txt full-sorted.txt >> filter.txt

Here filter.txt holds the addresses common to the 2 files (new-group.txt and full-sorted.txt).
Now remove filter.txt from new-group.txt to get only the pure new addresses:

awk 'FNR==NR{ a[$1]; next } !($1 in a)' filter.txt new-group.txt >> pure-new-addresses.txt

Stage 3
If you still need everything in one file:

Combine pure-new-addresses.txt and full-sorted.txt:
cat pure-new-addresses.txt full-sorted.txt >> pre-full-sorted.txt
sort pre-full-sorted.txt >> new-full-addresses.txt


It's recommended to keep one file as it was last created on 21 Dec 2020, and start a 2nd file onward from there: perform only stage 2 and you will have only the new addresses that don't appear in the first 19 GB file.

I hope I've covered all the points and that this will help you and the community. For any further info, just ask; I'm happy to share whatever I have.
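For reference, a rough script version of stage 2 (just a sketch of the steps above, assuming both lists are plain text with one address per line and that new-group.txt and full-sorted.txt are both sorted, which join requires):
Code:
# addresses that appear in both the new batch and the big sorted file:
join new-group.txt full-sorted.txt > filter.txt
# keep only addresses from the new batch that are NOT in filter.txt:
awk 'FNR==NR { a[$1]; next } !($1 in a)' filter.txt new-group.txt > pure-new-addresses.txt
# (comm -13 full-sorted.txt new-group.txt would produce the same result in one step)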
newbie
Activity: 6
Merit: 15
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
       -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

Really curious how that test works out. I do hope it does a little bit more than just merge the files without sorting them.

I do see that for the other list it might be a bit more difficult...

It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets; it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.

I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However, a bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
Some quick math:
1GB: 1 in 13 false positives
2GB: 1 in ~170
3GB: 1 in ~2,200
4GB: 1 in ~28,000
5GB: 1 in ~365,000
6GB: 1 in ~4,700,000
7GB: 1 in ~61,000,000
8GB: 1 in ~800,000,000


Of course this would require some hashing overhead, but that should still be much cheaper than looping over your 1.5 billion addresses. Unfortunately you'd still have to double-check any positives, because they might be false.
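Those numbers can be reproduced with a quick back-of-the-envelope calculation (a sketch assuming ~1.5 billion addresses, 1 GB = 8×10^9 bits, and an optimally tuned filter, where the false positive rate is roughly 0.6185^(m/n)):
Code:
awk 'BEGIN {
  n = 1.5e9                        # number of addresses stored in the filter
  for (gb = 1; gb <= 8; gb++) {
    m = gb * 8e9                   # filter size in bits
    p = exp((m / n) * log(0.6185)) # false positive rate with an optimal number of hash functions
    printf "%dGB: 1 in %.0f\n", gb, 1/p
  }
}'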

I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the bloom filter if there's more than 1 new address that's already in the list.

By the way, I just checked out (but not downloaded) the daily file on blockchair. It's close to 1GB (compressed), but you mentioned 20MB for new addresses on numerous occasions. I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files? I should be able to throw some code together that makes this work, fairly quickly.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
We can read n from disk line by line and compare it to the current position in k.
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
Code:
       -m, --merge
              merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.
Update: I'm testing this now.

However, the bigger problem remains: updating 1.5 billion unique addresses in chronological order. Those lists are unsorted, so for example:
Existing long list with 12 years of data:
Code:
5
3
7
2
9
New daily list:
Code:
4
3
The end result should be:
Code:
5
3
7
2
9
4
It can be done by awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets; it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.
I ended up with sort -uk2 | sort -nk1 | cut -f2.
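A quick toy check with the example above: both the awk one-liner and that pipeline keep only the first appearance of each value, so the second '3' is dropped and only '4' ends up appended:
Code:
printf '5\n3\n7\n2\n9\n4\n3\n' | awk '!a[$0]++'
printf '5\n3\n7\n2\n9\n4\n3\n' | nl | sort -uk2 | sort -nk1 | cut -f2
# both print 5 3 7 2 9 4 (one value per line)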
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.



I don't remember if I offered you this before but I can host this data for you if it's not too big
You did (more or less):
If network bandwidth is a problem I'm able to host this on my hardware if you like.
So I guess you missed my reply too:
I'm more in need for more disk space for sorting this data, but I haven't decided yet where to host it.

(I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
It is a good offer Smiley Currently, disk space isn't the problem. I am looking for a webhost that allows me to abuse the disk for a few hours continuously once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do this.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. This looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it for only a couple of hours per month.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.

I don't remember if I offered you this before but I can host this data for you if it's not too big (I can throw up to 300GB for this project). I can also set up an rsync cron job to pull updates from your temporary location too.
newbie
Activity: 6
Merit: 15
We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem, reading 30 quadrillion bytes from RAM still takes much longer than my current system.

(...)

I might be utterly mistaking, but hear me out:

Given two sorted lists:
n = 1 5 10 11 12 13 14 15 16 19 20
k = 3 6 18

We can read n from disk line by line and compare it to the current position in k.

1 < 3, write 1 to new file.
5 > 3, write 3 to file.
5 < 6, write 5 to file.
10 > 6, write 6 to file.
10 < 18, write 10 to file.
11 < 18, write 11 to file.
....
16 < 18, write 16 to file.
19 > 18, write 18 to file.
19 & nothing left in k, write 19 to file.
20 & nothing left in k, write 20 to file.

That's n + k instead of n * k, right?
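A minimal sketch of that n + k merge (just to illustrate the idea; coreutils' sort -m effectively does the same thing, and the file names here are placeholders):
Code:
# k.txt: the small sorted list, held in memory; n.txt: the big sorted list, streamed line by line
LC_ALL=C awk '
  NR == FNR { k[++kn] = $0; next }              # first file: load k into an array
  { while (i < kn && k[i+1] < $0) print k[++i]  # emit any k entries that sort before this line
    print }                                     # then emit the line from the big list
  END { while (i < kn) print k[++i] }           # flush whatever is left of k
' k.txt n.txt > merged.txt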
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.
The problem with this is that running through a 20 MB list takes a lot of time if you need to do it 1.5 billion times. Keeping the 20 MB in memory isn't the problem, reading 30 quadrillion bytes from RAM still takes much longer than my current system.

I may be able to improve on the sorted list by merging lists, and I may be able to improve on everything by keeping big temp files instead of only compressed files (but as always I need some time to do this).

Quote
Have you considered releasing the big files as torrents with a webseed? This will allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server.
No, so far download bandwidth hasn't been a problem. Only a few people have been crazy enough to download these files. If this ever goes viral, it would be a great solution though.
newbie
Activity: 6
Merit: 15
First of all, great project!



(...)
Quote
The longer the list, the longer it will take to sort one additional line.
At some point a database might beat raw text sorting, but for now I'm good with this Smiley
Using a database will not solve this problem. There are some things a DB can do to make sorting go from O(n²) to O(n²/n), but this is still exponential growth.

You make the argument that your input size is sufficiently small such that having exponential complexity is okay, and you may have a point.
Going with these two versions:
(...)
Since I got no response to my question above, I'll go with 2 versions:
  • All addresses ever used, without duplicates, in order of first appearance.
  • All addresses ever used, without duplicates, sorted.
The first file feels nostalgic, the second file will be very convenient to match addresses with a list of your own.

I don't see how sorting would be exponential for any of these lists..

All addresses ever used, without duplicates, sorted.
  • We already have a list with all the addresses ever used sorted by address (length n).
  • We have a list of (potentially) new addresses (length k).
  • We sort the list of new items in O(k log k).
  • We check for duplicates in the new addresses in O(k).
  • We then read the big list line by line while simultaneously running through the list of new addresses and comparing the values in O(n + k). In this case we can directly write the new file to disk line by line; only the list of new addresses is kept in memory.

Resulting in O(n + k log k + 2k). In this particular case one might even argue that n > k log k + 2k, and therefore O(2n) = O(n). However, it's late here and I don't like to argue.

You only need enough memory to keep the new addresses in memory and enough disk space to keep both the new and old version on disk at the same time.

The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.

I'll see if I can whip some code together.


File hosting
Have you considered releasing the big files as torrents with a webseed? This will allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server.

You might even release it in a RSS feed so that some contributors could automatically add it to their torrent clients and start downloading with e.g. max 1 Mb/s and uploading with >1Mb/s, this will quickly allow the files to spread over the peers and further move downloads away from your server.
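Purely as a sketch of how that could be set up (assuming mktorrent is available and that its -w option sets a web seed URL; the tracker and URLs below are placeholders):
Code:
# build a torrent whose web seed points back at the existing HTTP location,
# so the web server keeps acting as a permanent seed:
mktorrent -a udp://tracker.example.org:1337/announce \
          -w http://blockdata.loyce.club/alladdresses/addresses_sorted.txt.gz \
          -o addresses_sorted.torrent \
          addresses_sorted.txt.gz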


legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Daily updates also need to be posted there, if possible.
This VPS is currently downloading other data from Blockchair, which only allows one connection at a time. I expect this to take another month (at 100 kB/s); after that I can enable daily updates (txt-files with unique addresses for that day) again.

I haven't decided yet how and where to do regular updates to the 20 GB files (this is quite resource intensive).
member
Activity: 348
Merit: 34
Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.

Daily updates also need to be posted there, if possible.
Thanks!
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Due to another VPS that decided to run off with my prepayment (Lol: for 2 weeks), this data is currently unavailable. I'm not sure yet where to move, if it takes too long I'll upload the data elsewhere (but in that case without regular backups).

Update:
I've uploaded the latest version to a temporary location: blockdata.loyce.club/alladdresses/.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I am not sure what level of access you have to the AWS account sponsoring your site.
Just root access to loyce.club, but addresses.loyce.club and alladdresses.loyce.club aren't hosted at AWS. This month so far, they've passed 1 TB of traffic, so it was a good call not to use AWS (this would cost $90).

Quote
However, it is possible to setup a storage bucket so that anyone can access it, but that the requestors IP address is among the IP addresses of the same region the files are stored in.
That seems like overkill for this.

Quote
Using a database will not solve this problem. There are some things a DB can do to make sorting go from O^2 to O^2/n, but this is still exponential growth.
For a database it would only mean checking and adding 750k addresses per day, instead of sorting all the data again. I expect sort to take less time too when the majority of the ("old") data is already sorted, but I haven't tested the speed difference.

Quote
AWS is very reliable.
I have never experienced any downtime with AWS, unlike all VPS providers I've ever used. Those "external projects" don't have much priority for me; if they're down, I don't lose scraping data.

Quote
This works out to approximately a 24-minute download. I measured a download speed of ~125 Mbps using a colab instance.
It's doing the biweekly data update, which probably slowed it down too.
copper member
Activity: 1666
Merit: 1901
Amazon Prime Member #7
I had used AWS as an example because I believed you used it for some of your other projects.
Correct, loyce.club runs on AWS (sponsored).

Quote
Yes, transferring data to the internet is very expensive. You can use a CDN (content delivery network) to reduce costs a little bit. 5 TB of data is a lot.
I highly doubt I'd find a cheaper deal Cheesy I hope not to use the full 5 TB though, I expect some overselling and don't want to push it to the limit.
I am not sure what level of access you have to the AWS account sponsoring your site. However, it is possible to set up a storage bucket so that anyone can access it, provided the requestor's IP address is within the same region the files are stored in. See this stack overflow discussion. You can also set up the storage bucket such that the requestor pays for egress traffic.
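If requester-pays is the route, a minimal sketch with the AWS CLI (the bucket name is a placeholder, and the exact option names are from memory, so double-check them):
Code:
# make the downloader pay for egress on this bucket:
aws s3api put-bucket-request-payment \
    --bucket my-blockdata-bucket \
    --request-payment-configuration Payer=Requester
# a requester then has to acknowledge the charges explicitly, e.g.:
aws s3api get-object --bucket my-blockdata-bucket \
    --key addresses_sorted.txt.gz --request-payer requester addresses_sorted.txt.gz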


Quote
The longer the list, the longer it will take to sort one additional line.
At some point a database might beat raw text sorting, but for now I'm good with this Smiley
Using a database will not solve this problem. There are some things a DB can do to make sorting go from O(n²) to O(n²/n), but this is still exponential growth.

You make the argument that your input size is sufficiently small such that having exponential complexity is okay, and you may have a point.



I was under the impression that traffic out of the AWS network (for AWS) will count as egress traffic, and will be billed accordingly.
AWS charges $0.09/GB, and especially since this one is sponsored, I don't want to abuse it. I love how stable the server is though, it has never been down.
AWS is very reliable. I would not expect much downtime when using AWS or other major cloud providers. Egress traffic is very expensive though.

Downloads are fast, I've seen 20-100 MB/s. Enjoy Smiley

This works out to approximately a 24-minute download. I measured a download speed of ~125 Mbps using a colab instance.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I'm glad to see this service is being used too:
[image]

I'd love to hear feedback (because I'm curious): what are you guys using this for?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
It took a while, and the new VPS has gotten a lot slower by now, but I've enabled updates again:
Updates
Sorting a list that doesn't fit in the server's RAM is slow. Therefore I only update both large files (addresses_sorted.txt.gz and  addresses_in_order_of_first_appearance.txt.gz) twice a month (on the 6th and 21st, updates take more than a day). Check the file date here to see how old it is. If an update fails, please post here.
In between updates, I create daily updates: alladdresses.loyce.club/daily_updates/. These txt-files contain unique addresses (for that day) in order of appearance.
I won't keep older snapshots.
Downloads are fast, I've seen 20-100 MB/s. Enjoy Smiley

My latest count: 764,534,424 Bitcoin addresses have been used.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
There's a problem though. There are:
756,494,121 addresses according to addresses_in_order_of_first_appearance.txt.gz
756,524,407 addresses according to addresses_sorted.txt.gz
Obviously, these numbers should be the same. I haven't scheduled automated updates yet; I first want to recreate this data from scratch to see which number is correct.
After recreating this data, I now have 757,437,766 unique addresses (don't click this link unless you want to download 18 GB).
My next step would be to add a few days of data, and count addresses again. Next, I'll recreate all data "from scratch", and see if I end up with the same numbers. I don't know why there's a difference, and I don't like loose ends in my data.
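To pinpoint where the counts diverge, one option is to diff the two lists directly (a sketch assuming both .gz files contain one address per line; comm needs sorted input, so the chronological file is sorted on the fly):
Code:
comm -3 <(gunzip -c addresses_sorted.txt.gz) \
        <(gunzip -c addresses_in_order_of_first_appearance.txt.gz | sort -S40%) | head
# column 1: addresses only in the sorted list; column 2 (indented): only in the chronological list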