Pages:
Author

Topic: LoyceV's small Linux commands for handling big data - page 2. (Read 896 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Any updates?
Nope:
I don't know how to proceed. If you can figure it out, I'll continue this, but I don't have the time to search for it myself now.
newbie
Activity: 24
Merit: 9
Hi LoyceV,

Any updates?

Regards,
newbie
Activity: 24
Merit: 9
That is a good idea. You'll have to search for both the uncompressed and compressed keys in the list.


It is not like leading zero removed somewhere during the process?
I though pubkeys should always have 65 or 33 bytes (including the one for flag).
I'm not sure. If there are leading zeros when the pubkey is "shorter", I may be able to include them by simply counting from the right instead of from the left.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
It is not like leading zero removed somewhere during the process?
I though pubkeys should always have 65 or 33 bytes (including the one for flag).
I'm not sure. If there are leading zeros when the pubkey is "shorter", I may be able to include them by simply counting from the right instead of from the left.
legendary
Activity: 952
Merit: 1385
Keep the records that are 131 and 66 characters long. 131 chars are for the uncompress pubkey and 66 chars for the compressed pubkey. The uncompressed pubkey prefix is 04 and the compressed pukey is 02 or 03.
This one is 130 characters:
Code:
0405eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e
The compressed ones are indeed only half the size. But I meant the ones where the length varies by only 2, which is probably caused by truncating the first 148 characters.
Could it be the 148 characters isn't always the same?
Compare those 2:
Code:
483045022100a83ca95b6b3153c5fce971c1eebbeebc892ba6c297157c326a8359c9b408ce1902201904060ce4e1fbd455403546232779dc9ca7bfe3582d3055270f27f245575d0901410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4
47304402202a6531662b10fdc9d4b4e7ada1a4db2cdba6e0d22155ebbcedb8545a9ddfd1170220760a431d5eac0f32e8f6fc5812cc9ea5c1530eac37f11768288185bb87cba3ea01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4

Hmmm
It is not like leading zero removed somewhere during the process?
I though pubkeys should always have 65 or 33 bytes (including the one for flag).
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
It seems like it's not always 148 chars that can be truncated.
It seems like it Smiley Combined with the fact that only a part of the spending_witness data is the pubkey, and that the pubkey length itself can vary, I don't know how to proceed. If you can figure it out, I'll continue this, but I don't have the time to search for it myself now.
newbie
Activity: 24
Merit: 9
It seems like it's not always 148 chars that can be truncated.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Keep the records that are 131 and 66 characters long. 131 chars are for the uncompress pubkey and 66 chars for the compressed pubkey. The uncompressed pubkey prefix is 04 and the compressed pukey is 02 or 03.
This one is 130 characters:
Code:
0405eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e
The compressed ones are indeed only half the size. But I meant the ones where the length varies by only 2, which is probably caused by truncating the first 148 characters.
Could it be the 148 characters isn't always the same?
Compare those 2:
Code:
483045022100a83ca95b6b3153c5fce971c1eebbeebc892ba6c297157c326a8359c9b408ce1902201904060ce4e1fbd455403546232779dc9ca7bfe3582d3055270f27f245575d0901410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4
47304402202a6531662b10fdc9d4b4e7ada1a4db2cdba6e0d22155ebbcedb8545a9ddfd1170220760a431d5eac0f32e8f6fc5812cc9ea5c1530eac37f11768288185bb87cba3ea01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4
newbie
Activity: 24
Merit: 9
Keep the records that are 131 and 66 characters long. 131 chars are for the uncompress pubkey and 66 chars for the compressed pubkey. The uncompressed pubkey prefix is 04 and the compressed pukey is 02 or 03.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Hi, I have no idea why it is doing that.

Is it possible to keep with one of the lines?
I'd need to know which one to keep, and even better: how to decide that from the raw data. The current output is 81 GB (815,497,912 lines), which is too large to sort on this server (only 93 GB disk space remaining). Sorting is needed to remove all duplicates, which is needed to reduce the file size before checking for addresses with a balance.
newbie
Activity: 24
Merit: 9
Hi, I have no idea why it is doing that.

Is it possible to keep with one of the lines? I'd probably guess that the address will be empty anyways or might have to do some manual on the data and the final comparison for a positive balance.

Regards,
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021


Yes, I'm looking for spending_signature_hex. Sorry about the mixup.
I'm currently running this:
Code:
for day in `ls inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,18 | grep -v spending_signature_hex | grep pubkeyhash | grep -vP "\t$" >> output2.txt; done
If it doesn't run out of disk space, I'll continue from there tomorrow. It won't fit for sure Sad That's going to be a problem, because I need a few times more disk space to sort the data. 1 TB might not even be enough.

This is a bit smaller:
Code:
for day in `ls /var/www/blockdata.loyce.club/public_html/inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,18 | grep -v spending_signature_hex | grep pubkeyhash | grep -vP "\t$" | cut -f1,3 >> output2.txt; done


MrFreeDragon already described what we're looking for here.

I'll try this:
Code:
for day in `ls /var/www/blockdata.loyce.club/public_html/inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,18 | grep -v spending_signature_hex | grep pubkeyhash | grep -vP "\t$" | cut -f1,3 > /dev/shm/tmp.file; paste <(cat /dev/shm/tmp.file | cut -f1) <(cat /dev/shm/tmp.file | cut -f2 | cut -c149-) | sort -u -S40% >> output2.txt; rm /dev/shm/tmp.file; done
Explanation: it truncates spending_signature_hex to only the pubkey (I used a detour, but it works without reducing performance much). This reduces the size, and I can remove duplicates for each day already. That should reduce the output size.
I'll let it run overnight.
Update: This produces different results:
Code:
1zxhKVZtMBt8kf7km2shn2mkR4NLHGigT       0405eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e
1zxhKVZtMBt8kf7km2shn2mkR4NLHGigT       05eec604993048314294f7c1f9b45c3ed8424ef940426336153831f8813228a788f845e1df353c2021174573e33f2fab05d94e1dd5e5449832ec83ac3d5db17e
Somehow the 148 characters isn't a constant.
Another example with full data:
Code:
block_id transaction_hash index time value value_usd recipient type script_hex is_from_coinbase is_spendable spending_block_id spending_transaction_hash spending_index spending_time spending_value_usd spending_sequence spending_signature_hex spending_witness lifespan cdd
355251 e73b8e08f86b6130ed00531cf3fb9fff4718f87d33bee879ff6119bd6b88f142 0 2015-05-06 18:42:33 300000 0.7111 14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C pubkeyhash 76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac 0 -1 356405 12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0 11 2015-05-14 15:51:23 0.7307 4294967295 47304402204a20d3044915706057c2a8a2a2ec7ba7e6ceb5735ab172329565df9bce70ea68022022f81e2d8b041c9eb3526168ab6ae0a6d375b1696abd86a0721cd64c7951be8101410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4 680930 0.023643402777777777
355939 077b2b53f3dc4ab060ad0e5752dd7ab4d036085d58d6e61f1bf335a8e9385c1a 0 2015-05-11 15:34:00 40000 0.097 14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C pubkeyhash 76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac 0 -1 356405 12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0 12 2015-05-14 15:51:23 0.0974 4294967295 4730440220066b7baba05a4dca623e7719a7ddca253c0ac9dd9b11c8a4d78a480c2a7b9fbf02203cc0f7f64a919ed44ca5b05cc766f708697e091fee13b9fe3be84b5a12c656c501410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4 260243 0.0012048287037037036
355948 8d43ed844cc99bfdb0faa1fa0f7803af32e21457053aba385380d712393773f5 0 2015-05-11 16:35:42 28300 0.0687 14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C pubkeyhash 76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac 0 -1 356405 12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0 13 2015-05-14 15:51:23 0.0689 4294967295 4830450221008bf415b6c4bc7118a1d93ef8f6c63b0801d9abe2e41e390670acf9677ee58e5602200da3df76f11ae04758c947a975f84dd7dba990e00c146b451dc4fa514c6cb52d01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4 256541 0.0008402905439814814
356093 65c62d27591ae513e9b5ddb1630b083e9f804c0bc89052c5c17e9ee3cdde29a6 0 2015-05-12 14:16:22 10000 0.0242 14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C pubkeyhash 76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac 0 -1 356405 12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0 14 2015-05-14 15:51:23 0.0244 4294967295 483045022100a83ca95b6b3153c5fce971c1eebbeebc892ba6c297157c326a8359c9b408ce1902201904060ce4e1fbd455403546232779dc9ca7bfe3582d3055270f27f245575d0901410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4 178501 0.00020659837962962966
356247 9873b5b567b7947fdb271ed37240da6ee45085f9804565c82a02f8a3a376833d 0 2015-05-13 14:46:45 12000 0.0292 14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C pubkeyhash 76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac 0 -1 356405 12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0 15 2015-05-14 15:51:23 0.0292 4294967295 47304402202a6531662b10fdc9d4b4e7ada1a4db2cdba6e0d22155ebbcedb8545a9ddfd1170220760a431d5eac0f32e8f6fc5812cc9ea5c1530eac37f11768288185bb87cba3ea01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4 90278 0.00012538611111111112
356375 a6afeb2893da616d75c39a4eeaed51c91fef715930f356988665a32028cad057 204 2015-05-14 10:33:58 99600 0.2426 14XvKaYnFEH2SioW2oRpgCKaMsaFwmFT7C pubkeyhash 76a91426c15da2beb1fb4a6db5ba4c993a806d22cd10d688ac 0 -1 356405 12c440ae9cd07a530d5a942e56cfae0f0b3797bd3c579366ecd0a7fa8a2d24d0 16 2015-05-14 15:51:23 0.2426 4294967295 47304402202ad9c30164315b7ae2a3d280f629a014d316f4177e4dbde6167bd83c628a4d050220308bba5e628de5a47606d1e0ecd939332f7036a6da7eb6458e2bd6e690bd553d01410421557041f930252b79b0fa28e6587680053b3a3672ff0c1dca6a623c79bdc0b6125a7a2be5450e28e49731ba8f60231dd8eceeff170923717d97a1ca5a67acd4 19045 0.00021954652777777774
Any idea why? The pubkey doesn't seem to have a fixed length either, so counting characters from the end won't work.
newbie
Activity: 24
Merit: 9
Hi LoyceV,

Yes, I'm looking for spending_signature_hex. Sorry about the mixup.

Kindly let it process all the blocks up to the current one.

Thanks again
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Can you process the inputs data dumps by extracting the following columns from them: recipient, type=pubkeyhash only, spending_witness.
I did this:
Code:
for day in `inputs/*gz`; do echo $day; gunzip -c $day | cut -f7,8,19 | grep -v spending_witness | grep pubkeyhash | grep -vP "\t$" >> output.txt; done
It's still running (and will take many hours to complete), but up until 2014, there is no output at all. I'll let it finish to be sure.
Update: as expected: no output.

Quote
Code:
53225 24cf2dedab3c7898ec0f0532e177f15f41984b0c820202bcb143e43eec0c25d2 1 2010-04-27 00:08:11 4106000000 0.4106 15Z5YJaaNSxeynvr6uW6jQZLwq3n1Hu6RX pubkeyhash 76a91431f19a7d0379f56cb3be0761c21f1f0c9553a47f88ac 0 -1 53241 8084468e05c6faa4029fbfd7b9b9d33e2274e9f40634207bcf86197fc6f83af5 1 2010-04-27 02:45:48 0.4106 4294967295 483045022069e95f67cc6fed7db01885d76e3294e443d9833316228bbd057f0f3a3bdd51630221009ef89fa8c34f37a245333c67d56feaef27b26651c4a30b6499e1bc386337ca8f0141047a51392bace353f4c3788c9c090ef4f635ec211159ec3b9f1bb7da7679517e126e98e0012bcb4d2b023c479afaaa1ad703ea1b24e1910e2cdad38744ba7aab8a 9457 4.494264120370371
Are you looking for "483045022069e95f67cc6fed7db01885d76e3294e443d9833316228bbd057f0f3a3bdd516302210 09ef89fa8c34f37a245333c67d56feaef27b26651c4a30b6499e1bc386337ca8f0141047a51392b ace353f4c3788c9c090ef4f635ec211159ec3b9f1bb7da7679517e126e98e0012bcb4d2b023c479 afaaa1ad703ea1b24e1910e2cdad38744ba7aab8a"? That's not spending_witness, it's spending_signature_hex. For pubkeyhash entries, spending_witness is empty (there are double tabs in the .tsv file without any data in between, so without using a spreadsheet it's easy to overlook).
newbie
Activity: 24
Merit: 9
Hi LoyceV,

Can you do something similar to what you did for the outputs? Can you process the inputs data dumps by extracting the following columns from them: recipient, type=pubkeyhash only, spending_witness. Once done, remove duplicate entries and cross-reference with the "List of all funded Bitcoin addresses" keeping only the recipients with a positive balance. Compare the recipients with balance_addy_pubkey.txt and the difference will be the results I'm looking for. The spending_witness will contain the pubkey of those pubkeyhash addresses (truncating the 1st 148 characters will leave only the pukey remaining).

The instructions would be useful too.

Regards,

Code:
53225 24cf2dedab3c7898ec0f0532e177f15f41984b0c820202bcb143e43eec0c25d2 1 2010-04-27 00:08:11 4106000000 0.4106 15Z5YJaaNSxeynvr6uW6jQZLwq3n1Hu6RX pubkeyhash 76a91431f19a7d0379f56cb3be0761c21f1f0c9553a47f88ac 0 -1 53241 8084468e05c6faa4029fbfd7b9b9d33e2274e9f40634207bcf86197fc6f83af5 1 2010-04-27 02:45:48 0.4106 4294967295 483045022069e95f67cc6fed7db01885d76e3294e443d9833316228bbd057f0f3a3bdd51630221009ef89fa8c34f37a245333c67d56feaef27b26651c4a30b6499e1bc386337ca8f0141047a51392bace353f4c3788c9c090ef4f635ec211159ec3b9f1bb7da7679517e126e98e0012bcb4d2b023c479afaaa1ad703ea1b24e1910e2cdad38744ba7aab8a 9457 4.494264120370371
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Get pubkeys out of Bitcoin block data (this was requested here).

Note: this list is not meant for verbatim copy/pasting, it's my own notes of what I did (more or less).

Get outputs data (currently 148 GB)
Code:
wget -r blockdata.loyce.club/outputs/

Get all addresses with pubkey
Code:
for day in `ls outputs/*gz`; do echo $day; gunzip -c $day | grep -v is_from_coinbase | grep -v pubkeyhash | grep pubkey | cut -f 1,2,4-11 >> output.txt; done
(output.txt includes more columns than strictly needed)

Get currently funded Bitcoin addresses and their balance (1 GB)
Code:
wget addresses.loyce.club/blockchair_bitcoin_addresses_and_balance_LATEST.tsv.gz

Get all unique addresses that are in both lists
Code:
comm -12 <(cat output.txt | cut -f6 | sort -u -S40%) <(gunzip -c blockchair_bitcoin_addresses_and_balance_LATEST.tsv.gz | grep -v balance | cut -f1 | grep "^1" | sort -S40% ) > list

Get list of balances, addresses and pubkeys
Code:
gunzip -c blockchair_bitcoin_addresses_and_balance_LATEST.tsv.gz | grep "^1" | sort -rS60% | uniq -w30 -d > address_and_balance
cat <(cat output.txt | cut -f 6,8) list | sort -rS40% | uniq -w30 -d > address_and_pubkey
for addy in `cat list`; do balance="`grep $addy address_and_balance | cut -f2`"; pubkey="`grep $addy address_and_pubkey | cut -f2`"; echo "$balance $addy $pubkey"; done | sort -nr > balance_addy_pubkey.txt
(the last for-loop is quick and dirty, slow and inefficient, but considering there's not that much data, I didn't bother improving it)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I like simple (Linux) commands that process a lot of data. A console is much more powerful than a GUI.
I've posted them in various topics, but from now on will collect them here. Use them at your own risk.

Warning
Don't just copy paste anything you find online into a console! Try to understand it first.

Self-moderated
No spam please. Questions are of course okay. Adding content is appreciated.

Overview
Pages:
Jump to: