These are not just thoughts, but a description of a working system that syncs the entire blockchain and creates the above data structures in about half an hour if you have enough bandwidth and 8 cores. It is bandwidth limited, so on a slow home connection of 20 Mbps a full sync takes about 6 hours.
Bandwidth limited = I/O and CPU limited then?
We had a similar discussion in another thread, but anyway I want to write it more concisely here: I really believe scaling can be achieved through tradeoffs between what a user has and what he lacks. Every machine will hit a bottleneck somewhere; the issue from that point onward is how to juggle the tradeoffs to work around that bottleneck.
Examples:
- A user has a slow internet connection but plenty of CPU => he should have the option of downloading (and uploading, if he is running a node) highly compressed blocks from other cooperating nodes that are willing to use compressed blocks (see the compression sketch after this list).
- If the same user has a slow CPU as well => he should be able to tap into GPU resources for parallel compression/decompression (it's feasible - I've seen papers on parallel compression on GPUs).
- If a user has slow I/O but a fast CPU => compress the blockchain on disk for more local throughput.
- If a user has low RAM but a fast CPU => compress data in RAM (for things like a compressed cache).
- If a user has slow I/O or low RAM but will bottleneck on CPU if he goes to a compressed scheme => tap into the GPU.
- If a user wants to increase the cache size to improve performance but lacks RAM => use compressed RAM, tapping into more CPU, to reduce I/O.
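To make the first tradeoff concrete, here is a minimal sketch (not from any actual node software) of trading CPU for bandwidth with zlib: the sender compresses a serialized block before transmitting it, and the receiver pays CPU to decompress. The block buffer and its size are placeholders.

```c
// Minimal sketch of trading CPU for bandwidth with zlib (link with -lz).
// The block buffer and sizes here are placeholders, not real protocol data.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    unsigned char rawblock[100000];                 // pretend this is a serialized block
    memset(rawblock, 0xab, sizeof(rawblock));       // placeholder contents

    uLongf zlen = compressBound(sizeof(rawblock));  // worst-case compressed size
    unsigned char *zbuf = malloc(zlen);
    if ( compress2(zbuf, &zlen, rawblock, sizeof(rawblock), Z_BEST_COMPRESSION) != Z_OK )
        return 1;
    printf("raw %zu bytes -> compressed %lu bytes\n", sizeof(rawblock), (unsigned long)zlen);

    // The peer (or the local disk/RAM cache) pays CPU to get the raw bytes back.
    unsigned char out[100000]; uLongf outlen = sizeof(out);
    if ( uncompress(out, &outlen, zbuf, zlen) != Z_OK || outlen != sizeof(rawblock) )
        return 1;
    printf("decompressed back to %lu bytes\n", (unsigned long)outlen);
    free(zbuf);
    return 0;
}
```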
Essentially you try to trade what you have for what you want to achieve. One-size-fits-all solutions won't work particularly well, because one user has a 1 Gbps connection and another has 4 Mbps. One is running on a Xeon, the other on a laptop. One has 16 GB of memory, another has 2. One can tap into a GPU for parallel work, another can't because he doesn't have one.
IMO the best scaling solutions will be adjustable, so that the bottleneck in a particular system can be identified and the necessary tradeoffs made to compensate for its deficiencies. Compression is definitely one of the tools that can balance these needs, whether one lacks RAM, disk space on an SSD, network bandwidth, etc. And GPUs can definitely help with the processing, whether the load is directly related to BTC operations or to compression/decompression.
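As a purely hypothetical illustration of "adjustable": a node could probe its own resources at startup and pick a tradeoff. The thresholds, the measured link speed, and the strategy names below are made up; _SC_PHYS_PAGES is a Linux-specific sysconf query.

```c
// Hypothetical resource probe: pick a tradeoff based on what the machine has.
// Thresholds and strategy names are illustrative only; _SC_PHYS_PAGES is Linux-specific.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);        // available CPU cores
    long pagesz = sysconf(_SC_PAGESIZE);
    long pages = sysconf(_SC_PHYS_PAGES);
    double ram_gb = (double)pages * pagesz / (1024.*1024.*1024.);
    double mbps = 20.;                                  // assume a measured link speed (placeholder)

    if ( mbps < 50. && cores >= 4 )
        printf("slow link, spare CPU -> request compressed blocks from peers\n");
    else if ( ram_gb < 4. && cores >= 4 )
        printf("low RAM, spare CPU -> keep caches compressed in memory\n");
    else
        printf("no obvious shortage -> run with plain blocks (%ld cores, %.1f GB RAM)\n",
               cores, ram_gb);
    return 0;
}
```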
The problem is that a lot of the data is high entropy and not very compressible. I believe I have identified and extracted most of the redundancy. Compressing the dataset is a one-time operation, and it doesn't really take much time to decompress.
The CPU is a bottleneck only if your connection is faster than 500 megabits/sec (probably half that, as that figure is without verifying the sigs yet), but there is plenty of idle time during the sync.
With bandwidth varying, sync takes anywhere from 30 minutes to 6 hours. The slower connection provides plenty of idle time for the CPU to keep up.
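A quick back-of-the-envelope check of those figures: 20 Mbps for 6 hours works out to roughly 54 GB moved, and pushing the same amount over a ~500 Mbps link takes well under half an hour, which is where the CPU takes over as the limit. The 54 GB figure is only implied by the numbers above, not stated, so treat it as an assumption in this sketch.

```c
// Back-of-the-envelope sync time from link speed, using the figures in the post:
// 20 Mbps -> ~6 hours implies roughly 54 GB of data moved during a full sync (assumption).
#include <stdio.h>

int main(void)
{
    double data_gb = 54.;                      // implied by 20 Mbps * 6 hours
    double speeds_mbps[] = { 20., 100., 500. };
    for (int i = 0; i < 3; i++) {
        double bytes_per_sec = speeds_mbps[i] * 1e6 / 8.;
        double hours = data_gb * 1e9 / bytes_per_sec / 3600.;
        printf("%6.0f Mbps -> %.1f hours of pure download time\n", speeds_mbps[i], hours);
    }
    return 0;
}
```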
I eliminated the HDD as a bottleneck by eliminating most of the seeks and using an append-only set of files written directly into memory-mappable data structures.
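A minimal sketch of that append-only + memory-map pattern (POSIX calls, placeholder record layout, not the actual data structures): records are written sequentially with no seeks, then the finished file is mapped read-only and the structures are used in place.

```c
// Minimal sketch of the append-only file + memory-map pattern (POSIX).
// The record layout here is a placeholder, not the real data structures.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

struct record { uint32_t height; uint8_t hash[32]; };   // placeholder layout

int main(void)
{
    // 1) Append records sequentially: no seeks, HDD-friendly.
    int fd = open("bundle.dat", O_CREAT | O_WRONLY | O_APPEND, 0644);
    for (uint32_t h = 0; h < 1000; h++) {
        struct record r; memset(&r, 0, sizeof(r)); r.height = h;
        write(fd, &r, sizeof(r));
    }
    close(fd);

    // 2) Map the finished file read-only and use the structures in place.
    fd = open("bundle.dat", O_RDONLY);
    struct stat st; fstat(fd, &st);
    struct record *recs = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if ( recs == MAP_FAILED )
        return 1;
    size_t n = st.st_size / sizeof(struct record);
    printf("mapped %zu records, last height %u\n", n, recs[n-1].height);
    munmap(recs, st.st_size);
    close(fd);
    return 0;
}
```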
Each bundle can then be verified independently, and a chain of the bundle summaries can be used for a quick start, with full validity checked over the next 6 hours. You just need to not allow tx until everything is fully verified, or limit tx to ones that are known to be valid, i.e. fresh addresses having new funds sent to them.
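One way to picture the chain of bundle summaries (a sketch only; the summary contents are placeholders and SHA-256 via OpenSSL is just an assumption for illustration): each link folds the previous chain value together with the next bundle's summary, so a short chain commits to all bundles while the heavy per-bundle verification is deferred.

```c
// Sketch of chaining bundle summaries with SHA-256 (link with -lcrypto).
// The summary contents are placeholders; only the chaining idea matters here.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

int main(void)
{
    uint8_t chain[SHA256_DIGEST_LENGTH]; memset(chain, 0, sizeof(chain));
    for (uint32_t bundle = 0; bundle < 3; bundle++) {
        uint8_t summary[64];                       // placeholder bundle summary
        memset(summary, 0, sizeof(summary));
        memcpy(summary, &bundle, sizeof(bundle));  // e.g. bundle index, merkle root, etc.

        // Fold the previous chain value and this bundle's summary into the next link.
        uint8_t buf[sizeof(chain) + sizeof(summary)];
        memcpy(buf, chain, sizeof(chain));
        memcpy(buf + sizeof(chain), summary, sizeof(summary));
        SHA256(buf, sizeof(buf), chain);

        printf("bundle %u -> chain %02x%02x%02x...\n", bundle, chain[0], chain[1], chain[2]);
    }
    return 0;
}
```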
I use adaptive methods so it goes as fast as the entire system is able to. After all, if you only have one core, then it will be the bottleneck for validating 100,000 blocks worth of sigs. The way to have different "sizes" is to have a full node, a validating node, a truncated node (dump the sigs after they are verified), and lite nodes.
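In the simplest terms, "adaptive" can mean scaling the number of verification workers to however many cores exist, so a one-core box just runs the bundles serially. A rough pthreads sketch (verify_bundle is a stand-in for the real sig checking, not the actual code):

```c
// Sketch: scale signature-verification workers to the number of cores (POSIX threads).
// verify_bundle() is a stand-in for the real per-bundle sig checking.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define NUM_BUNDLES 100          // pretend the chain is split into 100 bundles

static void verify_bundle(int b) { (void)b; usleep(1000); /* placeholder for real sig checks */ }

struct worker { int id, nworkers; };

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (int b = w->id; b < NUM_BUNDLES; b += w->nworkers)   // strided split of the work
        verify_bundle(b);
    return NULL;
}

int main(void)
{
    int n = (int)sysconf(_SC_NPROCESSORS_ONLN);   // adapt to however many cores exist
    if ( n < 1 ) n = 1;
    pthread_t *tids = malloc(n * sizeof(*tids));
    struct worker *ws = malloc(n * sizeof(*ws));
    for (int i = 0; i < n; i++) {
        ws[i].id = i; ws[i].nworkers = n;
        pthread_create(&tids[i], NULL, worker_main, &ws[i]);
    }
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    printf("verified %d bundles using %d worker(s)\n", NUM_BUNDLES, n);
    free(tids); free(ws);
    return 0;
}
```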
James