Topic: Ultimate blockchain compression w/ trust-free lite nodes - page 12. (Read 87954 times)

legendary
Activity: 1428
Merit: 1093
Core Armory Developer
By recommending a binary tree, or rather, a Merkle tree in the form most resembling a binary tree, I am suggesting that from block to block, the miner has the option of either just updating the tree (in a well-defined deterministic manner but leaving it unbalanced) or updating and rebalancing the tree such that all of the leaf nodes are the same distance (or distance-1) from the root, again in a well-defined deterministic manner.

I am not suggesting leaving the implementation up to the STL or any other form of library.

I don't believe Patricia trees are "simpler" when measured in the amount of human learning and neurons one must dedicate to understanding the concept.  That doesn't mean I think it's too hard to learn, but rather, I doubt the cost (measured in terms of complexity of the specification) is worth the benefit.

If you tell a developer, "Now you've got to learn what a Patricia tree is", and then "Now that you've implemented it, you've got to simulate numerous cases of rollbacks to test and feel good that your implementation works backwards as well as forward" you have just made many more developers say "to hell with it, I'll develop something else".

... not to mention implementing a chain reorg strategy consisting of "now talk to your peers and start asking them for now-orphaned blocks (hopefully they have them still) so you can roll your tree back intact" rather than starting with a copy of the tree at or before the point in time of the split and rolling it forward.

Either way, the developer has to get into the implementation details of the data structure.  They have to understand it.  And really, neither structure is particularly complicated.  Perhaps some devs are more familiar with BSTs.  But to say that a miner "has the option" to rebalance -- that doesn't make sense.  Any rebalancing operation on a BST will change the root hash.  It must be determined from the start exactly when and how rebalance ops will happen.  Or else everyone gets different answers.

And as for the comment about "simulating numerous cases of rollbacks" -- This case is dramatically simpler with a Patricia tree structure -- you just add and remove elements from the tree using standard insert & delete operations.  It doesn't get much simpler than that (besides maybe keeping around the last few blocks' worth of deleted TxOuts, which is probably a few kB).  On the other hand, you may be talking about gigabytes of data to store "backups" or "snapshots" of the BST, just in order to accommodate the possibility of a rollback.  And how many copies do you need to store?  You can keep the last state of the tree, but what if there's a 2-block reorg?  Well, now you need two copies.  To handle arbitrary-sized rollbacks, you could really be thrashing your hard drive, and in such a way that everything has changed while you were attempting to swap gigabytes of data around.
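
To make that concrete, here is a minimal sketch in Python (with made-up names; a real UTXO set would key entries by (txid, vout)) of the undo-log idea described above. Because a trie's shape depends only on its current contents, undoing a block is just re-inserting what the block deleted and deleting what it inserted:

Code:
utxo = set()       # stands in for the contents of the authenticated trie
undo_log = []      # per block: (outputs spent by the block, outputs created by it)

def apply_block(spent, created):
    undo_log.append((set(spent), set(created)))
    utxo.difference_update(spent)
    utxo.update(created)

def rollback_block():
    spent, created = undo_log.pop()
    utxo.difference_update(created)   # delete what the block added
    utxo.update(spent)                # re-insert what the block spent

apply_block(spent=set(), created={"txA:0", "txA:1"})
apply_block(spent={"txA:0"}, created={"txB:0"})
rollback_block()
assert utxo == {"txA:0", "txA:1"}     # same contents => same trie, same root hash

The undo log only needs the last few blocks' worth of spent outputs, which is the "few kB" mentioned above.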
vip
Activity: 1386
Merit: 1140
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
If you tell a developer, "Now you've got to learn what a Patricia tree is", and then "Now that you've implemented it, you've got to simulate numerous cases of rollbacks to test and feel good that your implementation works backwards as well as forward" you have just made many more developers say "to hell with it, I'll develop something else".

This sort of a developer is dangerous to have developing Bitcoin's internal structures. I think it's better they stay away.


Which sort of developer?  The one who revels in complexity, as though complexity breeds integrity?  This guy is surely already busy on his first implementation of Bitcoin, in assembly language.  He'll be done by 2017, assuming the architecture he's developing for is still popular enough that people will be able to run it.

Or do you mean the one who walks away?  And this benefits bitcoin because the fewer clients, the better?
legendary
Activity: 2128
Merit: 1073
It will almost certainly require a specific implementation in every language clients need. This has dangers, but it's not impossible. Extensive test sets will be needed that have 100% coverage in a reference implementation, to be sure every edge case or weird rebalancing rule is tested.
I think the requirement for "every language" is vast overkill. I would say from my past experience that it is sufficient to have a clean, portable C implementation (or a C++ implementation in a C style, without reliance on unstable C++ language features like std::* or boost::*). Once that's done, in my experience the code can be transcribed to just about anything that is Turing-equivalent.

But such a C (or subset-of-C++) implementation will have to deal correctly with endianness and alignment issues.

I'm not sure if the core development team is willing to commit to an endian-correct (or endian-neutral) implementation of Bitcoin.
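
For what it's worth, here is a trivial illustration (in Python rather than C, with made-up function names) of what endian-neutral serialization means in practice: make the byte order explicit at every integer-to-bytes conversion instead of relying on host order, so the serialized form, and therefore any hash over it, is identical on every machine.

Code:
import struct

def ser_uint32(x):
    return struct.pack('<I', x)        # always little-endian, regardless of host

def deser_uint32(b):
    return struct.unpack('<I', b)[0]

assert ser_uint32(0x12345678) == b'\x78\x56\x34\x12'
assert deser_uint32(ser_uint32(0x12345678)) == 0x12345678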
member
Activity: 97
Merit: 10
If you tell a developer, "Now you've got to learn what a Patricia tree is", and then "Now that you've implemented it, you've got to simulate numerous cases of rollbacks to test and feel good that your implementation works backwards as well as forward" you have just made many more developers say "to hell with it, I'll develop something else".

This sort of a developer is dangerous to have developing Bitcoin's internal structures. I think it's better they stay away.
vip
Activity: 1386
Merit: 1140
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
By recommending a binary tree, or rather, a Merkle tree in the form most resembling a binary tree, I am suggesting that from block to block, the miner has the option of either just updating the tree (in a well-defined deterministic manner but leaving it unbalanced) or updating and rebalancing the tree such that all of the leaf nodes are the same distance (or distance-1) from the root, again in a well-defined deterministic manner.

I am not suggesting leaving the implementation up to the STL or any other form of library.

I don't believe Patricia trees are "simpler" when measured in the amount of human learning and neurons one must dedicate to understanding the concept.  That doesn't mean I think it's too hard to learn, but rather, I doubt the cost (measured in terms of complexity of the specification) is worth the benefit.

If you tell a developer, "Now you've got to learn what a Patricia tree is to write a Bitcoin client", and then "Now that you've implemented it, you've got to simulate numerous cases of rollbacks to test and feel good that your implementation works backwards as well as forward" you have just made many more developers say "to hell with it, I'll develop something else".

... not to mention implementing a chain reorg strategy consisting of "now talk to your peers and start asking them for now-orphaned blocks (hopefully they have them still), preferably delivered in reverse order, so you can roll your tree back intact" rather than starting with a copy of the tree at or before the point in time of the split and rolling it forward.
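
One way to make such a deterministic rebalance concrete (this is only an illustrative sketch in Python; the particular rule is an assumption, not part of the proposal) is to rebuild the Merkle tree over the sorted leaf hashes by splitting the list in half recursively, which puts every leaf at depth ceil(log2 n) or one less:

Code:
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def merkle_root(leaf_hashes):
    """Balanced Merkle root over a non-empty, sorted list of leaf hashes."""
    n = len(leaf_hashes)
    if n == 1:
        return leaf_hashes[0]
    mid = (n + 1) // 2                 # split as evenly as possible
    return H(merkle_root(leaf_hashes[:mid]) + merkle_root(leaf_hashes[mid:]))

leaves = sorted(H(bytes([i])) for i in range(5))
root = merkle_root(leaves)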
legendary
Activity: 1072
Merit: 1189
Obviously a fully-specified algorithm will need to be decided upon. That is no different from the definition of the Merkle tree implementation (together with its weird odd-number-of-hashes-on-a-level behaviour) currently in use by Bitcoin.

Obviously implementations will be necessary for this, and one will not be able to reuse standard libraries that do not have fully specified behaviour (or don't have authentication built in). It will almost certainly require a specific implementation in every language clients need. This has dangers, but it's not impossible. Extensive test sets will be needed that have 100% coverage in a reference implementation, to be sure every edge case or weird rebalancing rule is tested.

My gut feeling makes me prefer trie/Patricia-based solutions, because their structure is independent of the history that led to their contents. This means a simple diff of the sets represented by two tries is enough to specify the modification from one to the other. Authenticated balanced trees, on the other hand, do expose their history through the Merkle root, and differences need to contain structural information. This is not a good reason, just a preference, and whoever implements the first usable and acceptable solution will probably get to specify which data structure is chosen. A fully deterministic set representation may make an implementation easier to test, though.
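
A small sketch of that preference, using toy encodings and fixed-length bit-string keys (all names and tags here are made up): the root hash of a plain trie is a function of the key set alone, while the root hash of an unbalanced BST depends on the order in which the keys arrived.

Code:
import hashlib

def H(*parts):
    return hashlib.sha256(b"".join(parts)).hexdigest()

def trie_hash(keys, depth=0):
    """Root hash of a plain binary trie over fixed-length bit-string keys."""
    if not keys:
        return H(b"null")
    if depth == len(keys[0]):              # every bit consumed: a single leaf remains
        return H(b"leaf", keys[0].encode())
    left  = trie_hash([k for k in keys if k[depth] == "0"], depth + 1)
    right = trie_hash([k for k in keys if k[depth] == "1"], depth + 1)
    return H(b"branch", left.encode(), right.encode())

def bst_insert(node, key):
    """Unbalanced BST insert: the resulting shape depends on insertion order."""
    if node is None:
        return (key, None, None)
    k, l, r = node
    if key < k:
        return (k, bst_insert(l, key), r)
    if key > k:
        return (k, l, bst_insert(r, key))
    return node

def bst_hash(node):
    if node is None:
        return H(b"null")
    k, l, r = node
    return H(b"node", k.encode(), bst_hash(l).encode(), bst_hash(r).encode())

keys = ["0011", "1100", "0110", "1001"]
a = b = None
for k in keys:
    a = bst_insert(a, k)
for k in reversed(keys):
    b = bst_insert(b, k)

print(trie_hash(keys) == trie_hash(list(reversed(keys))))   # True: same set, same structure
print(bst_hash(a) == bst_hash(b))                           # False: same set, different shapes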
legendary
Activity: 2128
Merit: 1073
Because the structure is guaranteed.
I am not buying this guarantee. In this Bitcoin milieu I've seen so much confused-endian (or accidental-endian) code that I will not take anything for granted, least of all that there will be a sensible agreement on how to represent the bignums.

Edit: Or maybe I will state it like this: The structure will be guaranteed so long as any implementor is confused the same way as the original authors of the Satoshi client and finds the implementation errors made in the original client obvious and natural.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Yes, it's possible to use BSTs, but it's a lot of work.  And it's not simple.  To design this data structure around BSTs requires every implementer of the meta-chain to use this specific BST algorithm.  There is no such ambiguity with trie structures -- you could look them up in any textbook.

It's true that every implementer will need to use the same algorithm, otherwise the root hashes will be incompatible. And of course you can't just use a standard library map because those do not support computing hashes!

But there are plenty of ambiguities involved in using tries. Here's an earlier quote from you where you mentioned an optimization using skip strings:
Quote
In the Patricia/Hybrid Tree case, there are purely branch nodes and leaf nodes, though the branch nodes may have "skip strings".  So a leaf node's hash value is just the root hash of the subtree.  And a branch node's value is the hash of the concatenated skip string and its non-null child node values.

Surely if you use skip strings, that will change the root hash, so everyone would have to agree on the particular algorithm to use? Let's make several implementations, and evaluate which one is the best.

"Skip strings" are part of the standard Patricia tree implementation.  It just may be called something different in different texts.  Either you use a Trie, Patricia tree, a De la Brandais tree, or a Hybrid tree.  Once the correct one is agreed upon, the ambiguities in implementation shrink to basically nothing.  It becomes a question of how to traverse and aggregate tree data for Bitcoin purposes, not how to implement the data structure.  That's something that will have to be done for any data structure that is used.

On the other hand, a red-black tree that is optimized differently, and thus produces a different root hash, will still be called a red-black tree.  To describe to someone what that optimization is, well, that requires a description of the algorithm (and probably code samples).

I know it's possible; I'm just pointing out the difficulties that could arise from different people unknowingly producing different tree structures.  Most likely it would happen under bizarre conditions with complicated rebalance operations, and it would be remarkably frustrating to debug.


I'm pointing all this out so that you can't say no progress is being made! Until someone from the 'trie' camp catches up, the simplest solution is a BST since some code for this already exists.

I do appreciate that you are doing this.  I wish I had time for it.  Perhaps your C++ implementation is sufficient for porting to other languages, so that such a uniform implementation can be achieved.   Clearly, I'm still opposed to it for other reasons (like necessitating backups/snapshots for re-orgs), but you can still accomplish what is needed.   And I hope that we can hit all the snags and start figuring them out sooner than later.


The claim "This just won't be possible using BST's, though." is plain false. It confuses the data structure and algorithm with their implementation. This gotta be some sort of miscommunication, or maybe the author had too much fun at a party yesterday.

Perhaps you misunderstood me.  I was saying it won't be possible for everyone to agree on the root hash unless they use the exact same implementation.  Standard implementations (such as STL map<>) are standardized only in run-time behavior, not in underlying structure.  Thus, a widespread "experiment" using BSTs won't be simple without a uniform implementation across all languages, etc.  This may be a lot of work.  However, if it were tries, it would be quite simple: anyone could download any [correct] trie implementation from anywhere and know that they will get the same answer.  Because the structure is guaranteed.

legendary
Activity: 2128
Merit: 1073
full member
Activity: 126
Merit: 110
Andrew Miller
Yes, it's possible to use BSTs, but it's a lot of work.  And it's not simple.  To design this data structure around BSTs requires every implementer of the meta-chain to use this specific BST algorithm.  There is no such ambiguity with trie structures -- you could look them up in any textbook.

It's true that every implementer will need to use the same algorithm, otherwise the root hashes will be incompatible. And of course you can't just use a standard library map because those do not support computing hashes!

But there are plenty of ambiguities involved in using tries. Here's an earlier quote from you where you mentioned an optimization using skip strings:
Quote
In the Patricia/Hybrid Tree case, there are purely branch nodes and leaf nodes, though the branch nodes may have "skip strings".  So a leaf node's hash value is just the root hash of the subtree.  And a branch node's value is the hash of the concatenated skip string and its non-null child node values.

Surely if you use skip strings, that will change the root hash, so everyone would have to agree on the particular algorithm to use? Let's make several implementations, and evaluate which one is the best.

By way of an update, I am still making gradual progress on a BST implementation. Previously I had implemented a red-black balanced Merkle tree in Python and described the complete protocol so that any independent implementation should arrive at the same hash. Unfortunately that implementation was too slow and memory-intensive to get past block 130,000 or so of the blockchain. Since then, I have made a C++ implementation that's thousands of times faster (it runs within a factor of 3 of the Linux kernel LLRB tree, and although it doesn't compute hashes, the structure of the tree will be the same).
https://github.com/amiller/redblackmerkle/blob/llrb/c_redblack.hpp I have also sketched out a solution for I/O efficient validation of a batch of updates involving a priority queue. But I'm not prepared to make a full post on this yet.

I'm pointing all this out so that you can't say no progress is being made! Until someone from the 'trie' camp catches up, the simplest solution is a BST since some code for this already exists.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
vip
Activity: 1386
Merit: 1140
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
I know it's been a while since this topic has been visited, but I'd like to make a proposal here:

Rather than settle on the "best" way to implement this tree, how about settling on the "simplest", so that the community can catch the bug and start cranking their brains on the best way to implement this tree, with the right balance of features and complexity, when the time comes to consider making it part of the protocol standard.

By "simplest", I mean implemented as follows:

1. A simple binary tree that starts out balanced, where maintaining the balance from block to block is optional.  A miner has the choice to rebalance the tree for the next block, but doesn't have to.  Not requiring the tree to stay balanced is meant to avoid pushing miners toward empty blocks just because they don't want the CPU burden or any delay associated with rebuilding the whole tree with each incoming transaction.

2. No ability to roll back.  Rolling back must be accomplished either by rebuilding the tree from scratch, or by starting with a known good backup and rolling it forward.  Backups are systematically deleted such that the backup burden grows O(log n) relative to the total block height.  More specifically, the backup of any given block's tree should have a lifetime of 2^n blocks where n is the number of contiguous zero bits at the end of the block height.  Block 0x7890's tree backup should last sixteen blocks because 0x7890 ends in four zero bits.  The backup for the tree of block 0x10000 should last 0x10000 blocks.
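
A small sketch of that retention rule in Python (the exact cutoff comparison is an assumption; the 2^n lifetime itself is as described above):

Code:
def backup_lifetime(height):
    """Lifetime, in blocks, of the tree snapshot taken at `height`:
    2**n where n is the number of trailing zero bits of the height."""
    assert height > 0                        # height 0 would be kept forever
    n = (height & -height).bit_length() - 1  # position of the lowest set bit
    return 1 << n

def retained_snapshots(current_height):
    """Heights whose snapshots are still kept when the chain is at current_height."""
    return [h for h in range(1, current_height + 1)
            if h + backup_lifetime(h) > current_height]

assert backup_lifetime(0x7890) == 16            # four trailing zero bits
assert backup_lifetime(0x10000) == 0x10000      # sixteen trailing zero bits
print(len(retained_snapshots(0x7890)))          # O(log n) snapshots in total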

Now, why would I suggest a methodology that clearly avoids taking advantage of features that would make a "better mousetrap" so to speak?  Here are the most prominent reasons:

1. At some point, the Bitcoin community may come to a consensus that we should redefine a valid Bitcoin block to incorporate the root hash of a valid meta-tree rather than having it be optional coinbase clutter.  Until then, this is merely an optional performance-enhancing and usability-enhancing feature without any hard commitment to a standard.  We should help people understand the base case for what it is, and then move on to derivative cases that do the job better.

2. There is serious value in simplicity.  The more needlessly complex things are, the higher the barrier to entry for new developers of Bitcoin software.  We are at a point where we need more developers on board than we need the disk space saved by what would be (for the current block height and all block heights for the foreseeable future) about 20 backups of the meta tree on each user's disk.  Besides being much more difficult for the average developer to understand, requiring a tree that must do a moonwalk during a rare edge case (one that is very difficult for a developer to reproduce and test) creates an exploitable danger that implementations may fail to do the right thing exactly when the right thing is needed most.

3. The Bitcoin community could use the lessons learned in a basic "proof of concept" implementation of this without being committed to any specific methodology for optimizing it.  This will help the community at large understand which use cases evolve from the availability of the tree, and then come to an intelligent consensus as to what features and attributes of a meta tree are the most valuable and which bargains of complexity versus power are worth making.
vip
Activity: 1386
Merit: 1140
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
You are surely aware that, with growth, incoming transactions can start to number in the hundreds and thousands per second.
Can it really? Current limitations on block sizes limit the number of transactions to no more than a thousand per block, or a few per second. Changing those limitations would result in a hard-fork.

Or am I missing something?

That's true with regard to the current limitations, but one day we'll be thinking about how to scale past them.  It would be preferable for that scaling to be able to happen without also having to rethink the tree algorithm because the chosen candidate ended up being the slowest link in the chain.
full member
Activity: 126
Merit: 110
Andrew Miller
I don't follow your logic.  Did I miss something?  Can you elaborate?

Yes! You're missing the step where you take a hash of the children's hashes, not just their values. For example, suppose your root node is fully stocked and contains "0123456789". You seem to be saying its hash would be H("0123456789"). That tells you the values of the children, but it does not let you look up (and validate) the rest of the data in the next node. For a hash tree to work, you need to take the hash-of-a-hash at each step.
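
A tiny sketch of the difference, with toy values and tags that are not any proposed encoding:

Code:
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

left_value, right_value = b"0123", b"456789"
left_hash  = H(b"leaf" + left_value)
right_hash = H(b"leaf" + right_value)

broken_root = H(left_value + right_value)   # commits to the values, but not to the children's subtrees
root        = H(left_hash + right_hash)     # hash-of-hashes: children can themselves be authenticated nodes

# A lite client holding only `root` can verify left_value given just right_hash:
assert H(H(b"leaf" + left_value) + right_hash) == root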
legendary
Activity: 905
Merit: 1012
You are surely aware that, with growth, incoming transactions can start to number in the hundreds and thousands per second.
Can it really? Current limitations on block sizes limit the number of transactions to no more than a thousand per block, or a few per second. Changing those limitations would result in a hard-fork.

Or am I missing something?
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
I haven't been following the tree logic, as I only understand it to a certain extent, but I just wanted to make sure that, whatever tree structure is chosen, updating the tree to accommodate new incoming transactions is nearly always instantaneous no matter what the size.  You are surely aware that, with growth, incoming transactions can start to number in the hundreds and thousands per second.  If updating the tree for each incoming transaction is burdensome in resources, it will create a perverse incentive for miners to take shortcuts in transaction processing or to omit transactions from blocks.  I'm sure you already know this; I just wanted to make sure it weighs somewhere decent on the requirements list for whatever structure is eventually chosen.

Inserting, deleting or changing nodes only involves modifying the branch on which that node sits.  So you find the node to be modified, then you recalculate its parent, then recalculate that node's parent, and so on up to the root node.  Given the structure of the tree, that's likely to be only 4-6 node recalculations per update.  Each of those nodes might be dense, meaning you might be concatenating a couple hundred values, but the total computation is completely independent of tree size (well, it will always be less than 20 nodes to update, though the actual number will asymptotically increase towards 20 as the tree gets bigger [in reality it will probably never exceed 8 even if 100% of the world switched to BTC]).
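
For anyone following along, here is a minimal sketch of that update pattern on an array-backed complete binary Merkle tree (a toy layout chosen for brevity, not Armory's): only the hashes on the leaf-to-root path are recomputed, so the work grows with the depth of the tree, not with the number of leaves.

Code:
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def build(leaves):
    """Complete binary Merkle tree in an array; node i has children 2i+1 and 2i+2.
    Assumes len(leaves) is a power of two, purely to keep the sketch short."""
    n = len(leaves)
    nodes = [None] * (n - 1) + [H(x) for x in leaves]
    for i in range(n - 2, -1, -1):
        nodes[i] = H(nodes[2 * i + 1] + nodes[2 * i + 2])
    return nodes

def update_leaf(nodes, leaf_index, new_value):
    n = (len(nodes) + 1) // 2
    i = (n - 1) + leaf_index
    nodes[i] = H(new_value)
    while i > 0:                        # walk the leaf-to-root path: ~log2(n) hashes
        i = (i - 1) // 2
        nodes[i] = H(nodes[2 * i + 1] + nodes[2 * i + 2])

leaves = [bytes([b]) for b in range(8)]
nodes = build(leaves)
old_root = nodes[0]
update_leaf(nodes, 3, b"new txout")
assert nodes[0] != old_root             # one leaf changed, only three internal hashes recomputed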
vip
Activity: 1386
Merit: 1140
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
I haven't been following the tree logic, as I only understand it to a certain extent, but I just wanted to make sure that, whatever tree structure is chosen, updating the tree to accommodate new incoming transactions is nearly always instantaneous no matter what the size.  You are surely aware that, with growth, incoming transactions can start to number in the hundreds and thousands per second.  If updating the tree for each incoming transaction is burdensome in resources, it will create a perverse incentive for miners to take shortcuts in transaction processing or to omit transactions from blocks.  I'm sure you already know this; I just wanted to make sure it weighs somewhere decent on the requirements list for whatever structure is eventually chosen.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Ah but etotheipi, let's focus on the last image, since that is the first one to mention hashes.

You have taken the hash of the values of the child nodes, but not the hash of the children's hashes. You cannot securely traverse this tree from the root hash.

I don't follow your logic.  Did I miss something?  Can you elaborate?
full member
Activity: 126
Merit: 110
Andrew Miller
Ah but etotheipi, let's focus on the last image, since that is the first one to mention hashes.

You have taken the hash of the values of the child nodes, but not the hash of the children's hashes. You cannot securely traverse this tree from the root hash.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Pictures!

I love Inkscape.  Here is a visual explanation of tries, Patricia trees, Patricia-de la Briandais hybrid trees, and the technique for getting "hash values" out of each node.  I need a better name than "hash values", which is kind of vague... "signature" of the node?  "shape" of the node?  Ehh, whatever, enjoy the pictures!


A basic trie on base-10 keys (which in this proposal would be base-256, and either address strings or TxOut values)



Patricia trees are the same, but with level compression.  This is critical since the tries will be very sparse compared to the 2^160 or 2^256 keys they can theoretically hold.  One remaining problem is that all branch nodes still store 256 pointers, even when 255 of them are NULL.



The hybrid tree is what I would propose.  The pointer arrays are converted to linked lists, and only non-null children are stored.



Since this data structure would also be used for the TxOut subtrees, it should have a compact representation for one-node trees.  Indeed it does:



The "values" of each leaf is just the root of the sub tree, and the value of each branch is the skip-string concatenated with all its children's values.


