This is a concrete proposal (and reference implementation) for a Merkle-tree of unspent-outputs (UTXOs). Eventually, the idea is to include the root hash of the current UTXO-tree in each block. This will enable full-validation (zero-trust) lite clients that use only a constant O(1) amount of storage - for example, a client that fits on a smart-card - and O(log N) amount of effort per txOut (where N is the total number of UTXOs at any given time).
**EDIT**: I included several changes from
a follow-up post that improve transaction-validation complexity.
OverviewValidating a transaction requires checking that every tx-input hasn't already been spent. One way of doing this is to iterate, for each tx-input, from the head of the chain backwards, looking for a transaction that spends it. This would be inefficient - instead, the standard client (including Peter Wuille's
UltraPrune variation) builds and maintains a database containing (at least) all the currently available coins. I propose to store the unspent outputs, as well as an additional "query-by-address" index, in a balanced Merkle search tree, (i.e., a RedBlack tree augmented with hashes). This will lead to several benefits:
- Full-validation nodes (especially lite clients) will only need to store (and trust) the O(1) hash of the current head block. Verification data can be queried as necessary from a (completely untrusted) Distributed Hash Table (DHT) (e.g., the gossip-layer network).
- Lite clients can "bootstrap" themselves immediately, just by selecting the correct head.
- Clients can securely query the DHT for a complete list of spendable (standard) UTXOs associated with a given address, without having to "rescan" the blockchain.
- Arbitrary-length forks can be tolerated without any need for a "reorg."
- "Proofs" of invalid transactions (e.g. a double spend) can be validated in O(log N) as well.
This proposal calls for a single additional hash to be placed in each block: the root hash of the UTXO Merkle tree. As such, it belongs in the
HardFork Wishlist. However, it could also be included in an (optionally merge-mined) alt-chain. This will require additional computation for validating a transaction - the addition will be a constant factor, since each processed txinput/txoutput already requires (even in UltraPrune), an O(log N) update to a BTree table of the UTXOs. For now, the proposal simply describes a stand-alone data structure that can be derived from existing blockchain data.
The normative specification for this protocol consists of A) a serialization (and digest) format for nodes in the balanced tree, where each leaf node contains a single UTXO, and branch nodes contain hashes of their children; and B) insertion/deletion algorithms that preserve the RedBlack invariants (balancing) of the tree, guaranteeing O(log N) worst-case effort per update.
The reference implementation is written in python, and consists of 1) a modular, extensible, and side-effect-free RedBlack tree (including unit tests) that can be augmented with arbitrary 'digest' functions, 2) a instance of this structure with the specified digest function, and 3) a script that processes actual blockchain data and applies each update (each txinput and txoutput) to the Merkle tree. This reference implementation is not at all optimized, but instead is intended to be easy-to-analyze and to provide a bit-correct gold-standard for the Merkle tree at each step.
This proposal is also a precursor to the "Eventually Zero-Trust Probabilistic Validation" scheme I described in
an earlier thread.
Related WorkByteCoin first proposed
including the hash of the current "balance sheet" in each block (i.e., the set of available unspent tx outputs), making it possible to validate transactions without needing to store the entire blockchain. Greg Maxwell later suggested
using a Merkle tree in order to query such this set more efficiently. DiThi later made
an equivalent proposal. Most recently, etotheipi suggested
using a self-balancing Merkle tree so that incremental modification to the structure would be more efficient. I scoured the academic literature, finding
a Usenix paper from 1996 by Moni Naor and Kobbi Nissim, suggesting that a RedBlack tree could be augmented with hashes instead of pointers. I wrote
a reference implementation of such an "Authenticated Data Structure". To my knowledge, it is the only available (open-source) implementation of a self-balancing hash tree.
Hash trees (of the non-balancing variety) are widely used, for example in
Git and in
Tahoe-LAFS.
Normative Protocol SpecificationA. Serialization and Digest FormatEach node in a RedBlack binary search tree node consists of 1) a key (the largest key in the left subtree), 2) left and right children (possibly null), and 3) a single bit (representing Red or Black), used for balancing 4) a value (leaf nodes only). The keys must be well-ordered in order to support efficient search queries to this structure. Two kinds of queries are supported (although more could be added): a search by (txid,index), as is needed to validate transactions, and also a search by (address,txid,index) which is useful for clients to query for the balance of an address.
There are two kinds of mappings stored in the tree:
1. UTXO Validation:
(txid, index) -> (hash of validation data)
The validation data is a subset of the transaction data corresponding to txid: a single scriptPubKey, as well as some necessary metadata (isCoinbase, and nHeight).
2. Address scanning:
(address, txid, index) -> ()
No separate value is used for this type, since all the information is contained in the unique key.
The two key types are indicated by a four-byte prefix, "UTXO" for (txid,index), and "ADDR" for (address,txid,index). The 'root-hash' is the digest of the root node of the tree. Each digest is the SHA256 hash of the serialized node. In the serialized format, the left and right subtrees are represented by their digests. The digest of an empty tree is represented by 32 zero bytes (i.e., the genesis hash). Leaf nodes and branch nodes are serialized differently, with a sentinel value ("." for leaves, "^" for branches) to distinguish between them.
Branch nodes are serialized as follows:
(color, dL, (("UTXO", txhash, index),()), dR):
[1 byte][1 byte][4 bytes][32 bytes][32 bytes][4 bytes][32 bytes]
"^" color "UTXO" H(Left) txhash index H(Right)
total: 1+1+4+32+32+4+32 = 106 bytes
(color, dL, (("ADDR", address, txhash, index), ()), dR):
[1 byte][1 byte][4 bytes][32 bytes][20 bytes][32 bytes][4 bytes][32 bytes]
[ "^" ][ color][ "ADDR"][ H(Left)][ addr][ txhash][ index][H(Right)]
total: 1+1+4+32+20+32+4+32 = 126 bytes
Leaf nodes are serialized with a different format, since the left/right digests are unnecessary, but the value must be included.
(color, (), (("UTXO", txhash, index), utxohash), ()):
[1 byte][1 byte][4 bytes][32 bytes][4 bytes][32 bytes]
"." color "UTXO" txhash index utxohash
total: 1+1+4+32+4+32 = 74 bytes
(color, (), (("ADDR", address, txhash, index), ()), ()):
[1 byte][1 byte][4 bytes][20 bytes][32 bytes][4 bytes]
"." color "ADDR" addr txhash index
total: 1+1+4+20+32+4 = 62 bytes
B. Balancing AlgorithmAny binary search tree defines three operations: insert, delete, and search. The RedBlack algorithm balances the tree during every insert and delete. Each operation is O(log N) worst-case. Since blocks will contain commitments to a particular root hash, and therefore a particular tree, it's necessary to normatively specify the particular balancing rules (there are potentially several valid ways to balance a RedBlack tree).
The particular balancing rules I have chosen were published by (Stefan Kahrs, "Red-black trees with types" Journal of functional programming, 11(04), pp 425-432, July 2001), and are fully specified in
this Haskell code. This algorithm is widely used, see for example
Kazu Yamamoto's LLRB trees,
this blog post by Matt Might, and
this blog post by Mark Chu-Carroll. These rules are based on Chris Okasaki's famous PhD thesis,
Purely Functional Data Structures, which provides four simple case-match rules sufficient for RedBlack tree
insertion (see "balance" in the Haskell specification). To implement
deletion, four additional cases must be handled (see "balleft" and "balright" in the Haskell specification).
For each transaction, first all the txinputs are deleted (in sequential order) from the tree, and second all the new txoutputs are inserted (also in sequential order). Each transaction is applied in sequential order within each block, and of course each block is applied in order as it appears in the chain.
Reference ImpementationThe reference implementation consists of three python files.
- redblack.py: an implementation of Stefan Kahrs' red-black tree, with an extensible "digest" augmentation. Two "modes" are provided for each of the three operations: the Record mode operates on a full tree (represented as nested python tuples), and produces a Verification Object (VO) (i.e., an O(log N) path consisting of just the nodes that were accessed during this operation), while the Replay mode takes in a root-hash and an untrusted VO and verifies the operation by recomputing the hash at each step.
- utxo_merkle.py: a specialization of RedBlack by definining by the digest/serialization protocol described above
- merkle_scan.py: a script that iterates through the entire blockchain, from the genesis block to the head, incrementally updating the merkle tree as it goes (requires Gavin Andresen's python bitcointools)
Additionally, some unit tests are provided, as well as a graphviz-based visualization tool that produces images like the ones in this post.
The reference implementation is not at all optimized. So far I have only run it on the first 140k blocks before getting impatient. It takes only a couple of minutes to run through the first 200k transactions. I will save discussion about optimized implementations for later posts in this thread (I invite you to help!), although I plan on updating this post at least with performance estimates as I go along.
Teaser Images (two-byte hashes)Before-and-after inserting the element 10 (full O(N) tree, Record mode)
Before-and-after inserting the element 10 (just the O(1) root hash and O(log N) untrusted VO, Replay mode)
What Now?This is not a new proposal; variations of this idea have been bouncing around since 2010. Because of its potential to improve the security and scalability of lite clients, it's important to develop this sooner rather than later. My goal is to push this process along by proposing a specific protocol and providing a reference implementation. This thread should be about the specification and implementation details (before making a "why bother?" post, please read etotheipi's thread or any the others I mentioned in Related Work).
This normative specification says very little about how the data should actually be stored or transmitted, so this may vary between implementations. Since the root node will change for every update, but leaf nodes will change infrequently, it would make sense to use a heuristic/probabilistic cache. The normative specification is very simple, so I expect other developers will have little difficulty making compatible implementations in other languages. The hardest part will likely to be implementing the RedBlack algorithm itself, which is why I tried to make my implementation simple to use as a guide and gold-standard. If you can think of a supporting feature I can build to help, (diagrams, explanations, example code, etc.), please ask for it in this thread and I will eagerly accommodate you.
I plan to gradually optimize the reference implementation, for both the O(1)-storage lite-clients and the full O(N)-storage (UltraPrune) clients, as well as to provide a scalable (untrusted) DHT that can serve tree data for any point in blockchain history. I will need to make at least some progress in this direction before I can offer any performance estimates. Here is
a profile/callgraph for the current implementation, suggesting that less than 2% of the time is spent computing hashes (most of the time is spent in my sketchy pattern-matching function, so that's the first thing to work on).
Both etotheipi and gmaxwell would prefer to use tries rather than balanced binary trees for reasons I consider uncompelling. Tries (at least ones with Merkle hashes) are likely to require more space and computation than the balanced trees. Etototheipi likes the insertion-order-independence of a trie, although this is irrelevant if each contains a commitment to a root hash (since you can find the previous root hash in the previous block). I don't know why gmaxwell prefers tries. Either way, it's their turn to provide a competing implementation!