A
proposal to help SPV nodes verify blocks is to commit the root of an (unbalanced) Merkle tree containing all unspent (and spendable) transaction outputs.
It is possible to create proofs of validity for each modification of the set. The proof would prove that inserting an entry into a tree with a root of X will give a new tree with a root of Y. There would also be proofs for removing entries.
This means that each modification to the UTXO set could have a proof associated with it. A block could contain transactions and proofs. Each proof would be linked to an input or output of a transactions. For each input, there would be a proof of removal from the set and for each output, there would be a proof of insertion into the set. Unspendable outputs would not be inserted into the set. (This would require a formal definition of unspendable, probably outputs which spend to "OP_RETURN ...").
Thinking out loud about the proofs, this is what I was thinking.
Verifiable Merkle SetThere are 2 kinds of nodes, leaf and branch nodes. Each node has a digest associated with the node.
Branch NodesThese nodes have exactly 2 children. The data storage for these nodes is as follows.
int256 digest;
MerkleNode *parent;
MerkleNode *left;
MerkleNode *Right;
uint16 height;
Assuming 4 byte pointers, each node requires 46 bytes (or 58 for 8 byte pointers).
The digest for branch nodes is hash( 0x00 | left->digest | right->digest | node->height ).
The height value counts from leaf level. 2 leaves that differed only in the LSB of their entry_digest would have a height 1 parent. The highest height is 256. This is for leaves which differ in the MSB of their entry_digest.
The leading 0x00 in the array passed to the hash function is to indicate a branch node.
Leaf NodesThese nodes are the leaves of the tree. They have no children. The data storage for these nodes is as follows.
MerkleNode *parent;
int256 entry_digest;
The digest for the node is hash( 0x01 | hash(entry_value) ) but only hash(entry_value), i.e. the entry_digest, is stored in the tree. The node's digest can be recomputed, as needed, saving RAM.
The leading 0x01 indicates a leaf node. Leaf and branch nodes can be distinguished from each other due to the prefix.
Root DigestIf the set is empty, the digest is defined as all zeroes. Otherwise, it is defined as the digest of the root node of the tree.
Proof of ExistenceA proof of existence proves that a particular entry is in the set. This proof is a byte array defined as follows.
uint16 length;
uint16[length] heights;
int256[length] path_digests;
The proof is a Merkle branch that starts at the leaf and follows the path to the root. It includes the height values for the path.
The height value is used to determine which bit in the entry_digest to use to determine if the path goes left or right at that point.
Since the height value is included as input into the branch node's hash, it is hash sealed.
If the root digest is zero, then it means that there are no entries in the set, so a proof of existence always fails.
if (root_digest == 0) {
return false;
}
digest = hash(0x1 | entry_digest);
for (i = 0; i < length; i++) {
boolean right_child = entry_digest & (((uint256) 1) << heights[i]) != 0;
if (right_child) {
digest = hash(0x01 | path_digests[i] | digest | height);
} else {
digest = hash(0x01 | digest | path_digests[i] | height);
}
}
return digest == root_digest;
Removal ProofsIt is possible to compute the new root digest from the existence proof of the leaf node to be removed. The effect of node removal is that the leaf node and the node's (branch node) parent are removed from the tree. The node's sibling replaces the node's parent in the tree.
If the path length is zero, then the node is the only node in the tree.
< verify the existence proof >
if (length == 0) {
return new_root_digest == 0; // tree should be empty after removal
}
digest = path_digests[0]; // Sibling's digest
for (i = 1; i < length; i++) {
boolean right_child = entry_digest & (((uint256) 1) << heights[i]) != 0;
if (right_child) {
digest = hash(0x01 | path_digests[i] | digest | height);
} else {
digest = hash(0x01 | digest | path_digests[i] | height);
}
}
return new_root_digest == digest;
Insertion ProofsAs with removal proofs, it is possible to compute the new root digest from an existence proof. At each branch, the proof should choose its path based on the entry_digest of the entry being inserted. The digest of this leaf is called the leaf_entry_digest.
Since the entry_digest is used to select the path, the leaf_entry_digest must have the same bit as the entry_digest for the height of each branch node along the path.
All entry_digests of leaf nodes that are decendants of a branch node have the same most significant bits, begining at a height one greater than the branch node's height. This is inherent due to the tree being a binary tree.
This means that the most significant bits of the leaf_entry_digest can be used to determine the prefix for each branch node. The bits starting at one higher than the height of the branch node indicate the prefix for that branch node.
The new branch node must be inserted directly below the lowest branch node that has a prefix that matches the entry_digest.
Consider the insertion of a node with an entry_digest of 01011010b (assuming the hash function produced single byte digests). The top branch node in the tree has a height of 7. Since bit 7 is one (LSB is bit 1) in the entry_digest, the right path is used. This leads to a branch node with a height of 3. Since bit 3 is 0, the left path is used. The next node is the leaf. It has a leaf_entry_digest of 01110000. This is guaranteed to match the entry digest for bits 7 and 3 but can be anything else for the other bits.
The prefix for the height 7 branch node is 0 (all bits in the leaf_entry_digest above bit 7). This matches the entry_digest, so the new node will be a descendant of that branch node.
The prefix for the height 3 branch node is 01110. This doesn't match the MSBs of the entry_digest. This means that the entry_digest is not a descendant of that branch node.
The new branch node must be inserted as a child of the height 7 branch node instead of the height 3 branch node.
Comparing the 2 digests:
01011010 (entry_digest)
01110000 (leaf_entry_digest)
The highest bit that is different is the bit with height 6. This means that the new branch node has a height of 6 (which is between heights 3 and 7).
A new leaf node is produced for the new entry. It is the left child of the new branch node. The height 3 branch node is the right child.
The pointer from the height 7 branch node that pointed to the height 3 branch node is redirected to the new branch node.
< verify the existence proof first >
if (old_root_digest == 0) {
if (length != 0) {
return false; // proofs must be zero length when inserting into an empty tree
}
return new_root_digest == entry_digest;
}
// Find the highest bit that is different between the leaf_entry_digest and the entry_digest
uint16 insertion_height = find_highest_bit_difference(leaf_entry_digest, entry_digest);
// Compute the branch from the leaf to below the insertion point
int256 digest = hash(0x01 | leaf_entry_digest);
int i;
for (i = 0; i < length; i++) {
if (heights[i] > insertion_height) {
break;
}
boolean right_child = entry_digest & (((uint256) 1) << heights[i]) != 0;
if (right_child) {
digest = hash(0x01 | path_digests[i] | digest | height);
} else {
digest = hash(0x01 | digest | path_digests[i] | height);
}
}
int256 new_digest = hash(0x01 | entry_digest);
boolean right_child = entry_digest & (((uint256) 1) << insertion_height) != 0;
if (!right_child) {
digest = hash(0x01 | new_digest | digest | height);
} else {
digest = hash(0x01 | digest | new_digest | height);
}
// Finish the path from after the insertion point to the root
for (; i < length; i++) {
boolean right_child = entry_digest & (((uint256) 1) << heights[i]) != 0;
if (right_child) {
digest = hash(0x01 | path_digests[i] | digest | height);
} else {
digest = hash(0x01 | digest | path_digests[i] | height);
}
}
return new_root_digest == digest;
ReplacementReplacement proofs contain a deletion and insertion proof, since the new leaf node cannot be placed in the same location. This doubles the proof size.
In the reference client, the UTXO set is compressed. All the outputs for a single transactions are stored together. This could be supported by replacement. When an output is spent, the old version of the transaction's entry is removed and a new one is added with that output marked as cleared.
This trades off RAM verses hard drive space. The proofs associated with a block would need to be stored, but need not be stored in RAM. If replacement is used, then disk space is doubled, but RAM usage is reduced.
The number of branch nodes in a tree is equal to the number of leaves (minus 1). This means the amount of RAM would be roughly doubled (assuming the data associated with each leaf is around the same as the size of a digest).
In the reference client, the UTXO set is stored in a hash map that maps a 32 byte txid to a compact representation of the transaction outputs. Hard coding that compact represenation as a network rule might be overly limiting. If a new output type became popular, it would not compact it much.