I have started to experiment with this idea.
My goal is to add this hash tree to Electrum.
Each "numChildren" value (after the SumValue) can be exactly one byte, because you never have more than 256 ptrs, and each child pointer is also exactly 1 byte. If you want to jump to a particular child, for instance, you are at node "11" and want to go the child at 3, you simply do iter->Seek("11"+"3") and it will skip "1122" and put the iterator right at "1137", which is the first database value >= "113".
Pointers can also be encoded as bits, using a fixed-size 32 bytes vector (assuming 256 pointers).
Of course variable-length storage would be more efficient, because most nodes will have sparse children, but I don't know if it is really worth the effort.
Indeed, keys will take up to 20 bytes, and node hashes will take 32 bytes anyway, so we're not adding an order of magnitude by using 32 bytes.
Furthermore, you might be able to get away without even any pointers! You might just store the node/leaf hash and value, and know about children after the fact, simply by continuing your iteration. You are at IterA, and IterB=IterA.Next(). You know that IterB is a child node of IterA because IterB.key().startswith(IterA.key()). That's stupid simple.
So, you know what level you're at simply by looking at Iter.size()
So, you know that you are a child because IterNext.key().startswith(IterPrev.key()).
If the previous check fails, you know you finished traversing that branch and you can update IterPrev.
Though, there may be something I'm missing that would still require you to store the pointers. But it's still a lot better than storing 6-8 bytes per pointer, which was originally where I thought the bulk of the data was originally going to end up.
You can indeed do it without pointers, but iterating to find the children of a node can be very long.
And you will need to find the children of a node everytime you update its hash.
My point was you don't need any pointers at all, and finding the children isn't actually that long since the database is efficient at these kinds of operations. If you are node "ABCD" and want to go to pointer P, you don't need a pointer to know how to get there. Just iter->Seek("ABCDP") and you'll end up at the first elemtent equal to or greater than it. At the deeper levels, the iterators will efficiently seek directly in front of themselves, and may already have your next target in cache already.
If it starts with "ABCD" you know you are still in a child of ABCD, and if not, you know you are in a parallel branch and can finish processing the "ABCD" node. Yes, there may be a lot of seek operations, but with the built-in optimizations, there's a very good chance that they will be fast, and because it's a PATRICIA tree, you'll rarely be doing more than 6 such operations to get the branch updated.
On the other hand, I haven't thought this through thoroughly. I only know that it seems like you can avoid the pointers altogether which I was expecting to make up the bulk of the storage overhead. i.e. each node currently will only hold a sum (8 bytes) and its own hash (32 bytes). If you need the pointers, you could end up 256, 8-byte pointers per node in addition to it, which is actually quite heavy at the higher, denser levels.
After lots and lots of discussion and debate, I believe that the address index should be maintained as a trie-like structure.
It's possible to create a transaction that has no address at all. What is considered the address in this case?
There's a liitle room for negotation on this topic, but ultimately and "address" is a TxOut script. In a totally naive world, your "addresses" would just be the exact serialization of the TxOut script -- so a 25-byte "address" for each standard, Pay2Hash160 script. Or 35 or 67 for pay-to-public-key scripts. 23 bytes for a P2SH script. And then anything that is non-standard would be simply serialized raw.
However, I don't like this, because a single address ends up with multiple equivalent representation. Even though pay-to-public-key scripts are rare, there are addresses that use both (such as multi-use addresses that were used for mining and regular transactions). Even though it's rare, you'd have to ask your peers for 2 different scripts per address (the Pay2Hash160 and PayToPubKey scripts). I'd almost prefer making special cases for these addresses, given that they are so standard and fundamental to Bitcoin transactions.
So, I would vote for:
{Pay2Hash160, Pay2PubKey65, Pay2PubKey33} all be serialized as 21 bytes: 0x00 + Hash160. Any Pay2PubKey variants will be bundled under that single key.
{P2SH} scripts will be serialized as 21 bytes: 0x05 + Hash160{script}.
{EverythingElse} Will simply be the raw script.
One problem I see with this is that it doesn't make it clean to adopt new standard scripts, without reconstructing the database in the future. I suppose it wouldn't be the end of the world, but we also don't want to make an inflexible
protocol decision. This isn't just personal preference for storing address/scripts, it's actually describing the authenticated structure of the Reiner-tree. So if we would be adding a new std script type, and we'd want a short form of it to store in the DB, we'd have to update the "protocol". If this had been adopted already, that would be a hard fork. If we just do raw scripts all around, this isn't really a problem, except that we may have to ask for extra branches to make sure we get all possible variants of a single public key.
@ ThomasV
I noticed you asked something about "SumValue" before. I don't know if you got the question answered, but the idea was to recursively store the sum-of-value of each sub-branch, and have it authenticated along with the hashes. Quite a few users, including gmaxwell (who was original only luke-warm on this whole idea), determined that was an extremely valuable addition to this spec to deal with miners who lie about their reward knowing that the network is made up almost entirely of lite-nodes who have no way to determine otherwise. But those lite nodes know what the
total coins in circulation should be, and thus would only have to look at the root-sum-value to determine if someone cheated.
I don't know if I completely captured that concept. I'm sure someone like gmaxwell or d'aniel can jump in and explain it better. But it is an easy add-on to the original idea. And also makes it possible to simply query your balance without downloading all the raw TxOuts (though, if you are using each address once, that doesn't actually save you a lot).