I had a little revelation last night, while thinking about this proposal. In hindsight, it seems so simple. But hindsight is always 20/20, right? My thought process was: I've implemented RAM-based PATRICIA trees before, but what's a good way to do this on disk? For instance, I want to implement this in LevelDB, so I need some way to make LevelDB behave like a memory space.
One of the other issues with the PATRICIA/hybrid approach is the that there's a lot of data needed to store pointer lists, etc. It does have quite a bit of overhead. And you don't want to optimize it in such a way that limits the generic-ness of the structure. I'd prefer to maintain the textbook-generic-ness of this data-structure, and let implementations do their own optimizations as long as they can convert and reproduce the same calculations.
The revelation was that you don't need to replicate a memory space with abstract pointers to each trie-node and leaf. You can store them based on their node-prefix value, and the DB will auto-sort the values in depth-first-search order. For instance, let's take this structure:
All you need to is store everything by its prefix. Here's what the DB entries would look like:
Key -> Value
"" -> RootHash, SumValue, 3, "1", "3", "6"
"1" -> NodeHash, SumValue, 2, "1", "3"
"11" -> NodeHash, SumValue, 2, "2", "3"
"1122" -> LeafHash, Value
"1137" -> LeafHash, Value
"1342" -> LeafHash, Value
"3333" -> LeafHash, Value
"678" -> NodeHash, SumValue, 3, "0", "5", "9"
"6780" -> LeafHash, Value
"6785" -> LeafHash, Value
"6789" -> LeafHash, Value
Each "numChildren" value (after the SumValue) can be exactly one byte, because you never have more than 256 ptrs, and each child pointer is also exactly 1 byte. If you want to jump to a particular child, for instance, you are at node "11" and want to go the child at 3, you simply do iter->Seek("11"+"3") and it will skip "1122" and put the iterator right at "1137", which is the first database value >= "113".
Furthermore, you might be able to get away without even any pointers! You might just store the node/leaf hash and value, and know about children after the fact, simply by continuing your iteration. You are at IterA, and IterB=IterA.Next(). You know that IterB is a child node of IterA because IterB.key().startswith(IterA.key()). That's stupid simple.
So, you know what level you're at simply by looking at Iter.size()
So, you know that you are a child because IterNext.key().startswith(IterPrev.key()).
If the previous check fails, you know you finished traversing that branch and you can update IterPrev.
Though, there may be something I'm missing that would still require you to store the pointers. But it's still a lot better than storing 6-8 bytes per pointer, which was originally where I thought the bulk of the data was originally going to end up.
Even better, you don't really have to implement the minutiae of the PATRICIA tree, because it's kind of done automatically by the nature of a key-sorted database. The database inserts everything in the correct place for you, and it just so happens that tries and PATRICIA trees get iterated the same way, without having to store structure information. On the contrary, a depth-first search on a BST will also be sorted this way but you have to store data at each node about the local structure of the tree, and update all the nearby nodes if there's a rebalance. Since the PATRICIA tree has a deterministic structure based solely on the inclusive set, you can insert and remove nodes without any extra seek/updates, and natural iteration over the dataset will result in the right answer as if you implemented a full PATRICIA tree.