I was thinking that, to reduce the number of hashes a lightweight client needs to download when proving inclusion, you could replace the concatenation of the children's values in the formula for the value of a branch node B with their Merkle root. A lightweight client would then need only log2(256) = 8 hashes per node to prove inclusion in the worst case, instead of all 256 child values.
The downside is having to store more hashes (twice as many in the worst case), but it actually appears to make updates faster, presumably because fewer hashes need to be read from memory. The reason is that every node value updated during an insertion, deletion, or leaf update, except for the node being attached to or removed from, has only a single leaf of its per-node Merkle tree change, so there are no unbalancing issues and only a single Merkle branch needs to be recomputed.
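A minimal sketch of what I mean, in Python (all the names are mine, and the double-SHA256 plus Bitcoin-style odd-level padding are illustrative choices, not anything fixed above):

import hashlib

def H(b):
    # Double-SHA256; the choice of hash is illustrative.
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(leaves):
    """Root of a binary Merkle tree over a branch node's child hashes.
    Odd-length levels duplicate the last entry (Bitcoin-style padding)."""
    level = [H(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_branch(leaves, index):
    """The log2(n) sibling hashes proving leaves[index] is under the root."""
    level = [H(x) for x in leaves]
    branch = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        branch.append(level[index ^ 1])       # sibling at this level
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return branch

children = [bytes([i]) * 32 for i in range(256)]  # stand-in child hashes
proof = merkle_branch(children, 200)
assert len(proof) == 8                            # log2(256) = 8 hashes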
That's a very good idea, as I've been concerned about how to reduce the number of hashes that need to be transferred for a lite-node to get its balance. I had spent some time looking for "associative" hash functions. If they existed, the transfers would be tiny: if a node is known to have Hash(A|B|C|D|E|F|G|H|I|J), and you want to prove inclusion of G, you simply supply {Hash(A|B|C|D|E|F), G, Hash(H|I|J)}. So you would need to transfer at most 3 hashes per level. Unfortunately, the only associative hash function I found was based on matrix multiplications, and was most definitely not cryptographically secure.
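For what it's worth, the matrix trick works mechanically because matrix multiplication is associative. Here is a toy version of that kind of construction, my own sketch, and (as said) nowhere near cryptographically secure:

# Toy associative hash: map each bit to one of two 2x2 matrices and take
# the product mod P. Since matrix multiplication is associative,
# hash(x + y + z) == hash(x) * hash(y) * hash(z).  NOT secure.
P = 2**31 - 1   # modulus choice is arbitrary here

def matmul(a, b):
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2)) % P
                       for j in range(2)) for i in range(2))

A = ((1, 1), (0, 1))
B = ((1, 0), (1, 1))
I = ((1, 0), (0, 1))

def ahash(data):
    h = I
    for byte in data:
        for bit in range(8):
            h = matmul(h, A if (byte >> bit) & 1 else B)
    return h

x, y, z = b"ABCDEF", b"G", b"HIJ"
# Prover sends ahash(x), y, ahash(z); verifier recombines and compares:
assert matmul(matmul(ahash(x), ahash(y)), ahash(z)) == ahash(x + y + z)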
So, I welcome the idea of per-node Merkle trees. I wonder if the trees really need to be stored, though. Hashing is so fast that recomputing the Merkle tree on-demand may be fine, especially for the sparse, lower-level nodes. Of course, the top couple of levels could/should be cached, since they'll get hit all the time and are pretty dense.
However, I'm not sure I agree with (or understand) the point about updating only single Merkle branches. Because the linked-lists at each node in the tree are sorted, the deletion/insertion of a node is likely to occur in the middle of the list and shift the parity of the pairings. Say you started with {A,C, D,H, J,M, Q,Y} -- the bottom level pairs (A,C), (D,H), (J,M) and (Q,Y). Now, you insert E and the list looks like: {A,C, D,E, H,J, M,Q, Y,Y} (with Y duplicated to pad the odd-length level), which means that all branches to the right of the insertion or deletion point need to be recalculated.
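A quick sketch of that pairing shift, using the example list above (the pairs helper is hypothetical):

def pairs(items):
    """Bottom-level Merkle pairings, duplicating the last item when odd."""
    if len(items) % 2:
        items = items + [items[-1]]
    return [(items[i], items[i + 1]) for i in range(0, len(items), 2)]

before = list("ACDHJMQY")
print(pairs(before))  # [('A','C'), ('D','H'), ('J','M'), ('Q','Y')]

after = sorted(before + ["E"])
print(pairs(after))   # [('A','C'), ('D','E'), ('H','J'), ('M','Q'), ('Y','Y')]

Every pair from (D,E) onward differs from before, so every branch to the right of the insertion point has to be rehashed.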
On the other hand, you may be recomputing sparse nodes all the time anyway (meaning this recomputation happens regardless), and the dense higher-level nodes could be batch-updated -- you know that a given block is going to modify 77 of your 256 branches at the top level, so you don't need to recompute the tree until all 77 children are complete.
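Here's roughly how I picture the batching, reusing merkle_root from the sketch above (the class and method names are mine, and 77-of-256 is just the example figure):

class BatchedBranch:
    """Top-level branch node that defers recomputing its Merkle root until
    every child touched by the current block has been updated."""
    def __init__(self, children):
        self.children = list(children)     # the 256 child hashes
        self.pending = set()               # children this block will modify

    def begin_block(self, touched):
        self.pending = set(touched)        # e.g. 77 of the 256 branches

    def update_child(self, index, new_hash):
        self.children[index] = new_hash
        self.pending.discard(index)
        if self.pending:
            return None                    # still waiting on siblings; skip work
        return merkle_root(self.children)  # recompute once, when all are in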
That's an interesting paper, though I have trouble envisioning how they could be made deterministic for the purposes of authenticated structures. Maybe I just need to read the paper a little closer. Unfortunately, it's late, so I will have to come back to this later.