
Topic: Ultimate blockchain compression w/ trust-free lite nodes - page 8. (Read 87939 times)

legendary
Activity: 2142
Merit: 1010
Newbie
Could anyone let us know the current progress on implementing this idea?
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
For me, this is the most exciting thread on this forum.

:)  I've actually received some pressure to start implementing this myself, with some urgency.  I have resisted solely because I'm totally swamped with other things, and expect I'll get to it in about six months.  I felt guilty about that, but I have some personal/selfish reasons for the delay.

But now I don't feel so bad.  It seems like, once every month, I have some revelation about how this could be improved, or solving some aspect of it that I wasn't sure how to solve earlier.  Now I am comfortable with the downloading from unsynchronized peers and/or having multiple blocks generated while downloading that data, and I feel like I have a really good way to encode this with high-space efficiency.  This is making it all the easier for me to imagine implementing this, when I finally have time.  Or maybe someone else will.



Talking about the non-sync'd downloading (link in the previous paragraph), I just wanted to add a comment:  I noticed that LevelDB has read-snapshots, and it looks like other DB engines do, too.  (Do most of them?)  It certainly would simplify this even further.  For instance, consider that I ask a node to send me some branch of the tree, and two new blocks come in after the download starts, causing that peer to update the tree while it is still sending it to me.  In a completely naive system, I would end up with internally inconsistent data, and no good way to get a consistent copy short of getting lucky and having no new blocks arrive during the download.

However, if you are using a read snapshot, you can essentially freeze the DB state in time, so that you can read its contents without worrying about any updates applied since it was frozen.  You just throw away the snapshot when you're done.  I assume it does this efficiently, essentially by storing the differences accumulated since the snapshot was taken and rewinding them when you read through the snapshot.  This makes everything even more feasible.
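
For reference, LevelDB's snapshot API is exactly this shape.  A minimal sketch (database path and keys are made up) of reading a frozen view while the live tree keeps changing:

Code:
#include <cassert>
#include <string>

#include "leveldb/db.h"

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status st = leveldb::DB::Open(options, "/tmp/utxo-demo", &db);
  assert(st.ok());

  db->Put(leveldb::WriteOptions(), "1122", "old-leaf");

  // Freeze a consistent view: reads through 'snap' ignore later writes.
  const leveldb::Snapshot* snap = db->GetSnapshot();

  // New blocks keep updating the live tree while the branch is served...
  db->Put(leveldb::WriteOptions(), "1122", "new-leaf");

  // ...but the frozen view still returns the pre-update value.
  leveldb::ReadOptions ro;
  ro.snapshot = snap;
  std::string value;
  st = db->Get(ro, "1122", &value);
  assert(st.ok() && value == "old-leaf");

  db->ReleaseSnapshot(snap);  // throw the snapshot away when done
  delete db;
  return 0;
}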
legendary
Activity: 1078
Merit: 1003
For me, this is the most exciting thread on this forum.
legendary
Activity: 1232
Merit: 1094
I had a little revelation last night, while thinking about this proposal.  In hindsight, it seems so simple.  But hindsight is always 20/20, right?  My thought process was:  I've implemented RAM-based PATRICIA trees before, but what's a good way to do this on disk?  For instance, I want to implement this in LevelDB, so I need some way to make LevelDB behave like a memory space.

Assuming there are 4 billion (2^32) UTXOs, the tree will be dense for roughly the first 32 bits.  Each leaf node then holds the remaining 256 - 32 = 224 bits of its key.

If you just store all the transaction hashes in the tree in full, then you need 32 bytes per entry, instead of 28 bytes, so you aren't really saving much.

Having a fixed 32 bytes per entry would mean the file has fixed-width entries, which would make seeking easier.

The only exception is multiple outputs of the same transaction.  Each leaf could have a list of outputs and how much coin is in each.  This breaks the fixed field length, though.

The UTXO-id would be {tx-hash, out-index, value}.

You effectively store

{tx-hash, total-value}, {out-index, value}, .... {out-index, value}, {end-delimiter}
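
A rough sketch of that variable-length leaf record; the field widths, native-endian writes, and the 0xffffffff end-delimiter are illustrative choices, not part of the proposal:

Code:
#include <cstdint>
#include <string>
#include <vector>

struct Output { uint32_t index; uint64_t value; };  // value in satoshis

std::string serializeLeaf(const std::string& txHash,        // 32-byte hash
                          const std::vector<Output>& outs) {
  uint64_t total = 0;
  for (const Output& o : outs) total += o.value;

  std::string rec = txHash;                                  // tx-hash
  rec.append(reinterpret_cast<const char*>(&total), 8);      // total-value
  for (const Output& o : outs) {                             // {out-index, value}...
    rec.append(reinterpret_cast<const char*>(&o.index), 4);
    rec.append(reinterpret_cast<const char*>(&o.value), 8);
  }
  const uint32_t end = 0xffffffff;                           // end-delimiter
  rec.append(reinterpret_cast<const char*>(&end), 4);
  return rec;
}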
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
I had a little revelation last night, while thinking about this proposal.  In hindsight, it seems so simple.  But hindsight is always 20/20, right?  My thought process was:  I've implemented RAM-based PATRICIA trees before, but what's a good way to do this on disk?  For instance, I want to implement this in LevelDB, so I need some way to make LevelDB behave like a memory space.

One of the other issues with the PATRICIA/hybrid approach is that there's a lot of data needed to store pointer lists, etc.  It does have quite a bit of overhead.  And you don't want to optimize it in such a way that limits the generic-ness of the structure.  I'd prefer to maintain the textbook-generic-ness of this data structure, and let implementations do their own optimizations as long as they can convert and reproduce the same calculations.

The revelation was that you don't need to replicate a memory space with abstract pointers to each trie-node and leaf.  You can store them based on their node-prefix value, and the DB will auto-sort the values in depth-first-search order.  For instance, let's take this structure:

[image: example trie; its nodes correspond to the keyed entries below]

All you need to do is store everything by its prefix.  Here's what the DB entries would look like:

Quote
Key -> Value
""     -> RootHash, SumValue, 3, "1", "3", "6"
"1"    -> NodeHash, SumValue, 2, "1", "3"
"11"   -> NodeHash, SumValue, 2, "2", "3"
"1122" -> LeafHash, Value
"1137" -> LeafHash, Value
"1342" -> LeafHash, Value
"3333" -> LeafHash, Value
"678"  -> NodeHash, SumValue, 3, "0", "5", "9"
"6780" -> LeafHash, Value
"6785" -> LeafHash, Value
"6789" -> LeafHash, Value

Each "numChildren" value (after the SumValue) can be exactly one byte, because you never have more than 256 ptrs, and each child pointer is also exactly 1 byte.  If you want to jump to a particular child, for instance, you are at node "11" and want to go the child at 3, you simply do iter->Seek("11"+"3") and it will skip "1122" and put the iterator right at "1137", which is the first database value >= "113".


Furthermore, you might be able to get away without even any pointers!  You might just store the node/leaf hash and value, and know about children after the fact, simply by continuing your iteration.  You are at IterA, and IterB=IterA.Next().   You know that IterB is a child node of IterA because IterB.key().startswith(IterA.key()).   That's stupid simple.  

So, you know what level you're at simply by looking at Iter.key().size()
So, you know that you are a child because IterNext.key().startswith(IterPrev.key()).
If the previous check fails, you know you finished traversing that branch and you can update IterPrev.

Though, there may be something I'm missing that would still require you to store the pointers.  But it's still a lot better than storing 6-8 bytes per pointer, which is where I originally thought the bulk of the data was going to end up.

Even better, you don't really have to implement the minutiae of the PATRICIA tree, because it's kind of done automatically by the nature of a key-sorted database.  The database inserts everything in the correct place for you, and it just so happens that tries and PATRICIA trees get iterated the same way, without having to store structure information.  By contrast, an in-order walk of a BST is also sorted this way, but you have to store data at each node about the local structure of the tree, and update all the nearby nodes if there's a rebalance.  Since the PATRICIA tree has a deterministic structure based solely on the included set, you can insert and remove nodes without any extra seeks/updates, and natural iteration over the dataset will produce the right answer as if you had implemented a full PATRICIA tree.
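
A sketch of that pointer-free traversal: the trie structure is recovered purely from the sorted keys, with a stack of ancestor prefixes standing in for child pointers:

Code:
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "leveldb/db.h"

// Print the trie with indentation. A key is a descendant of every earlier
// key it starts with; a failed prefix check means that branch is finished.
void depthFirstWalk(leveldb::DB* db) {
  std::unique_ptr<leveldb::Iterator> it(
      db->NewIterator(leveldb::ReadOptions()));
  std::vector<std::string> ancestors;  // path from the root to current node
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    const std::string key = it->key().ToString();
    // Pop ancestors that are not prefixes of this key: their branch ended.
    while (!ancestors.empty() &&
           key.compare(0, ancestors.back().size(), ancestors.back()) != 0) {
      ancestors.pop_back();
    }
    std::cout << std::string(2 * ancestors.size(), ' ') << key << "\n";
    ancestors.push_back(key);
  }
}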
legendary
Activity: 1232
Merit: 1094
There's not a way around this, other than just having full nodes store the entire trees.  Which means we're back to square one :(

That info has to be stored.  The way I see it, the live tree (and maybe 50-100 blocks of history) needs to be stored.

However, you could do it in a distributed fashion.  Every node in the tree has to be stored somewhere.

The spender could provide the old path, plus a new path that was correct within the last 50 blocks.  The top of the tree, which would change every block, would be live for all full nodes anyway.

You only have to look at transactions that start with the same prefix as yours to see whether the hashes on your path to the root change.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
gmaxwell pointed out the obvious flaw in this proposal:  you can supply the input branches to prove that the TxOuts you are spending exist, but you can't supply the destination branches.  Without those, full nodes have no idea how to update the sub-branches of the target address.  Even if they know that this is the first UTXO for that address, there may be lots of other branches on the way down to that node which are unknown to them.

There's not a way around this, other than just having full nodes store the entire trees.  Which means we're back to square one :(
legendary
Activity: 1232
Merit: 1094
Another feature (or disadvantage) is that it allows dropping the extra info that gets embedded into the blockchain.

For the system to work, all you need is lots of sha(sha(value)) to value mappings.  The values are always 2 x 40 bytes.  Values are always "hash(child1);coins(child1);hash(child2);coins(child2)".

This means that there is no bloat.  It is up to the coin owner to keep the full transaction data and they only submit it for spending.

You can still timestamp documents, but not add data to the blockchain as a permanent record.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Right.  Just use the transaction hash directly as key and accept that there might be imbalances.  However, the imbalances are not really going to happen because you are using a hash (effectively random) value as key.  So the law of large numbers does tree balancing for you.

Just to clarify:  tries/PATRICIA trees/de la Briandais trees do not have balancing issues.  They are all tightly bounded by a maximum number of operations for queries, inserts, and deletes.  It's just that the optimizations of PATRICIA/Briandais bring you far below that constant upper bound.  Thus, "unbalancing" simply removes optimization, but you're still operating well within the confines of constant time, no matter what the tree structure looks like.

The distinction only matters if there was reason to believe that those optimizations are necessary to make this idea feasible.  I do not believe that is the case here.  In terms of access times, I believe even a regular old trie (forcing full traversal of each path) would still work.

legendary
Activity: 1078
Merit: 1003
And wouldn't this also remove traceability, since other nodes now only have a fingerprint of the transaction(s) with which you received your coins, and not the entire history anymore? I like this idea a lot on the surface.
legendary
Activity: 1232
Merit: 1094
This is yet another area where the trie-based structures win -- since there is branch independence in tries, peers don't have to store what's below a certain node as long as what's below it is not changing (they only need the fingerprint of that node after the last update to it).

Right.  Just use the transaction hash directly as key and accept that there might be imbalances.  However, the imbalances are not really going to happen because you are using a hash (effectively random) value as key.  So the law of large numbers does tree balancing for you.

The way I would see it is that the owner stores the historical path and the network stores the live state of the tree.  However, when your client connects it boosts the local part of the tree near its coins.  Telling nodes about the local area helps secure your own coins.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
I was thinking about this as a pure system that would allow dropping the blockchain storage (except for recent blocks) completely.  Effectively, the data associated with coins would be moved to the owners of the coins.

You need to store the merkle path down to the transaction that created the coin.  This won't change with time.

However, to spend it you also need the live path from the latest root.

This tree could be stored in a distributed fashion.  It only needs to be stored for a while, but the longer the better.

Your client could try to keep track of the tree down to your transactions.  The amount of data depends on the number of transaction outputs.  If there were 1 trillion outputs, you would need to store around 40 levels before yours was the only transaction left on the path.

This works out at 40 * 32 bytes = 1280 bytes per block per coin.  This data rate could be decreased by combining all your coins into one and/or reducing the time between asking for the updated data.

The longer the network stores update data, the less often nodes need to update.

That's a super interesting idea.  I'll have to sleep on that one.

Right now, everyone tracks everyone's coins, and you are responsible for maintaining your own private keys.  In your system, you are maintaining your own private keys and the subtree information about the coins contained within.  Nodes don't have to keep that information, because they only need it when the coins are being spent, and you are the only one who ever spends the coins.

Sure, part of the UTXO set could be "lost" if your computer crashes, but those coins are lost, too, so it doesn't really matter...?  If the coins will never be spent, the sub-tree never changes, and so nodes don't care what's inside that subtree, as they have its fingerprint.  I have not had time to think through the implications (or complications) of it, but it's a super-interesting thought experiment.


P.S. -- This is yet another area where the trie-based structures win -- since there is branch independence in tries, peers don't have to store what's below a certain node as long as what's below it is not changing (they only need the fingerprint of that node after the last update to it).  If you use a BST, this doesn't work, because updates in neighboring branches can cause rebalances, forcing you to know what's in this branch so that nodes can be shuffled between branches.

legendary
Activity: 1232
Merit: 1094
I was thinking about this as a pure system that would allow dropping the blockchain storage (except for recent blocks) completely.  Effectively, the data associated with coins would be moved to the owners of the coins.

You need to store the merkle path down to the transaction that created the coin.  This won't change with time.

However, to spend it you also need the live path from the latest root.

This tree could be stored in a distributed fashion.  It only needs to be stored for a while, but the longer the better.

Your client could try to keep track of the tree down to your transactions.  The amount of data depends on the number of transaction outputs.  If there were 1 trillion outputs, you would need to store around 40 levels before yours was the only transaction left on the path.

This works out at 40 * 32 bytes = 1280 bytes per block per coin.  This data rate could be decreased by combining all your coins into one and/or reducing the time between asking for the updated data.

The longer the network stores update data, the less often nodes need to update.
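
A minimal sketch of the owner-side check this implies, assuming a plain double-SHA256 pairing rule (the sum-value variant discussed further down the page would carry a satoshi total alongside each hash):

Code:
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

#include <openssl/sha.h>

using Hash32 = std::array<uint8_t, 32>;

static Hash32 sha256d(const uint8_t* in, size_t len) {
  Hash32 a, b;
  SHA256(in, len, a.data());
  SHA256(a.data(), a.size(), b.data());
  return b;
}

// Hash upward one level at a time: at each of the ~40 levels the owner
// stores the sibling digest plus which side the running hash sits on.
bool verifyPath(Hash32 cur,
                const std::vector<std::pair<Hash32, bool>>& path,  // bottom-up
                const Hash32& root) {
  for (const auto& step : path) {
    const Hash32& sibling = step.first;
    const bool curIsLeft = step.second;
    uint8_t buf[64];
    const Hash32& l = curIsLeft ? cur : sibling;
    const Hash32& r = curIsLeft ? sibling : cur;
    std::copy(l.begin(), l.end(), buf);
    std::copy(r.begin(), r.end(), buf + 32);
    cur = sha256d(buf, sizeof buf);
  }
  return cur == root;
}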
full member
Activity: 182
Merit: 100
This certainly sounds like a very well-engineered solution; any idea if this will ever be implemented?
legendary
Activity: 1232
Merit: 1094
It needs to be a sum-tree over txout value so that the utxo root also shows the currency hasn't been inflated and allows stochastic utxo checks. That's easy enough, just an implementation detail.

So, the hash becomes 40 bytes instead of 32.  The extra 8 bytes are the total, in satoshis, of the UTXOs below the node (4 bytes would be too small: the 21 million BTC cap is 2.1 quadrillion satoshis, which needs 51 bits).

For a node with 1 child:
hash(node) = hash(child)

For a node with 2 children:
hash().hash -> the 32-byte hash (sha256 squared)
hash().value -> total value (uint64; at most 21 million BTC in satoshis)
hash() -> hash().hash : hash().value

hash(node).hash = sha256(sha256(hash(child1) : hash(child2)))
hash(node).value = hash(child1).value + hash(child2).value
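
A compact sketch of those two rules, using OpenSSL's SHA256 for the squared hash and a 64-bit satoshi sum:

Code:
#include <cstdint>
#include <cstring>

#include <openssl/sha.h>

struct NodeDigest {
  uint8_t hash[32];  // sha256 squared over the two child digests
  uint64_t value;    // sum of satoshis below this node
};

// hash(node) for a node with 2 children, per the rules above.
NodeDigest combine(const NodeDigest& c1, const NodeDigest& c2) {
  // Serialize hash(child1) : hash(child2), each as hash followed by value.
  uint8_t buf[2 * (32 + 8)];
  std::memcpy(buf +  0, c1.hash,   32);
  std::memcpy(buf + 32, &c1.value,  8);
  std::memcpy(buf + 40, c2.hash,   32);
  std::memcpy(buf + 72, &c2.value,  8);

  NodeDigest parent;
  uint8_t once[32];
  SHA256(buf, sizeof buf, once);           // first sha256
  SHA256(once, sizeof once, parent.hash);  // sha256 squared
  parent.value = c1.value + c2.value;      // the sum-tree invariant
  return parent;
}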

Quote
The other thing is that we need some way of testing that randomly selected transactions in a block are spending an input that was in the utxo set (not spending coins from thin air). That's easy, except when they're spending txouts in the same block. Is there a way to accommodate that without requiring a separate by-output lookup against transactions in the blocks as a p2p message?

That actually seems to be a sub-problem of the main problem we are trying to solve, but at the sub-block level.  You can prove the TxOut was created by transaction 10 in the current block, but if it is spent by transaction 100, how do you know whether it was already spent by one of transactions 11 through 99?

I think a merkle tree over all the intermediate states of the block would be required.  This actually might not be that big a deal.  The miner already has to add and remove all the UTXOs to/from the tree anyway; this just means also appending each intermediate root to an array and computing a merkle root over it.

The block header (or alt-chain header) would have 2 new fields:
Root of the UTXO tree
Merkle root of the intermediate UTXO-tree roots, one per transaction in this block
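
A sketch of how that second field could be computed: the miner snapshots the UTXO-tree root after applying each transaction, then merkleizes the list (duplicating the last entry on odd levels, the same convention Bitcoin's transaction merkle tree uses):

Code:
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

#include <openssl/sha.h>

using Hash32 = std::array<uint8_t, 32>;

static Hash32 sha256d(const uint8_t* in, size_t len) {
  Hash32 a, b;
  SHA256(in, len, a.data());
  SHA256(a.data(), a.size(), b.data());
  return b;
}

// 'level' starts as the per-transaction UTXO roots, in block order.
Hash32 merkleRootOfUtxoRoots(std::vector<Hash32> level) {
  while (level.size() > 1) {
    if (level.size() % 2 != 0) level.push_back(level.back());
    std::vector<Hash32> next;
    for (size_t i = 0; i < level.size(); i += 2) {
      uint8_t buf[64];
      std::copy(level[i].begin(), level[i].end(), buf);
      std::copy(level[i + 1].begin(), level[i + 1].end(), buf + 32);
      next.push_back(sha256d(buf, sizeof buf));
    }
    level.swap(next);
  }
  return level.front();  // assumes at least one transaction in the block
}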
hero member
Activity: 668
Merit: 501
The other thing is that we need some way of testing that randomly selected transactions in a block are spending an input that was in the utxo set (not spending coins from thin air). That's easy, except when they're spending txouts in the same block. Is there a way to accommodate that without requiring a separate by-output lookup against transactions in the blocks as a p2p message?

I would think there is an obvious, straightforward way.
Let's say a node requests matching transactions, using bloom filtering, from a different node that has already received the latest block.
If a transaction has dependencies outside of the UTXO set, all parent transactions are also automatically provided until they are all included, even if they do not match the bloom filter.
If it waited until the next block, the case would become "trivial" again.
staff
Activity: 4284
Merit: 8808
So some discussions from the inflation-proofing thread provided two additional requirements for the UTXO tree:

It needs to be a sum-tree over txout value so that the utxo root also shows the currency hasn't been inflated and allows stochastic utxo checks. That's easy enough, just an implementation detail.

The other thing is that we need some way of testing that randomly selected transactions in a block are spending an input that was in the utxo set (not spending coins from thin air). That's easy, except when they're spending txouts in the same block. Is there a way to accommodate that without requiring a separate by-output lookup against transactions in the blocks as a p2p message?
legendary
Activity: 1078
Merit: 1006
100 satoshis -> ISO code
As an aside, roughly, what is the total number of unspent transaction outputs at the moment?

It would also be interesting to know what percentage are
legendary
Activity: 1232
Merit: 1094
As an aside, roughly, what is the total number of unspent transaction outputs at the moment?