
Topic: Ultimate blockchain compression w/ trust-free lite nodes - page 7. (Read 87868 times)

legendary
Activity: 1232
Merit: 1083
If I understand (remember) correctly, Socrates1024 is making the point that someone could theoretically do mining without having the full blockchain.  All they need is the root hash of the last block, and to see the full blocks coming in.  This is because you can verify that the UTXOs being spent by the new transactions are unspent with a simple branch request from a peer.

I think providing a way to pre-package blocks would help here.  You send the block you are mining in advance (subject to spam protection).

Another option is to allow transactions to be packaged.  You send a hash + 64 transactions.  Later, you can include all 64 just with the header hash.
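The packaging idea could be sketched like this. The double SHA-256 is Bitcoin's usual hash, but the bundle format and helper names here are purely illustrative, not anything specified in the thread:

```python
import hashlib

def dsha256(b: bytes) -> bytes:
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def package_hash(txids: list) -> bytes:
    """Commit to an ordered bundle of up to 64 txids with one hash.
    A peer that has already seen the bundle can later be sent just
    this 32-byte hash instead of all 64 transactions."""
    assert len(txids) <= 64
    return dsha256(b"".join(txids))
```

The saving comes from the second transmission: once a peer acknowledges the package hash, the block relay only needs the hash, not the transactions themselves.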
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
I'm not doubting that it's possible, just wondering if you'd actually be gaining anything. I suppose the application would be for devices with suitable bandwidth but very tight memory constraints, like hardware wallets?

If I understand (remember) correctly, Socrates1024 is making the point that someone could theoretically do mining without having the full blockchain.  All they need is the root hash of the last block, and to see the full blocks coming in.  This is because you can verify that the UTXOs being spent by the new transactions are unspent with a simple branch request from a peer.

Right now, if you wanted to do this... you can't.  You can easily prove that a UTXO once existed, but you don't know it was spent since then without downloading all the blocks since then and verifying no spends.  But with this UTXO tree structure, you can prove both inclusion and exclusion of UTXOs on any given block.  It may not be that big of a distinction today, but perhaps in the future when you require TB of storage, that could make a difference. 
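The "simple branch request" check reduces to ordinary Merkle branch verification against a known root. A minimal sketch, assuming the usual double-SHA256 construction (function names are illustrative):

```python
import hashlib

def dsha256(b: bytes) -> bytes:
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def verify_branch(leaf_hash, branch, root):
    """Walk a Merkle branch from a leaf up to a known root.
    `branch` is a list of (sibling_hash, sibling_is_left) pairs,
    ordered from the leaf's level up to just below the root."""
    h = leaf_hash
    for sibling, sibling_is_left in branch:
        h = dsha256(sibling + h) if sibling_is_left else dsha256(h + sibling)
    return h == root
```

With a UTXO tree ordered by key, the same mechanism proves exclusion: the peer supplies the two adjacent leaves that straddle where the queried UTXO would sit.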

But I do question how much is gained -- since only some limited fraction of the network could operate like this without being extremely burdensome on the actual full nodes.  And I'm not confident that these lite nodes could expect reliable branch information from each other even if they all "agreed" to hold, say, 1% of the full thing.  I certainly wouldn't want to risk my miner becoming idle because it's having trouble finding some subset of the tree (or large subsets of the network going dark for that reason).

Have I understood this properly?



One thing that has come up before is that I'd like to see an additional piece of information added to the header:  fast-forward data for each block.  sipa has already implemented undo data because you need it in the event of a reorg, and I tried to convince him he might as well include fast-forward data, because you save some 75% bandwidth on transmission of block data (to those that request it).  If the OutPoints-removed-and-TxOuts-added data is organized into a merkle tree, then that root could be included in the header, and such UTXO-only nodes would be able to avoid pulling whole blocks.  sipa didn't like the complexity for a "constant" factor, but I think 75% is a worthy constant factor.  Unless I'm missing something...
legendary
Activity: 905
Merit: 1011
I'm not doubting that it's possible, just wondering if you'd actually be gaining anything. I suppose the application would be for devices with suitable bandwidth but very tight memory constraints, like hardware wallets?
full member
Activity: 126
Merit: 108
Andrew Miller
IMO your 3-point plan is missing out on the best aspect of having a Merkle UTXO. It's unnecessary for Level 2 to be a weaker-security "SPV+"; it can still do Full Validation even without having to download the whole chain. The reason why is that you can check EVERY TX IN A BLOCK (not just ones involving your wallet) just by knowing the root hash of the UTXO and requesting short checkable proofs from untrusted nodes.

At one point I convinced etotheipi of how this works, and he basically said he hadn't realized that it would be possible. https://bitcointalksearch.org/topic/storing-utxos-in-a-balanced-merkle-tree-zero-trust-nodes-with-o1-storage-101734 I made a reference implementation using a red-black tree, but it would be totally fine to substitute a merkle patricia trie for it.
legendary
Activity: 2618
Merit: 1006
I might start mining again if this gets implemented! :)
Also, as long as a UTXO block is merge-mined every few hours or even days, this is still much better/faster than the current situation with checkpoints + the bootstrap.dat torrent.

By the way, thanks for not rushing this and taking the time to debate proper solutions, as this is in my opinion one thing that can drive forward bitcoin adoption a LOT more than another porn site accepting them! The frustration and confusion when starting a full client for the first time is surely one of the major reasons why this is still seen as a "geek tool". It is important to get this right on the first try.
legendary
Activity: 1120
Merit: 1149
In case anyone misses 2112's remarkably obscure troll with regard to MUMPS, see this: http://thedailywtf.com/Comments/A_Case_of_the_MUMPS.aspx
legendary
Activity: 2128
Merit: 1065
I originally wrote this post a couple of days ago in response to another humorous/pithy comment. However, the original comment got deleted before I finished editing, and my reply made almost no sense afterwards. It appears that this thread is now approaching its end, so I decided to post a re-edited version to put it on permanent public record.

The original Greenspun's tenth rule of programming states:
Quote
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
I propose a modified version of it pertaining to Bitcoin and other blockchain-based cryptocurrencies:
Quote
Any sufficiently complete attempt at implementing Bitcoin will contain an ad hoc, informally-specified, bug-ridden, slow implementation of half of MUMPS.

I'm not trying to suggest that MUMPS should be used to implement Bitcoin. I'm just observing that there will be a tremendous amount of work expended to re-invent and re-implement one key feature of MUMPS: larger-than-core sparse hierarchical tree storage with all the expected ACID properties.

For those who are interested in why technology from circa 1975 outperforms all previous attempts at blockchain storage (both full and pruned), I have the following links:


It is a shame that I don't know of any open-source software that is compatible with the MIT license and other requirements specific to Bitcoin.
legendary
Activity: 905
Merit: 1011
Since @etotheipi is occupied for the next half-year and since I have a large interest in this proposal, I am offering my services to help make it happen. I have created a new thread with specific details of my proposal:

https://bitcointalksearch.org/topic/m.2135237
hero member
Activity: 544
Merit: 500
Fascinating. Logical. Influential... Open source development is such a privilege to watch.
legendary
Activity: 1896
Merit: 1353
I don't think so.  If you are at ABCD and it has only 3 children, "ABCDE" "ABCDP" and "ABCDZ", there's still only 3 seeks.  You seek for "ABCDA", and the iterator ends up at ABCDE (which is the first element equal-to-or-greater-than your seek value).  So you know that's the first child, and that there's no point in seeking for "ABCDB", "ABCDC" etc.  Then your next seek is "ABCDF", which puts you at "ABCDP".  Rinse and repeat.
oh indeed, I did not see that. thank you
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
My point was you don't need any pointers at all, and finding the children isn't actually that long since the database is efficient at these kinds of operations.  If you are node "ABCD" and want to go to pointer P, you don't need a pointer to know how to get there.  Just iter->Seek("ABCDP") and you'll end up at the first element equal to or greater than it.  At the deeper levels, the iterators will efficiently seek directly in front of themselves, and may already have your next target in cache.

If it starts with "ABCD" you know you are still in a child of ABCD, and if not, you know you are in a parallel branch and can finish processing the "ABCD" node.  Yes, there may be a lot of seek operations, but with the built-in optimizations, there's a very good chance that they will be fast, and because it's a PATRICIA tree, you'll rarely be doing more than 6 such operations to get the branch updated.  

no, you need to know the list of children in order to compute the hash of a node.
if you don't store pointers at all, you'll need to perform 256 iter.seek() and iter.next() operations per node, only to know its list of children

I don't think so.  If you are at ABCD and it has only 3 children, "ABCDE" "ABCDP" and "ABCDZ", there's still only 3 seeks.  You seek for "ABCDA", and the iterator ends up at ABCDE (which is the first element equal-to-or-greater-than your seek value).  So you know that's the first child, and that there's no point in seeking for "ABCDB", "ABCDC" etc.  Then your next seek is "ABCDF", which puts you at "ABCDP".  Rinse and repeat.  
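This seek pattern can be mimicked with a sorted key list standing in for the database iterator (a sketch; real LevelDB bindings expose `Seek` on an iterator rather than `bisect`, and the function name here is invented):

```python
from bisect import bisect_left

def child_bytes(keys, prefix):
    """Enumerate the distinct branch bytes under `prefix` with one
    seek per existing child, instead of one per possible byte value.
    `keys` is a sorted list of byte-string keys standing in for the
    database's key order."""
    children = []
    target = prefix + b"\x00"                   # just past the node itself
    while True:
        i = bisect_left(keys, target)           # plays the role of iter->Seek(target)
        if i == len(keys) or not keys[i].startswith(prefix):
            break                               # left the subtree: done
        c = keys[i][len(prefix)]                # this child's branch byte
        children.append(c)
        if c == 0xFF:
            break
        target = prefix + bytes([c + 1])        # skip past this whole child branch
    return children
```

Three children cost three seeks regardless of how sparse the 256-slot fan-out is, which is exactly the point being made above.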

legendary
Activity: 1896
Merit: 1353
My point was you don't need any pointers at all, and finding the children isn't actually that long since the database is efficient at these kinds of operations.  If you are node "ABCD" and want to go to pointer P, you don't need a pointer to know how to get there.  Just iter->Seek("ABCDP") and you'll end up at the first element equal to or greater than it.  At the deeper levels, the iterators will efficiently seek directly in front of themselves, and may already have your next target in cache.

If it starts with "ABCD" you know you are still in a child of ABCD, and if not, you know you are in a parallel branch and can finish processing the "ABCD" node.  Yes, there may be a lot of seek operations, but with the built-in optimizations, there's a very good chance that they will be fast, and because it's a PATRICIA tree, you'll rarely be doing more than 6 such operations to get the branch updated.  

no, you need to know the list of children in order to compute the hash of a node.
if you don't store pointers at all, you'll need to perform 256 iter.seek() and iter.next() operations per node, only to know its list of children

Quote
On the other hand, I haven't thought this through thoroughly.  I only know that it seems like you can avoid the pointers altogether, which I was expecting to make up the bulk of the storage overhead.  i.e. each node currently will only hold a sum (8 bytes) and its own hash (32 bytes).  If you need the pointers, you could end up with 256 8-byte pointers per node in addition to it, which is actually quite heavy at the higher, denser levels.

you only need 1 bit per pointer (true iff a child node exists), that's 32 bytes.
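The 1-bit-per-child encoding is straightforward; a sketch (helper names are illustrative):

```python
def encode_children(present):
    """Pack the set of existing child branch bytes into a fixed
    32-byte (256-bit) vector: bit i is set iff child i exists."""
    bits = 0
    for c in present:
        bits |= 1 << c
    return bits.to_bytes(32, "little")

def decode_children(vec):
    """Recover the sorted list of child branch bytes from the vector."""
    bits = int.from_bytes(vec, "little")
    return [i for i in range(256) if bits >> i & 1]
```

At 32 bytes this sits alongside the 32-byte node hash and 8-byte sum, rather than dominating them the way 256 full pointers would.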
legendary
Activity: 1896
Merit: 1353
So, I would vote for:

{Pay2Hash160, Pay2PubKey65, Pay2PubKey33} all be serialized as 21 bytes:  0x00 + Hash160.  Any Pay2PubKey variants will be bundled under that single key.
{P2SH} scripts will be serialized as 21 bytes:  0x05 + Hash160{script}. 
{EverythingElse} Will simply be the raw script. 

One problem I see with this is that it doesn't make it clean to adopt new standard scripts, without reconstructing the database in the future...

Why not hash160(txout.scriptPubKey)? I had assumed from the beginning that's what we'd be doing. "Addresses" are a UI issue - the protocol should only concern itself with scripts.

+1
this is also what I have assumed
legendary
Activity: 905
Merit: 1011
So, I would vote for:

{Pay2Hash160, Pay2PubKey65, Pay2PubKey33} all be serialized as 21 bytes:  0x00 + Hash160.  Any Pay2PubKey variants will be bundled under that single key.
{P2SH} scripts will be serialized as 21 bytes:  0x05 + Hash160{script}. 
{EverythingElse} Will simply be the raw script. 

One problem I see with this is that it doesn't make it clean to adopt new standard scripts, without reconstructing the database in the future...

Why not hash160(txout.scriptPubKey)? I had assumed from the beginning that's what we'd be doing. "Addresses" are a UI issue - the protocol should only concern itself with scripts.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
I have started to experiment with this idea.
My goal is to add this hash tree to Electrum.

Each "numChildren" value (after the SumValue) can be exactly one byte, because you never have more than 256 ptrs, and each child pointer is also exactly 1 byte.  If you want to jump to a particular child, for instance, you are at node "11" and want to go the child at 3, you simply do iter->Seek("11"+"3") and it will skip "1122" and put the iterator right at "1137", which is the first database value >= "113".

Pointers can also be encoded as bits, using a fixed-size 32 bytes vector (assuming 256 pointers).
Of course variable-length storage would be more efficient, because most nodes will have sparse children, but I don't know if it is really worth the effort.
Indeed, keys will take up to 20 bytes, and node hashes will take 32 bytes anyway, so we're not adding an order of magnitude by using 32 bytes.

Quote
Furthermore, you might be able to get away without even any pointers!  You might just store the node/leaf hash and value, and know about children after the fact, simply by continuing your iteration.  You are at IterA, and IterB=IterA.Next().   You know that IterB is a child node of IterA because IterB.key().startswith(IterA.key()).   That's stupid simple.  

So, you know what level you're at simply by looking at Iter.size()
So, you know that you are a child because IterNext.key().startswith(IterPrev.key()).
If the previous check fails, you know you finished traversing that branch and you can update IterPrev.

Though, there may be something I'm missing that would still require you to store the pointers.  But it's still a lot better than storing 6-8 bytes per pointer, which was where I originally thought the bulk of the data was going to end up.

You can indeed do it without pointers, but iterating to find the children of a node can be very long.
And you will need to find the children of a node every time you update its hash.

My point was you don't need any pointers at all, and finding the children isn't actually that long since the database is efficient at these kinds of operations.  If you are node "ABCD" and want to go to pointer P, you don't need a pointer to know how to get there.  Just iter->Seek("ABCDP") and you'll end up at the first element equal to or greater than it.  At the deeper levels, the iterators will efficiently seek directly in front of themselves, and may already have your next target in cache.

If it starts with "ABCD" you know you are still in a child of ABCD, and if not, you know you are in a parallel branch and can finish processing the "ABCD" node.  Yes, there may be a lot of seek operations, but with the built-in optimizations, there's a very good chance that they will be fast, and because it's a PATRICIA tree, you'll rarely be doing more than 6 such operations to get the branch updated. 

On the other hand, I haven't thought this through thoroughly.  I only know that it seems like you can avoid the pointers altogether, which I was expecting to make up the bulk of the storage overhead.  i.e. each node currently will only hold a sum (8 bytes) and its own hash (32 bytes).  If you need the pointers, you could end up with 256 8-byte pointers per node in addition to it, which is actually quite heavy at the higher, denser levels.

After lots and lots of discussion and debate, I believe that the address index should be maintained as a trie-like structure.

It's possible to create a transaction that has no address at all. What is considered the address in this case?

There's a little room for negotiation on this topic, but ultimately an "address" is a TxOut script.  In a totally naive world, your "addresses" would just be the exact serialization of the TxOut script -- so a 25-byte "address" for each standard Pay2Hash160 script.  Or 35 or 67 for pay-to-public-key scripts.  23 bytes for a P2SH script.  And then anything that is non-standard would be simply serialized raw.

However, I don't like this, because a single address ends up with multiple equivalent representations.  Even though pay-to-public-key scripts are rare, there are addresses that use both (such as multi-use addresses that were used for mining and regular transactions).  Even though it's rare, you'd have to ask your peers for 2 different scripts per address (the Pay2Hash160 and PayToPubKey scripts).  I'd almost prefer making special cases for these addresses, given that they are so standard and fundamental to Bitcoin transactions.

So, I would vote for:

{Pay2Hash160, Pay2PubKey65, Pay2PubKey33} all be serialized as 21 bytes:  0x00 + Hash160.  Any Pay2PubKey variants will be bundled under that single key.
{P2SH} scripts will be serialized as 21 bytes:  0x05 + Hash160{script}. 
{EverythingElse} Will simply be the raw script. 

One problem I see with this is that it doesn't make it clean to adopt new standard scripts, without reconstructing the database in the future.  I suppose it wouldn't be the end of the world, but we also don't want to make an inflexible protocol decision.  This isn't just personal preference for storing address/scripts, it's actually describing the authenticated structure of the Reiner-tree.  So if we would be adding a new std script type, and we'd want a short form of it to store in the DB, we'd have to update the "protocol".  If this had been adopted already, that would be a hard fork.   If we just do raw scripts all around, this isn't really a problem, except that we may have to ask for extra branches to make sure we get all possible variants of a single public key.
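The voted-for key scheme might look like the sketch below. The script-pattern tests are deliberately simplified (real classification also checks push opcodes and key validity), and `index_key` is an invented name:

```python
import hashlib

def hash160(b: bytes) -> bytes:
    """SHA-256 then RIPEMD-160, Bitcoin's usual script/key hash."""
    return hashlib.new("ripemd160", hashlib.sha256(b).digest()).digest()

def index_key(script: bytes) -> bytes:
    """21-byte index key per the scheme above; raw script otherwise."""
    # Pay2Hash160: OP_DUP OP_HASH160 <20> ... OP_EQUALVERIFY OP_CHECKSIG
    if len(script) == 25 and script[:3] == b"\x76\xa9\x14":
        return b"\x00" + script[3:23]
    # Pay2PubKey33/65: <33- or 65-byte push> OP_CHECKSIG, bundled under 0x00
    if script[-1:] == b"\xac" and script[0] in (33, 65):
        return b"\x00" + hash160(script[1:1 + script[0]])
    # P2SH: OP_HASH160 <20> OP_EQUAL
    if len(script) == 23 and script[:2] == b"\xa9\x14":
        return b"\x05" + script[2:22]
    return script                                 # everything else: raw
```

Note how the Pay2PubKey branch collapses onto the same 0x00-prefixed key as Pay2Hash160, which is exactly the bundling described above, and also exactly the special-casing that makes adding future standard scripts a protocol change.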


@ ThomasV

I noticed you asked something about "SumValue" before.  I don't know if you got the question answered, but the idea was to recursively store the sum-of-value of each sub-branch, and have it authenticated along with the hashes.  Quite a few users, including gmaxwell (who was originally only lukewarm on this whole idea), determined that it was an extremely valuable addition to this spec to deal with miners who lie about their reward, knowing that the network is made up almost entirely of lite-nodes who have no way to determine otherwise.  But those lite nodes know what the total coins in circulation should be, and thus would only have to look at the root-sum-value to determine if someone cheated.

I don't know if I completely captured that concept.  I'm sure someone like gmaxwell or d'aniel can jump in and explain it better.  But it is an easy add-on to the original idea.  And also makes it possible to simply query your balance without downloading all the raw TxOuts (though, if you are using each address once, that doesn't actually save you a lot).
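A sum-augmented node might be committed to like this. The serialization is purely illustrative, not a specified format:

```python
import hashlib

def dsha256(b: bytes) -> bytes:
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def node_commitment(child_hashes, child_sums):
    """Hash a node over both its children's hashes and their value
    sums, so the root hash also authenticates total coins in
    circulation.  Returns (hash, sum) for use one level up."""
    total = sum(child_sums)
    payload = total.to_bytes(8, "little") + b"".join(
        h + s.to_bytes(8, "little")
        for h, s in zip(child_hashes, child_sums))
    return dsha256(payload), total
```

Because every node's sum is folded into its hash, a lite node that checks only the root can reject a chain whose root sum exceeds the known coin supply.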

legendary
Activity: 2142
Merit: 1009
Newbie
After lots and lots of discussion and debate, I believe that the address index should be maintained as a trie-like structure.

It's possible to create a transaction that has no address at all. What is considered the address in this case?
legendary
Activity: 1896
Merit: 1353
I have started to experiment with this idea.
My goal is to add this hash tree to Electrum.

Each "numChildren" value (after the SumValue) can be exactly one byte, because you never have more than 256 ptrs, and each child pointer is also exactly 1 byte.  If you want to jump to a particular child, for instance, you are at node "11" and want to go the child at 3, you simply do iter->Seek("11"+"3") and it will skip "1122" and put the iterator right at "1137", which is the first database value >= "113".

Pointers can also be encoded as bits, using a fixed-size 32 bytes vector (assuming 256 pointers).
Of course variable-length storage would be more efficient, because most nodes will have sparse children, but I don't know if it is really worth the effort.
Indeed, keys will take up to 20 bytes, and node hashes will take 32 bytes anyway, so we're not adding an order of magnitude by using 32 bytes.


Quote
Furthermore, you might be able to get away without even any pointers!  You might just store the node/leaf hash and value, and know about children after the fact, simply by continuing your iteration.  You are at IterA, and IterB=IterA.Next().   You know that IterB is a child node of IterA because IterB.key().startswith(IterA.key()).   That's stupid simple.  

So, you know what level you're at simply by looking at Iter.size()
So, you know that you are a child because IterNext.key().startswith(IterPrev.key()).
If the previous check fails, you know you finished traversing that branch and you can update IterPrev.

Though, there may be something I'm missing that would still require you to store the pointers.  But it's still a lot better than storing 6-8 bytes per pointer, which was where I originally thought the bulk of the data was going to end up.

You can indeed do it without pointers, but iterating to find the children of a node can be very long.
And you will need to find the children of a node every time you update its hash.


legendary
Activity: 1232
Merit: 1083
Could anyone let us know the current progress in implementation of this idea?

I was thinking of looking into it "soon", but I have lots of other stuff going on.  My thoughts are that it should be a distributed verification system and distributed hash table.

The official client is going down the path of not allowing random transaction lookup, so the DHT is needed to support that.

Each node would randomly select transactions to verify.  You might set your node to verify only 1% of all transactions (p = 0.01).  When you get a new block with N transactions, you would attempt to verify only p * N of them (though it would be random, so you might verify more or less than that).
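The per-node sampling could be as simple as the sketch below (hypothetical helper; the explicit rng argument is only there to make the example deterministic):

```python
import random

def sample_for_verification(txids, p, rng=None):
    """Independently pick each transaction with probability p, so a
    node configured with p=0.01 ends up checking about 1% of each
    block -- sometimes more, sometimes fewer, as described above."""
    rng = rng or random.Random()
    return [t for t in txids if rng.random() < p]
```

Independent per-transaction sampling (rather than picking exactly p*N) means a cheater cannot predict which transactions any given node will skip.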

My thoughts are that all nodes would verify all branches that they are aware of.  Orphans within say 10k of the end of the chain would be verified.

It just marks blocks as valid or invalid.

The distributed hash table needs to store all transactions and also all internal nodes in all trees that are in use.  It is hash -> children nodes.

When you connect to a node, you tell it what you think is the end of the main chain.  You also give the last 10 blocks and nodes along the way.

For each location, you give

- hash of main chain header
- hash of UTXO root root

You would also be monitoring the main chain so you can find the chain with the longest POW.

You can then try to find the fork points (since the power-of-2 spacing is relative to the start of the chain, all nodes would give the same heights).  POW disagreements can be fixed by proving to one of the nodes that it isn't on the longest branch.
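One reading of that start-relative sampling is sketched below. The post doesn't pin down exact heights, so this is an assumption; the point is only that every node derives the same comparison heights from its own tip:

```python
def shared_checkpoint_heights(tip_height):
    """Heights at successive powers of two from the start of the
    chain.  Because the spacing is anchored at genesis rather than
    at the tip, any two nodes report hashes at the same heights,
    so a fork point can be found by direct comparison."""
    out, h = [], 1
    while h <= tip_height:
        out.append(h)
        h *= 2
    return out
```

This differs from Bitcoin's getblocks locator, whose exponential steps are measured back from the tip and so vary between nodes at different heights.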

This should leave all nodes either agreeing or disagreeing based purely on validation.  You ask both nodes to prove the other node's block is invalid.  If a node won't switch to the longest POW branch, then you ask it to prove why.

This means that all nodes should keep a record of valid block headers (i.e. ones that meet POW) and the proof that they are actually invalid blocks.  This shouldn't happen that often, since creating a valid block header for invalid blocks is expensive.

This means that it doesn't even need to be an alt chain.  It is just a system where proof about invalid blocks is stored and shared.