Ultimate blockchain compression w/ trust-free lite nodes - page 17.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: casascius on June 21, 2012, 04:47:25 PM

If you mean a factor of 2016, how about 21. (21x96=2016) That's also a clean round factor of 21 million, as well as 210000 blocks between reward changes.

I think it is too early to make this decision. I just wanted to stress that the heaviest housekeeping updates should be phase-shifted with respect to the difficulty retarget. In other words the blocks just before and just after the retarget should involve only light housekeeping.

I haven't seen anyone doing any serious game-theoretic analysis of the possible splitting attacks on the global Bitcoin network during the retarget, but I just want to avoid creating additional headaches resulting from batch updates.

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

Quote from: 2112 on June 21, 2012, 04:44:50 PM

Quote from: casascius on June 20, 2012, 06:11:56 PM

I am the one who originally suggested the 100 block interval... but I don't think I said that updating the meta tree only every 100 blocks is what should be done.

Also I urge to seriously consider batch updating the primary storage structure. And keep the recently-heard-of updates in a separate storage area. This probably should be somehow similar to the generational garbage collection concept.

I would also urge to avoid using 100 but choose a divisor of 2016.

If you mean a factor of 2016, how about 21. (21x96=2016) That's also a clean round factor of 21 million, as well as 210000 blocks between reward changes.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: casascius on June 20, 2012, 06:11:56 PM

I am the one who originally suggested the 100 block interval... but I don't think I said that updating the meta tree only every 100 blocks is what should be done.

Also I urge to seriously consider batch updating the primary storage structure. And keep the recently-heard-of updates in a separate storage area. This probably should be somehow similar to the generational garbage collection concept.

I would also urge to avoid using 100 but choose a divisor of 2016.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: vuce on June 21, 2012, 03:47:07 PM

AVL tree is the mother of balanced binary trees. They have the smallest "worst case height", so the fastest query, but a bit slower insert/delete than red-black trees.

I just wanted to point out that query speed is pretty much immaterial. All it matters is the update complexity.

Integrating over the world population of Bitcoin clients the probability of any particular key being queried is almost 0, but the probability of any particular key being inserted/deleted is almost 1. This is pretty much the exact opposite of the assumptions made in all classic information storage and retrieval texts.

If you come up with a really good storage tree with low overhead for insert/delete but bad query performance you can easily fix it by maintaining a secondary index structure that facilitates fast query for keys that are locally interesting. That secondary structure may be different for each individual client and be dependent on the local querying behavior.

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

I would consider height to be a worthy thing to control, not so much for query speed, but because the nodes from leaf to ancestor might have to be transmitted across the network in response to a query.

Also as we discuss these tree types, I want to make sure we are not straying far from the definition of a merkle tree, to maintain the desirable property of being able to prove that key x is or is not in the tree by providing all of the hashes necessary to climb to the root. All of these nifty tree types that put data in the branches rather than the leaf nodes likely may not retain that important property. I read about red black trees on Wikipedia and notice the data does not go in the leaf nodes and cannot clearly see how I could clip part of that tree and hand it to someone and they be able to trust my clipping via a merkle root.

vuce

sr. member

Activity: 476

Merit: 250

Quote from: etotheipi on June 21, 2012, 03:42:30 PM

@vuce
The structure of a trie has no dependence on insert order. Given a set of data, there is only one trie that can hold it. The same goes for Patricia tries (which are level-compressed tries). And given that its query, insert and delete times are based strictly on key size (which will be bounded as above), there are no balance issues at all: it always takes exactly 32 "hops" to get from the root to the leaf you want, regardless of whether you are querying, inserting or deleting. So given fixed key-length, all those operations are actually O(1).

I was citing the patricia trie wiki, where it's pretty obvious that new inserts are inserted as leaves of the tree, therefore making them insert order dependent. If you would direct me to a better explanation I would appreciate it.

nevermind, I misunderstood how it works Embarrassed

FWIW, I agree, I think this might be the best choice.

Quote

What other data structures am I missing that could be considered? I know B-trees would be a good choice if we are going with insert-order-dependent structure: they are easy to keep balanced with fairly simple rules for balancing.

AVL tree is the mother of balanced binary trees. They have the smallest "worst case height", so the fastest query, but a bit slower insert/delete than red-black trees. They are also very easy to implement.

2-3-4 tree might also be worth considering. Don't know if it's insert-order-independent or not but by a quick look it might be. Or maybe plain 2-3 tree, that one has data in leaves only so it looks kind of like a merkle tree, but does have quite a bit of overhead.

etotheipi

legendary

Activity: 1428

Merit: 1093

Core Armory Developer

Quote from: maaku on June 21, 2012, 12:44:05 PM

@etothepi, what about non-standard transactions? Including IP, P2SH and future contract formats. Not all outputs can be reduced to an address. We've been speaking loosely about a tree of “addresses” but it would really have to be a tree of output scripts, so it's not going to be possible to limit search-string length for the prefix trie.

I was expecting that the hash of the TxOut script would be used so that all nodes are exactly 32-bytes. You could argue that's exactly what most TxOut scripts already are: hashes of longer data fields (such as using hash160s in place of public keys), but you have to make sure the search key is strictly bounded in size if you're using a trie of some sort.

@vuce
The structure of a trie has no dependence on insert order. Given a set of data, there is only one trie that can hold it. The same goes for Patricia tries (which are level-compressed tries). And given that its query, insert and delete times are based strictly on key size (which will be bounded as above), there are no balance issues at all: it always takes exactly 32 "hops" to get from the root to the leaf you want, regardless of whether you are querying, inserting or deleting. So given fixed key-length, all those operations are actually O(1).

On the other hand, I was hoping for a structure that wasn't too complicated, and both RB trees and Patricia tries have complicated implementations (even though the concepts behind them are fairly simple). But if we're going to have to go with something complicated, anyway (to limit worst-case speed and time performance), then I'd have to vote for Patricia trie or variant. Not only is it O(1)... someone brought up the very good point that updates to the tree can mostly be parallelized. That sounds like another good property of a tree that's going to have very high update rates...

I just gotta spend some time to figure out the space overhead for storing TxOuts. If it's going to triple the overall disk space compared to other structures, it might be worth using one of the insert-order dependent trees.

What other data structures am I missing that could be considered? I know B-trees would be a good choice if we are going with insert-order-dependent structure: they are easy to keep balanced with fairly simple rules for balancing.

CoinLab

sr. member

Activity: 270

Merit: 250

1CoinLabF5Avpp5kor41ngn7prTFMMHFVc

This is a very interesting idea. Excited to see how it develops.

vuce

sr. member

Activity: 476

Merit: 250

Quote from: etotheipi on June 21, 2012, 12:18:11 PM

I'm voting for level-compressed trie structures (so, a variant of a patricia trees) which have no balance issues at all, are insert-order-independent, and O(1) query/insert/delete.

Quote

To insert a string, we search the trie until we can make no further progress. At this point we either add a new outgoing edge labeled with all remaining characters in the input string, or if there is already an outgoing edge sharing a prefix with the remaining input string, we split it into two edges (the first labeled with the common prefix) and proceed.

So new insertions go to the leaves of the tree, I think this would make it insert dependent - just like any other tree. I'd suggest avl instead of r-b, since it has lower worst height.

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

Quote from: maaku on June 21, 2012, 12:44:05 PM

@etothepi, what about non-standard transactions? Including IP, P2SH and future contract formats. Not all outputs can be reduced to an address. We've been speaking loosely about a tree of “addresses” but it would really have to be a tree of output scripts, so it's not going to be possible to limit search-string length for the prefix trie.

I would think that all that matters is there be a deterministic index that can be used to look it up.

P2SH has a hash. IP, to the best of my knowledge, isn't a kind of transaction, but is just a way to produce a pubkey-based transaction (from which an address/hash can be derived). Transaction formats yet to be invented could easily stipulate some way of being found, if a simple default of "first hash in the script, or hash of [first constant | all concatenated constants] in the script bigger than X bits, whichever comes first" didn't solve most or all cases with a single broad stroke. (For example, if such a default didn't make sense for a future transaction type, that future transaction type could contain a field that says "My Search Key is X".)

maaku

legendary

Activity: 905

Merit: 1014

@etothepi, what about non-standard transactions? Including IP, P2SH and future contract formats. Not all outputs can be reduced to an address. We've been speaking loosely about a tree of “addresses” but it would really have to be a tree of output scripts, so it's not going to be possible to limit search-string length for the prefix trie.

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

Quote from: etotheipi on June 21, 2012, 12:18:11 PM

This is a topic I've been debating with folks on IRC the past couple days. It's clear that most tree data strcutures have most of the properties we want. Red-Black trees work great, but I don't like that the specific underlying structure (and thus root hash) depends on the specific order/history of insertions and deletions. And assumes that every red-black implementation uses the rebalancing algorithm. I have been given compelling reasons why this shouldn't be a problem, but I am personally not convinced yet. Though I agree that it is probably an acceptable solution.

Regardless of the tree structure chosen, why not rebuild it every 100 blocks, just for the sole purpose of having a periodic way of deterministically regenerating the tree, and to avoid mandating a continuous dependency on all prior versions of the meta tree in order to be certain that one has the "correct" permutation of the meta tree?

etotheipi

legendary

Activity: 1428

Merit: 1093

Core Armory Developer

Quote from: casascius on June 21, 2012, 11:16:19 AM

Quote from: jojkaart on June 21, 2012, 11:13:07 AM

Not quite true. For example, the red-black tree algorithm guarantees worst case operation of O(log N). Most branches, although they will move, will keep their hashes as they were, no need to recalculate.

- Joel

The red-black tree is a concept I don't yet understand. But if choosing this type of structure brings the benefit of O(log N) updates without introducing any negatives, then I'm all for it, and of course the periodic rebalance would become unnecessary.

This is a topic I've been debating with folks on IRC the past couple days. It's clear that most tree data strcutures have most of the properties we want. Red-Black trees work great, but I don't like that the specific underlying structure (and thus root hash) depends on the specific order/history of insertions and deletions. And assumes that every red-black implementation uses the rebalancing algorithm. I have been given compelling reasons why this shouldn't be a problem, but I am personally not convinced yet. Though I agree that it is probably an acceptable solution.

I'm voting for level-compressed trie structures (so, a variant of a patricia trees) which have no balance issues at all, are insert-order-independent, and O(1) query/insert/delete. The problem is they can have a lot of storage overhead per tree element. I haven't done the calculation to know for sure just how bad it is.

Once I get out my next version of Armory, I will be diving into this a bit more, hopefully creating a proposal with much more specifics about tree structure, and the CONOPs (concept of operations) of the meta-chain.

EnergyVampire

full member

Activity: 210

Merit: 100

~~Subscribing~~

Added to Watchlist: https://bitcointalksearch.org/topic/watchlist-90136

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

Quote from: jojkaart on June 21, 2012, 11:13:07 AM

Not quite true. For example, the red-black tree algorithm guarantees worst case operation of O(log N). Most branches, although they will move, will keep their hashes as they were, no need to recalculate.

- Joel

The red-black tree is a concept I don't yet understand. But if choosing this type of structure brings the benefit of O(log N) updates without introducing any negatives, then I'm all for it, and of course the periodic rebalance would become unnecessary.

jojkaart

member

Activity: 97

Merit: 10

Quote from: casascius on June 21, 2012, 11:07:48 AM

My understanding is that removing leaf nodes from a sorted Merkle tree while maintaining the constraint that the tree remain sorted and balanced has the potential to cause the hash of every node to be recalculated. Imagine going from a tree that has 513 nodes to one that has 512 nodes. The tree will lose a whole rank. That's an extreme case, but not far from the typical case: if you remove a leaf out of the middle and don't replace it with a placeholder, all the leaf nodes will shift left by one position to maintain the sort and balance constraints, and every parent of any node that has shifted will be recalculated. The closer the removal is to the left side of the tree, the greater proportion of the tree must be recalc'd. A recalc of a tree in the hundreds of MB or in the GB's for every incoming Bitcoin transaction would be overbearingly expensive. All transactions result in the spending, and therefore, a deletion of at least one leaf node, so this kind of update would be CPU-intensive for every roll of Satoshi's dice. So my idea - to keep the tree view consistent and keep the updating to log(N) would be to only balance the tree on a predetermined interval, and at any point between, use placeholders and allow leaf nodes to become branches to conserve updating resources.

Not quite true. For example, the red-black tree algorithm guarantees worst case operation of O(log N). Most branches, although they will move, will keep their hashes as they were, no need to recalculate.

- Joel

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

Quote from: maaku on June 21, 2012, 03:00:53 AM

Quote from: casascius on June 20, 2012, 06:11:56 PM

I am the one who originally suggested the 100 block interval... but I don't think I said that updating the meta tree only every 100 blocks is what should be done.

Rather, the meta tree should be rebalanced every 100 blocks, and between then, nodes should be added and deleted using methodology that avoids (procrastinates) having to recalculate the hashes for most or all the nodes in the tree any time there was a change. Otherwise, every incoming transaction will have a huge CPU time burden that's not sustainable. Rebalancing the tree is much like rebuilding a database index.

I'm not sure I follow. Updating any tree (balanced or not) is a constant-time operation. Updating any Merkle-tree (balanced or not) is log(N), although you can save a little effort by marking updated nodes as dirty and only updating Merkle hashes at the end. Rebuilding a database is N*log(N), a different beast. Anyway that's beside the point. Updating a balanced Merkle tree shouldn't be any more complex, algorithmically, than leaving it unbalanced. Unless I'm missing something; am I?

My understanding is that removing leaf nodes from a sorted Merkle tree while maintaining the constraint that the tree remain sorted and balanced has the potential to cause the hash of every node to be recalculated. Imagine going from a tree that has 513 nodes to one that has 512 nodes. The tree will lose a whole rank, and 100% of its hashes will change. That's an extreme case, but not far from the typical case: if you remove a leaf out of the middle and don't replace it with a placeholder, all the leaf nodes will shift left by one position to maintain the sort and balance constraints, and every parent of any node that has shifted will be recalculated.

The closer the removal is to the left side of the tree, the greater proportion of the tree must be recalc'd. A recalc of a tree in the hundreds of MB or in the GB's for every incoming Bitcoin transaction would be unsustainably and unscalably expensive. All transactions result in the spending, and therefore, a deletion of at least one leaf node, so this kind of update would be CPU-intensive for every user upon every roll of Satoshi's dice. So my idea - to keep the tree view consistent and keep the updating to log(N) would be to only balance the tree on a predetermined interval, and at any point between, use placeholders and allow leaf nodes to become branches (both of which - especially the latter - would make the tree no longer a proper Merkle tree) to conserve CPU resources during updates.

casascius

vip

Activity: 1386

Merit: 1140

The Casascius 1oz 10BTC Silver Round (w/ Gold B)

Quote from: tevirk on June 21, 2012, 01:10:25 AM

Sorry to be slow, but I don't see the gain here. If a lightweight client is going to trust that a metablock that's been merged into the chain is truthful (because it's been built into a block), then it can just as reliably trust that a transaction that's in the chain a few blocks back is valid, because it's been built into a block. There's no need for it to keep anything. The only real advantage here seems to be that it saves miners from having to have a hard disk, and it seems like a lot of engineering to do that.

Quite possibly I'm missing something, in which case it would probably help for someone to step back and explain the aims and benefits.

That works if, as a lightweight node, you plan only on receiving funds that have a very small number of confirmations, which eliminates your view of the majority of bitcoins that exist. In USD terms, this would be like limiting yourself to only being able to accept crisp dollar bills that have never been handled more than once or twice. More likely than not, you're going to need to be able to receive funds from anybody, which will have been confirmed anywhere on the block chain between the genesis block and now. You either need the whole block chain to know whether a given incoming transaction is valid, or at least the digested tree of all unspent txouts for the entire block chain.

maaku

legendary

Activity: 905

Merit: 1014

Quote from: casascius on June 20, 2012, 06:11:56 PM

I am the one who originally suggested the 100 block interval... but I don't think I said that updating the meta tree only every 100 blocks is what should be done.

Rather, the meta tree should be rebalanced every 100 blocks, and between then, nodes should be added and deleted using methodology that avoids (procrastinates) having to recalculate the hashes for most or all the nodes in the tree any time there was a change. Otherwise, every incoming transaction will have a huge CPU time burden that's not sustainable. Rebalancing the tree is much like rebuilding a database index.

I'm not sure I follow. Updating any tree (balanced or not) is a constant-time operation. Updating any Merkle-tree (balanced or not) is log(N), although you can save a little effort by marking updated nodes as dirty and only updating Merkle hashes at the end. Rebuilding a database is N*log(N), a different beast. Anyway that's beside the point. Updating a balanced Merkle tree shouldn't be any more complex, algorithmically, than leaving it unbalanced. Unless I'm missing something; am I?

Quote from: tevirk on June 21, 2012, 01:10:25 AM

Sorry to be slow, but I don't see the gain here. If a lightweight client is going to trust that a metablock that's been merged into the chain is truthful (because it's been built into a block), then it can just as reliably trust that a transaction that's in the chain a few blocks back is valid, because it's been built into a block. There's no need for it to keep anything. The only real advantage here seems to be that it saves miners from having to have a hard disk, and it seems like a lot of engineering to do that.

Quite possibly I'm missing something, in which case it would probably help for someone to step back and explain the aims and benefits.

Wrong problem. It saves the lightweight client from having to download, verify, and keep track of any of the block chain at all, except for those parts the user cares about (their own unspent outputs, for example).

tevirk

newbie

Activity: 15

Merit: 0

Sorry to be slow, but I don't see the gain here. If a lightweight client is going to trust that a metablock that's been merged into the chain is truthful (because it's been built into a block), then it can just as reliably trust that a transaction that's in the chain a few blocks back is valid, because it's been built into a block. There's no need for it to keep anything. The only real advantage here seems to be that it saves miners from having to have a hard disk, and it seems like a lot of engineering to do that.

Quite possibly I'm missing something, in which case it would probably help for someone to step back and explain the aims and benefits.

Topic: Ultimate blockchain compression w/ trust-free lite nodes - page 17. (Read 87974 times)