Replacing old UTXOS with new one in place: an idea/suggested complexity reductio

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Quote

They finally replied
Quote
he -> Hashes ever. All the hashing ever done.
.

.
Yesterday when I was checking if they answered my issue, I found this
https://github.com/mit-dci/utreexo/issues/276
I think, my understanding that could be wrong I don't mind reading other should, could be the real reason for the output "he" having that unexplained/not logical value in the run...

Line 84 in the function that keeps running forever (probably meaning till the end of the run, manually as u did)

84 curHeight++

This value is probably added to "height" in line 99 here

99 for ; height <= knownTipHeight && !stop; height++ {

"he" when u stop the run

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Getting back to the main topic, I think I should add this in reply (like it is there)
.
Neglecting the effect on the proof size (which depends on the tree height)at least for now, I think replacement reduces the size/amount of "change" in proofs. When less branches r modified, less internal nodes change their concatenated hashes. This could be helpful, especially when u r caching some of them.
.
For example in the run results we were discussing, the numbers suggest an av of +9 (165251/180000) insertions, 6-7 deletions per block (surely this is a small test data), resulting in 2-3 increase in the no of UTXOs per block.
When inserting then deleting u change 15-17 branches along with the corresponding internal nodes hashes, while if u modified when possible u only mess with 9-10 branches.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

They finally replied

Quote

he -> Hashes ever. All the hashing ever done.

.
Well, I closed the issue but now I wonder again from where came the difference than the no of insertions?
165251-1654444=807?
-Also, if it's obviously should be the same why calculate it to appear in the o/p?
Is it just double-checking, or the difference is due to address reuse?
-If it is address reuse, how do they check it?
Something still not clear

Shymaa-Arafat

full member

Activity: 228

Merit: 156

To give the complete picture here, the previous reply came as a response of me starting an issue #270 inquiring about the run results at the Utreexo site
https://github.com/mit-dci/utreexo/issues/270
Summary of the rest not said here:
-yes total is the total running time.
-Still not convinced why the forest height is too large, 1654444 is not O(log N) by any case?!
and the print code line doesn't clear it

Quote

c.ScanBlock(blocknproof.Block) if c.CurrentHeight%10000 == 0 { fmt.Printf("Block %d add %d del %d %s plus %.2f total %.2f \n", c.CurrentHeight, totalTXOAdded, totalDels, c.pollard.Stats(), plustime.Seconds(), time.Since(starttime).Seconds())

They r all part of the c.pollard.Stats() that I couldn't find its complete trace in the given code
.
To be honest and showing the complete picture, he also gave a "not preferring" reply to me proposing the same idea at their site
https://github.com/mit-dci/utreexo/discussions/268

Quote

While the design where additions and deletions happen at the same time would help a bit, hash savings would be limited and wouldn't help with the real bottleneck, the proof size.

.
The fact that the proof size depends on the tree height (as stated in their paper and known in the literature), the resulting height in their runs is encouraging me to pursue on with my idea (not have to be cooperated with their project). Also the transaction graph analysis; replacement is analogy to address reuse. So if they prevented address reuse because it violates anonymity connecting user data together, the same Locality of reference will probably be inherited in my Merkle Tree which decreases the proof size by having many repeated internal nodes in the batch(will be fetched only once)

kcalvinalvinn

newbie

Activity: 1

Merit: 0

Quote from: NotATether on May 14, 2021, 03:31:39 AM

Code:

got to height 170000
got to height 171000
got to height 172000
got to height 173000
got to height 174000
got to height 175000
got to height 176000
got to height 177000
got to height 178000
got to height 179000
Block 180000 add 1655251 del 1231778 pol nl 423473 roots 10 he 1654444 re 1259415 ow 0 
 plus 0.00 total 204.56

I believe "total" is just the running time, but I'm unable to make sense of the rest of the data. Perhaps "add" and "del" represent the number of addition and deletion operations, or UTXOs that have been added or deleted? And maybe "roots" is the number of utreexo roots.

Can you explain what I'm seeing? I think I'll also take a closer look at the way Core itself is constructing its transaction database.

Block 180000 -> The current height it's at
add -> all the added leaves (TXOs)
del -> all the deletions that happened (STXOs)
pol nl -> pollard numLeaves. All the current UTXOs
roots -> Utreexo roots
he -> Height. Height of the Utreexo tree
re -> RememberEver. The count of all the leaves remembered through caching. This is stale and we have a newer caching algorithm.

Quote

EDIT: Apparently it is using so much resources that it managed to crash my machine and force me to do a power cycle.

Yeah that's a known problem https://github.com/mit-dci/utreexo/issues/264. cmd/utreexoclient (CSN) likely has a memory leak. All the binaries in cmd/ aren't really maintained that much as there's more focus on the btcd fork with utreexo https://github.com/mit-dci/utcd. None of the binaries are really user friendly yet. We're still finalizing the accumulator design so not really focused on user-friendliness of the binaries.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Quote

Perhaps "add" and "del" represent the number of addition and deletion operations, or UTXOs that have been added or deleted? And maybe "roots" is the number of utreexo roots.

I'm not part of the Utreexo project, I can just guess that those r the no of deletions & insertions performed, the next is the difference ie the increase in the no of UTXOs, root 10 probably no of roots (Utreexo is a forest of trees, so means after this block they have 10 trees), maybe the next 2 numbers r the max, and min height of those 10 trees (If so seems degenerate?)
.
If those r really no of deletions & insertions, this means my replacement suggestion will make a significant performance improve, about 16/(16+12) meaning 3/7 of the work will be saved.

However, we r not sure about the resulting tree structure?gives better results in one tree or in forest?
ie the no of operations is reduced to almost half, but the single operation may become more complex if the number of concatenated hashes that need to be updated increases. Also, the storage.
I think my tree will resemble a transaction graph that's why I'm studying it.
.

Quote

EDIT: Apparently it is using so much resources that it managed to crash my machine and force me to do a power cycle.

Bolton Bailey said the drawback in his Trie implementation is that it runs in several days for it involves many disk accesses, while Utreexo runs in 1 day; but the resulting forest is about 133G I think, the Trie is 63G.
.
Thanks for the info & the trying effort.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

@Shymaa-Arafat

I just ran the utreexo server overnight. The driver program itself isn't the best as I had to stop Bitcoin Core before running the server and then delete an incomplete pollardFile, but the client is also up and running.

Now I am seeing output like this:

Code:

got to height 170000
got to height 171000
got to height 172000
got to height 173000
got to height 174000
got to height 175000
got to height 176000
got to height 177000
got to height 178000
got to height 179000
Block 180000 add 1655251 del 1231778 pol nl 423473 roots 10 he 1654444 re 1259415 ow 0 
 plus 0.00 total 204.56

I believe "total" is just the running time, but I'm unable to make sense of the rest of the data. Perhaps "add" and "del" represent the number of addition and deletion operations, or UTXOs that have been added or deleted? And maybe "roots" is the number of utreexo roots.

Can you explain what I'm seeing? I think I'll also take a closer look at the way Core itself is constructing its transaction database.

EDIT: Apparently it is using so much resources that it managed to crash my machine and force me to do a power cycle.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

I should also mention that I found that it seems according to this article that Merkle Trees r officially implemented in the most traditional way!
https://codetrips.com/2020/08/02/modern-c-implementing-a-merkle-tree/

In there u can read ...

Quote

Rather obviously, this results in a highly unbalanced tree:
however, as the purpose of a MerkleTree is to hold data and ensure its validity against tampering or computation errors (and not, for example, search efficiency) this is a very convenient trade-off: re-balancing the tree after each (or even several) addition would require recomputing almost all of the nodes’ hashes (apart from the leaf nodes, which remain unchanged); given that this would likely require many cryptographic operations (which are notoriously CPU intensive) it would be a rather poor trade-off.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Quote

Hey, I'm one of the authors of the paper in the video that was linked. Thanks Shymaa for emailing me, it's cool that people are discussing these ideas

Thanks due to u for the paper first, and for the assuring reply this my first 6months in the crypto world.
And yes I like discussed ur paper a million times, this one in mixed Arabic/English
https://m.facebook.com/story.php?story_fbid=1409002919454152&id=100010333725264

Quote

In terms of this criterion of "keeping the transactions in order" from our paper, this technique doesn't necessarily do this.

-I don't know, maybe in most cases I like "trace" how a certain amount of currency got spent.

Quote

For this reason, I might expect the proofs to be larger, but ultimately I don't see why they should be larger than what UTREEXO currently does.

-The truth is I didn't exactly sense as welcoming reaction in the Utreexo gethub discussions?. I don't know should try again or find another "coding partner" who is familiar with the environment to help in the programming, testing,...etc.
.
Mile Thanks again, for the assuring reply 🙏
...
»»»One more thing, if this idea could trace how an amount of money got spent, do u think it could be used with some enhancements in introducing an analogy to NFT in Bitcoin? Huh

I mean maybe another object than UTXO called DUTXO for Distinguished that keeps track of the history of ownership in a tree (if it is to be divided or shared) Huh

FogOfChess

newbie

Activity: 1

Merit: 0

Hey, I'm one of the authors of the paper in the video that was linked. Thanks Shymaa for emailing me, it's cool that people are discussing these ideas.

In this particular formulation, I would say that some of the major strengths are low bookkeeping overhead (since this proposal doesn't seem to use a more complicated data structure than just a Merkle tree). In fact, it may be possible to do something like this in the UTREEXO codebase itself. UTREEXO is somewhat agnostic, as I think they've stated on their freenode channel, about how exactly the recombination takes place. It's possible for efficiency reasons they do actually try to do this "replacement" tactic - from my experience with their codebase, they keep their tree in a flat file where the location within the file indicates the location of the node in the tree. With this replacement, they would not have to do much movement of values within that database, so that seems like a benefit.

In terms of this criterion of "keeping the transactions in order" from our paper, this technique doesn't necessarily do this. For this reason, I might expect the proofs to be larger, but ultimately I don't see why they should be larger than what UTREEXO currently does.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Quote

think first we need to figure out how to keep a utreexo tree balanced, which I assume you want to go forward with, since if Merkle tree is used then every time a UTXO is replaced, more hashes need to be updated because it tries to keep everything under one root unlike utreexo.

Up till now I didn't modify the tree structure, just a simple BST.
»»»
To be accurate I think I should call it just Binary tree for now because the meaning of relative order could be tricky somehow.
««««
Utreexo on the other hand, uses a forest ( keeps a max of "log n" unconnected subtrees. So, yes maybe it will modify less hashes than my approach sometimes but not necessarily, since in the most frequent 1:2 case I do his 3steps in just 1 larger step.
.

Quote

AVL tree and red-black trees have similar performance except when you want to look up a specific UTXO (AVL trees are slightly faster). Also the extra bit doesn't take up much space overall.

As stated in the "suggesting Trie" paper, AVL Trees will probably involve more branch swaps with deletions which will(plus the extra complexity) distract more the original order we r trying to keep hoping for a better Locality of reference.
....

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: Shymaa-Arafat on May 04, 2021, 05:17:57 PM

Quote

Since most of the transactions on Bitcoin are 1-2, we can expect these children outputs to also split into two forming a chain which makes it no longer log(n) but rather linear.

.
Well, we r sure to have less number of deletions and no swaps in the tree branches (in addition to avoiding their overhead, keeping the locality of reference mentioned in the video)
However, I can't be quite sure about the resulting tree without trying on the old available data (benchmarks).
In the video paper they tried to benefit from locality of reference by reserving the original order of UTXOs in the tree/Trie, here I like keep the history of one coin or amount of crypto how it got spent in a subtree ( correlated hashes) will this lead to a better (or maybe worse) tree structure? Will this add a new piece of info (tracing money) that could be useful somehow (I don't know maybe tracing stolen money thru hacking, I don't want to rush things)?
.
I'm first like exploring did I miss something that makes this idea wrong or silly somehow?

I think first we need to figure out how to keep a utreexo tree balanced, which I assume you want to go forward with, since if Merkle tree is used then every time a UTXO is replaced, more hashes need to be updated because it tries to keep everything under one root unlike utreexo.

The video does mention that briefly (I don't recall seeing utreexo discussed here before) and proposes a few balancing tree structures - trie, red-black, and AVL.

IMO a trie doesn't make sense to represent a UTXO set with since those are intended for prefixes, and UTXO's have no ancestry.

AVL tree and red-black trees have similar performance except when you want to look up a specific UTXO (AVL trees are slightly faster). Also the extra bit doesn't take up much space overall.

The unfamiliarity is probably because not many people here have heard of or tried the utreexo method.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Quote

Since most of the transactions on Bitcoin are 1-2, we can expect these children outputs to also split into two forming a chain which makes it no longer log(n) but rather linear.

.
Well, we r sure to have less number of deletions and no swaps in the tree branches (in addition to avoiding their overhead, keeping the locality of reference mentioned in the video)
However, I can't be quite sure about the resulting tree without trying on the old available data (benchmarks).
In the video paper they tried to benefit from locality of reference by reserving the original order of UTXOs in the tree/Trie, here I like keep the history of one coin or amount of crypto how it got spent in a subtree ( correlated hashes) will this lead to a better (or maybe worse) tree structure? Will this add a new piece of info (tracing money) that could be useful somehow (I don't know maybe tracing stolen money thru hacking, I don't want to rush things)?
.
I'm first like exploring did I miss something that makes this idea wrong or silly somehow?

oryhp

member

Activity: 60

Merit: 89

Quote from: aliashraf on May 04, 2021, 07:21:49 AM

For transitions (caused by each txn in the block) that involve fewer inputs than outputs the spent nodes are replaced/updated with a Merkle Root while for transactions with equal number of inputs and outputs the nodes are updated to the new output and for the worst case when a transaction has more inputs than outputs the extra inputs should be arbitrarily chosen for deletion which is the most expensive operation for Merkle Trees.

I'm probably missing something. My current understanding is that a 1-2 transaction (1 input, 2 outputs) would replace the leaf of the input with a parent and two children, meaning you increase the depth of the tree by 1. Since most of the transactions on Bitcoin are 1-2, we can expect these children outputs to also split into two forming a chain which makes it no longer log(n) but rather linear.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

Quote

That said, the problem is the applicability of this idea because with current bitcoin core implementation, we don't keep track of the blockchain state, using Merkle Trees, explicitly.
On the other hand, proposals for UTXO commitment (using a Merkle Root somewhere in the bloc or on a side-chain, whatever) are being abandoned for a while; it is a total unfortunate, BTW, because without such a commitment there will be no light client option available for bitcoin.

Quote

And in that video it's mentioned that the UTXO size is 5GB which is already greater than people's dbcache limits of 500MB, 2 and even 4 GB and I think that's a reason why Miller's UTXO merkle tree might be a better idea since it's size-agnostic

.
Answering both in short points:
1-The paper in the video was published in Mar21, the Utreexo project is still going, so the topic is currently under research focus not abandoned.

2-Miller proposal was implemented in 2014 https://github.com/amiller/redblackmerkle,
published as an ACM notice
Miller, A., Hicks, M., Katz, J., Shi, E.: Authenticated data structures, generically.
ACM SIGPLAN Notices 49(1), 411–423 (2014),

but if u watched the video u'll find out that it did not give the performance improvement that deserves the book keeping of a Red-Black tree data structure as compared to regular BST binary search tree.

3-Easier from the video to understand that the problem that led to the 3 proposals or researches of a RBT from Miller, Forest in Utreexo, Trie in the video paper is the overhead of continuous excessive deletion& insertion of UTXOs (the tree branches get modified which may lead to a degenerate tree with longer paths, each time all internal nodes that contain the leaf hash inside their concatenated hashes must be modified)
.
-What I'm saying if u noticed that every output comes from an input don't handle them separately and "distract" ur data structure, just modify the values in place with the hierarchy of the tree almost the same.
.
I hoped for a community familiar with the problem, to find out if I'm missing something. Is my seemingly obvious easy solution was thought of before and rejected for a reason?or it just didn't cross their minds? Huh

.
.
.
Last note, the memory requirements was the original motive of having Stateless nodes/clients ... nodes that do not keep the full Merkle Tree/Forest/Trie, they just the proof path (called witness) of the UTXOs they need to verify. Each insertion/deletion modify the parents and siblings hashes along the way, and thus may affect other proofs especially if the deletion involved a swap of tree branches (watch the video) and we wish to minimize that.

aliashraf

legendary

Activity: 1456

Merit: 1177

Always remember the cause!

Quote from: NotATether on May 04, 2021, 12:04:49 PM

Quote from: aliashraf on May 04, 2021, 07:21:49 AM

Anyway, I count this proposal as one of the many efforts for helping with UTXO commitment and I appreciate it. @NotATeher's memory requirement analysis of this proposal looks somehow irrelevant to me because for a client that is supposed to represent/compare the state of the blockchain it is committing to, by a Merkle Root (how else, it could be done?), it is absolutely necessary to maintain the data structure either in memory or on the disk, what OP is proposing an algorithm for its optimization here.

The memory requirement was not intended to be for the merkle tree, in fact I explicitly said at the end that merkle trees are independent of the UTXO size, it only applies to the traditional database which is the way the inputs are currently stored (the dbcache).

So, now it looks even more irrelevant. OP is concerned about Merkle Tree specific operations like addition and deletion of outputs. Without having the tree as the data structure it'd be impossible to update a node with the root of a tree.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: aliashraf on May 04, 2021, 07:21:49 AM

Anyway, I count this proposal as one of the many efforts for helping with UTXO commitment and I appreciate it. @NotATeher's memory requirement analysis of this proposal looks somehow irrelevant to me because for a client that is supposed to represent/compare the state of the blockchain it is committing to, by a Merkle Root (how else, it could be done?), it is absolutely necessary to maintain the data structure either in memory or on the disk, what OP is proposing an algorithm for its optimization here.

The memory requirement was not intended to be for the merkle tree, in fact I explicitly said at the end that merkle trees are independent of the UTXO size, it only applies to the traditional database which is the way the inputs are currently stored (the dbcache).

aliashraf

legendary

Activity: 1456

Merit: 1177

Always remember the cause!

Good idea as I get it:
Using a recursive Merkle Tree data structure to represent the state of the blockchain helps with transition from the current state once a block is received:
For transitions (caused by each txn in the block) that involve fewer inputs than outputs the spent nodes are replaced/updated with a Merkle Root while for transactions with equal number of inputs and outputs the nodes are updated to the new output and for the worst case when a transaction has more inputs than outputs the extra inputs should be arbitrarily chosen for deletion which is the most expensive operation for Merkle Trees.

That said, the problem is the applicability of this idea because with current bitcoin core implementation, we don't keep track of the blockchain state, using Merkle Trees, explicitly.
On the other hand, proposals for UTXO commitment (using a Merkle Root somewhere in the bloc or on a side-chain, whatever) are being abandoned for a while; it is a total unfortunate, BTW, because without such a commitment there will be no light client option available for bitcoin.

Anyway, I count this proposal as one of the many efforts for helping with UTXO commitment and I appreciate it. @NotATeher's memory requirement analysis of this proposal looks somehow irrelevant to me because for a client that is supposed to represent/compare the state of the blockchain it is committing to, by a Merkle Root (how else, it could be done?), it is absolutely necessary to maintain the data structure either in memory or on the disk, what OP is proposing an algorithm for its optimization here.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

OK now I think I get it, this is an optimization in the structures used for verification.

From that linked thread it says this:

Quote

This proposal calls for a single additional hash to be placed in each block: the root hash of the UTXO Merkle tree. As such, it belongs in the HardFork Wishlist. However, it could also be included in an (optionally merge-mined) alt-chain. This will require additional computation for validating a transaction - the addition will be a constant factor, since each processed txinput/txoutput already requires (even in UltraPrune), an O(log N) update to a BTree table of the UTXOs. For now, the proposal simply describes a stand-alone data structure that can be derived from existing blockchain data.

As far as I understand about that, there is already a root transaction hash inside each block header, but we also want a root hash for the UTXO merkle tree so that we can dispense with the current in-memory database of coins. This will save a lot of and maybe even dispense with the need of the dbcache all together (if all it does is store UTXOs).

I think this is similar your proposal of replacing the entries (of the coins memory database) of spent inputs with their unspent inputs. I'm not sure how the current implementation validates an input but I assume it would be something like "if this input is already in the database then it's being spent again".

I guess trying to change it to something like "if an input is NOT in the database (except when inserting UTXOs to said database) that means it has been replaced so it's being spent again" could work except the problem is the dbcache is only a limited size so if we run out of memory and have to off-load unspent inputs to disk, it might accidentally classify those as being spent again.

And in that video it's mentioned that the UTXO size is 5GB which is already greater than people's dbcache limits of 500MB, 2 and even 4 GB and I think that's a reason why Miller's UTXO merkle tree might be a better idea since it's size-agnostic.

Shymaa-Arafat

full member

Activity: 228

Merit: 156

No, I meant the Merkle Tree/forest or whatever data structure used for Stateless Clients
(like in the Utreexo project and other efforts like
https://youtu.be/HEKtDILPeaI
and even Miller's Red-Black trees idea that started here very long ago
https://bitcointalksearch.org/topic/storing-utxos-in-a-balanced-merkle-tree-zero-trust-nodes-with-o1-storage-101734)
.
Thanks for posting the images
& Let me emphasize that the orange circle is accidentally over "0ca81e", it just means the original parent node of the 1st supposedly deleted leaf in traditional methods.

Topic: Replacing old UTXOS with new one in place: an idea/suggested complexity reductio (Read 529 times)