Author

Topic: Research Proposal to classify UTXOs into different groups (Read 478 times)

full member
Activity: 228
Merit: 156
That seems like a lot of work to save 0.5G of RAM... given that even single board computers like Raspberry Pi's are coming with 8Gig now... and most desktops/laptops are spec'd with 8.
I understand u r more familiar with implementation aspects than me, but if this is the case why projects like Utreexo & accumulator,... say the reverse, and r widely accepted with even more grants???
Theoriticaly, the idea could be shifted to cache Vs RAM, although I don't think there's caches as large as 3.5G; the improvement here is with cache fetching from a smaller set the Hit ratio is expected to increase.

I understand you're trying to find resource savings... but what happens when someone tries to spend one of these UTXOs that has been taken out of the main set because it was deemed unspendable? Huh
He will simply get it from disk where they're stored in a probably similar smaller data structure. In fact if the search time is the dominant factor not the space here, u can keep all in main memory just partitioned

How do you even determine they are unspendable in the first place?
I'm discussing an abstract idea here, if most statistical sites r able to do that continuously online then it is feasible.
Yet, there r known guidelines like dust values and publicly advertised burn addresses for example

Any node that received a transaction with a "missing" UTXO would then need to dig through these other sets to double check that it isn't included there. Which seems like a potential DoS vector... ie. create and repeatedly send a bunch of transactions that contain "missing" UTXOs, all nodes suddenly have to start repeatedly loading and scanning other sets to determine that the UTXOs really don't exist... and any benefit of not having them already loaded in memory is lost.
When they're needed (attack scenarios, or revealing a double spend for example), u bring them to main memory. In this case u lose the memory saving, but still benefit from the time saving of partitioning idea.
Or, wait a minute, u don't like binary search the UTXO value for example; u just follow the link right?
The proof/witness length (no of nodes) idea is only for stateless nodes.

Is this the reason for the less enthusiasm?the improvement is slighter for full nodes
-& Will it involve changing a built-in code, even in the most minimal way that's not accepted???
-I mean if it involves changing the IBD  process to store UTXOS in different sets it is not accepted or welcomed?
-Utreexo & other Accumulators just deal with hashes without changing the original set, just pointing to it.

-If this is the case, then I could keep the same rule, and the idea becomes more useful to stateless nodes and similar arrangements, when the partition is applied to the hashes.
HCP
legendary
Activity: 2086
Merit: 4363
That seems like a lot of work to save 0.5G of RAM... given that even single board computers like Raspberry Pi's are coming with 8Gig now... and most desktops/laptops are spec'd with 8.

I understand you're trying to find resource savings... but what happens when someone tries to spend one of these UTXOs that has been taken out of the main set because it was deemed unspendable? Huh

How do you even determine they are unspendable in the first place?

Any node that received a transaction with a "missing" UTXO would then need to dig through these other sets to double check that it isn't included there. Which seems like a potential DoS vector... ie. create and repeatedly send a bunch of transactions that contain "missing" UTXOs, all nodes suddenly have to start repeatedly loading and scanning other sets to determine that the UTXOs really don't exist... and any benefit of not having them already loaded in memory is lost.


-To be sure of my understanding, this can NOT be done as a block batch, because it does happen that an input is generated & spent at the same block, right?
-I know it's better to wait 6 blocks, but I understand it's not frobidden not to?
Correct. You can have a situation where UTXOs are generated and then spent in subsequent transactions, all in the same block.
full member
Activity: 228
Merit: 156
Let me ask again, since u r obviously running a full node:

- I understand the UTXOS set is currently~4G that u have to store in main memory, wouldn't it be better for u if u can like purn a portion of it to secondary storage (disk/hard drive)?
-Initially this could be done at IBD step, then maintained periodically by each block in a time to be tuned.
-The starting reduction size for burned is ~149k, non-standard~420.5k, dust ~ 6-9m ( if u have more accurate statistics pls tell me), coinbase I will try to deduce an estimation from 2019 curve (again tell me if u know an accurate number)

»»» All in all, we could estimate not less than 15% reduction of the main memory size ~ 0.5G
- Wouldn't this be better for u as a full node, or even as a miner?
.
-& By the way, we can save the if statement for the coinbase UTXO since it is always the 1st in the Block
.
.
.
.
2- I got the impression from previous answers at the Utreexo GitHub, or at least that's what appears if I'm not hacked by man-in-the-middle attack, that Storage is the main dominant factor they r trying to optimize much more than speed.
-Now I get the impression that speed matters to u, and probably to all miners; it is reflected in proof size for non full nodes ( how much data they have to retrieve to verify). However, talking about time complexity, just to explore ur practical opinion, would my first dropped on the way idea of Replacement or insert & delete in the same time be useful?
-To be sure of my understanding, this can NOT be done as a block batch, because it does happen that an input is generated & spent at the same block, right?
-I know it's better to wait 6 blocks, but I understand it's not frobidden not to?

legendary
Activity: 4424
Merit: 4794
and now you add in more buzzwords.. divide and concour
i get it splitting the datasets

but having all datasets all in single file all on harddrive is not going to save any harddrive ware and tare
it actually ends up costing more cpu resources and more file opening operations

the thing is while for instance putting the coinbase into the lost forest for the first 100 heights
your taking a coinbase from previous -100 height out of the lost forest and putting it into the young forest.
and also taking a coinbase from an even older height out the young forest to put into an old forest

thats 3x more operations happening to every coinbase before they are even spent. just to shuffle then around by age
and then when a coinbase is spent. you have to if statement its age to work out which forest it might be in. to then seek the forest to make sure its a utxo still
so thats like a dozen operations

rather than just get height. find height check if coinbase is still utxo... 3 operations

same goes for all UTXO .. newly confirmed young forest. later on check utxo at older age to put them into old forest
then when being spend check age to find out which forest to look in
....
it seems you spent too much effort trying to push buzzwords and metaphoric forests without running scenarios of utility and computation to see if it actually results in efficiency

many people done the same with blockchains. thinking all databases in the world should use blockchains. without realising some database structures dont need them and also would waste more time/resources to have them. but they still try to push them 'coz its blockchain'


think about it. othe parts of your separate datasets include a lost forest of burn addresses
to separate out the burn addresses to put utxo into a lost forest forever. you have to put every transaction of every block through a check procedure to see if its a burn address.
this adds like (if 10burn addresses) 30 cpu operations per tx(2utxo) per block just to figure out if the utxo is a burn address
1:  for 0 to X of txcount of new block
2:    for 0 to X of outputs of tx get address
3:         is it 1burnaddress1
..          ....
13:       is it 1anotherburn10
14:            if yes
15:               put in lost forest
16:            else
17:               put in new forest
18:            end
19:       end
20:  end

where by if each tx has say 2 utxo its done 1->13 twice and 14&15 or 16&17 twice so thats 30

but without your is burn check it would just add utxo to dataset . 1 operation
..
and thats without being 100% sure if it truly is a list of 10 burn addresses that you have manually listed. or just 10 vanity addresses that took alot of brute to make
..
so again without concentrating on making a post to mention a buzzword. can you specifically lay out the forests in the format of root tree branch leaf fruit formation
eg
blockheight_txid_output[X]_..

then using basic pseudocode of operations and if statements to count computations required
and run some scenarios, numbers, examples of computational saving/increase to actually run your idea

.. heck you havnt even made it clear what age the threshold of young/old would be. you linked some people saying 15ish months and then you said elsewhere younger threshold.

atleast put some parameters in and examples and some math. and not just buzzwords
(i can put examples and you can argue the examples. but thats because your not giving any defined thresholds, examples yourself.. as thats where the misunderstanding is)
full member
Activity: 228
Merit: 156
It's clear that u r not participate in my research with ur conventions, so this LAST clarifying is of absolutely no benefit for me except the teacher explaining mode inside me.

Quote
EG instead of
seek 3000 inputs from set.
 if all valid.
   delete 3000 records(spent inputs)
   add 6000 records(new unspent outputs)
end

your idea is
read 3000 and split them 2700 in 1 and 300 in other
seek 2700 record from A
 if all valid
    delete 2700
    add 5400
 end
seek 300 from B
  if valid
    delete 300
    add 600
 end

Well what u r describing is called Divide-and-Conquer approach, and it's a well known to give an almost guaranteed improvement in performance if u can solve ur larger problem as more smaller problems.
.
Let me describe it this way...
-u have a big library of like millions of books, some u already have read others not yet.
-u r arranging books in shelves according to topic or some other issue, but this is not our point here. You r going to choose a book to read everyday on the condition u haven't read it before(AKA Unspent)
-So u decided to mark all unread with a RED sticker, whenever u want a book u search for it on the RED ones only (arrange them in anyway u like, but most people agreed that a TREE is the better arrangement)
.
-What I'm telling u that observations tell that some of those unread u r probably not going to read them ever for some reason boring for example ( those similar to burned UTXOs), So why don't u color their sticker with a different color, say GREEN, instead of popping into them unnecessarily while u r searching for everyday book?

-Also, some u may read only in summer vacation while u have longer read hours, so why not color them BLUE? This way in summer u will search only in this small blue subset, in rest of days u will remove unnecessary search because u already have a large amount to search in & it's better to remove the unnecessary books that u r not going to read in these busy days?
.....
-And so on, if u can divide the RED more to regular day reading & end reading it would be better (less search time corresponds to less proof size in UTXOs)
legendary
Activity: 4424
Merit: 4794
very true
but it seems you are missing its fundemental usage. and instead deciding you think you found something new that should be used everywhere. much like how many are overusing 'blockchain' in places that dont need it.

for instance merkle trees its the same analogy as 'ancestor' like grandparent, parent, child. is like root, tree, branch

how ever when you say you want to 'tree' and 'forest' all the burn addresses
however in reality. there is no relationship of all the burn addresses.
you would have to add alot of blacklisting of random addresses manually into code to then have them rooted to then tree their utxo... but then you are making long lists and still not being 100% sure if the addresses you manually added are truly burn addresses and not just lucky vanity addresses

you also say you want to "tree" transaction hashs
sorry but transaction hashes have no relationship to other transactions direct

yes blocks have relationships with the transactions within blocks.
but a tx in one block has no relationship with transaction in another block. because once spent. from one block to become a utxo in another block. that relationship is then cut. (the previous utxo is spent thus no longer in a utxo set, thus no tie)
transactions within a block have no direct relationship with each other. but do have a relationship with the parent block.
thus your not 'treeing' a transaction hash. but are 'treeing' a blockheight

...
anyway
getting to the point having your 2 forests in one file. becomes a wasted efficiency as having to open and read the file. is still open and read the file. however truly separating them into for instance ram vs hard drive. might have some efficency gain.
but now we circle back to the efficiency loss of identifying and then splitting blocks of transactions into 2 sets

EG instead of
seek 3000 inputs from set.
 if all valid.
   delete 3000 records(spent inputs)
   add 6000 records(new unspent outputs)
end

your idea is
read 3000 and split them 2700 in 1 and 300 in other
seek 2700 record from A
 if all valid
    delete 2700
    add 5400
 end
seek 300 from B
  if valid
    delete 300
    add 600
 end
seek current block minus 80k(100 in your case)
 move any utxo found at threshold from one set to other

see how many more operations at the most leanest. and thats without your manually added blacklists. nor any other groupings

..
but anyway
organising utxo in relationships of
blockheight
                |\_tx1
                |      |\_output1
                |       \_output2
                |
                |\_tx2
                |      |\_output1
                |       \_output2

works better for many reasons than your
"only their hashes r linked (connected) together in some manner using additional data structure usually called Merkle Tree"
"txhash"
           |\_address
           |\_address

for instance
having a blocks coinbase separate from the same blocks transactions. is not a tree for the coinbase
because the coinbase has no relationship to anything. so there is no tree

trying to form relationships due to some random social analysis of spending habit, requires more cpu resources to create these new imagined relationships. like the age threshold. or the burn address reference

the amount of cpu computations needed then outweigh any gains from having grouped datasets

in short. you appear to be wanting to make merkle trees for the sake of thinking merkle trees should be in all datasets. for the sake of WOW merkle trees are cool
much like many people before you think all databases should be blockchains

..
unless you can express the relationship of the data. in a way that can show clear usage that does not require huge CPU computation to form these relationships/adoptions/divorces(replanting/seeding/deforesting forests)
then please dont just throw around the word tree's for the sake of it

EG organising by blockheight. is easy to see the relationships. and clearly doesnt need much computation to group the child(branches) together to the parent(tree)to the gransparent(root)
full member
Activity: 228
Merit: 156
if you stop concerning yourself with buzzwords such as trees and forests..
.
Merkle "Tree" is a basic & mandatory keyword in this research area
If this is ur opinion Sir then I do not think this discussion will be ever productive, don't waste ur time reading it

legendary
Activity: 4424
Merit: 4794
if you stop concerning yourself with buzzwords such as trees and forests.. and actually speak using normal words like datasets, relationships, linked together

then it might make things clearer

because although you mention a single database file. you then tangent to say separate sets.
the logical assumption then becomes when you mention not having to look at the old forest (old coin set) than these 2 groups should be treated as completely separate
where logically one is viewed constantly and the other is not.

it seems what you mean by coinbase stays longer. i presume unlike the people you linked in your other idea's you are prefering to have a 100 block group of trees and everything else in another forest set block0-687k

which is not going to cause much efficiency gain. because the <1% is still in your same database. thus taking just about as much time as if it was in the 1% set or the 99% set to be read
(it takes same operational speed to find tree root: block 687703 as it does to find tree root: block10)

i tried to skip a few steps ahead and predict extra efficiencies. but hey seems ill have to step backwards again. and mention some other flaws.

flaws
1. all the data in one database on hard drive (hard drive slower then ram)
2. data sets of 1% 99% not going to do much (both hard drive so no gain really)
3. extra code needed to dig up the tree and replant which is more wasted resource
4. a transaction with new utxo in block 687700 has NO relationship with utxo in block687701
5. a 'burn address' has no relationship with another 'burn address'

so your buzzwording of trees for utxo hash linking are meaningless unless you are talking about having trees where the tree root is a block number and branches as txid and leaves are the spents(inputs) and fruit is the new utxo(outputs)
where by your then pruning the leaves and fruit all the time

because you cant really 'tree' together uxto address in 1 tx to another utxo address in another tx
for instance the 10 main burn addresses have absolutely no relationship with each other
there is no tree. no taint no pattern to link them..
no clode can identify them meaning you have to manually select them and tag them manually as burn addresses
yes you can tag/blacklist them as burn utxo's to ignore and never filter for. but you seem to be wanting to plant tree's everywhere by just using tree buzzwords without understanding it or explaining it.

its like you learned the new word of the month blockchain. then learned merkle tree and now you want everything to have a blockchain or merkle tree in it even if not needed. just for buzzwords sake of looking smart

anys ways getting back to the point .
having the young forest of say your 100 blocks(coinbase age cant spends)
you are not saving any resources because when a coinbase is 100confirms. you then have to move it from your old forest to you young forest because its now spendable

basically if the newest blockheight now is 687704 although your not putting that blocks coinbase in the young forrest as it cant be spent now.
what you are doing is
putting blocks 687704 in the old forest as not spendable
putting blocks 687604 into the young forest as now it can be spent

but soon that 687604 if not spent will age out and be deemed inactive and be put back into the old forest again

all you are doing is wasting more resource moving and replanting trees

i tried to be more reasonable with a 18month threshold for inactivity. and showed you the results of any efficiency gains/losses(64%/36%)
and i skipped ahead few steps by imagining extra efficiencies like not using the hard drive as much for the new forest by having the 2 forests separate. one ram stored one harddrive stored
because just having 2 sets in one database does not really change much compared to having one hard drive vs one ram

but even so. after trying to add much efficiency. your still losing that efficiency by the extra operations of replanting the tree's between the forests and learning the ages and other stuff..
whether in one database on hard drive for 2 sets. or ram&hard drive for 2 sets. the very factor of having 2 sets works against you when swapping in and out trees between the sets

EG if utxo set has 85million utxo's
splitting the sets = more operations to check them and plant them into correct set
but the hard drive file then ends up still being 85mill utxo. but now slightly more bloated and more cpu intensive to read
full member
Activity: 228
Merit: 156
Quote
I understand current idea of this topic is 2 datasets.
some call them tree or forrest or other buzzwords.. but they are datasets(sets of data)
.
There is really some misunderstanding here, I'm sorry I have a lot of headache at the moment to read all what u wrote, if what I will write now didn't clarify it, I will read all of it & trace where the misunderstanding is...
.
The UTXOs reside in their one database as usual, only their hashes r linked (connected) together in some manner using additional data structure usually called Merkle Tree
-Ethereum for example use a little different data structure (way to connect them) called Tries (as opposed to tree)
-Utreexo project keep newer ones in small trees before joining the Big tree (what is called a forest of trees) hoping they will go away before the join step(got spent while they r in the small tree with less height)
-In this they use the educated guess (called heuristic) that those who stayed long will probably stay even longer, those who just came will probably quickly go.
.
-What I'm saying is not just the timing old or new, we could use a more better guess (coinbase UTXOs stay longer)or even things we r almost sure about (burned UTXOs r going to stay forever) and "link" or "connect" the hashes of those together (not physically store their values) in separate trees. This way they the faster ones connection network u may say will be less complicated.
.
I hope this is better
legendary
Activity: 4424
Merit: 4794
i understand current idea of this topic is 2 datasets.
some call them tree or forrest or other buzzwords.. but they are datasets(sets of data)

i was predicting that within a few weeks you would come up with a new idea of a 3rd.set where you split the data up further (separating blockaddress from the value, txid, etc)

but anyway sticking with current idea.

if you split the utxo set by just age.. , say either:
18month age. thats (at current set) 64% >18month.. 36%<18month
12 month age. thats (at current set) 69% >12month.. 31%<12month
6months age. thats (at current set) 78% >6month.. 22%<6month
1day age, thats (at current set(99%>1day..1%<1day)


so while some of your links shown others people looking at more then a 12 month area. i assumed 18month

so lets go with this 36% ram utility/64% ram efficiency model1
so that there is less hard drive ware

lets assume transactions are 1in 2 out
lets assume 10% old coin spend 2
lets assume 3k tx per block(1 in spent 2 out unspent(3000in 6000 out))
..

just the very fact that you have to run extra operations: just to separate them into sets(forests)
(im dumbing down the operations to pseudocode(dont knitpick))

Code:
 create temp batch list 'ram' (spends under 80k coin age, new utxo)
  create temp batch list 'hd' (old spends in current block, aged out spends from ram set)
    for each tx of block (0 to ~3000)
         if input < 80k confirm
             list 'Ram' [check then del record](usually 2700 in list(90%))
         else
             list 'Hd' [check then del record](usually 300 in list(10%))
         end if
         output(s) list 'ram' [add record]
    end loop

validate list 'ram' to ram(young forest) utxoset(2700 reads)
delete records(2700 writes)

check ram utxoset for unspents in oldest block (now minus 80k (aged out and need to be in hd dataset))
add to list 'Hd' [add record] (about 3456(64% of 90% of 6000))
remove(3456 old trees) from ram utxo(young forest)

open hard drive utxo file(old forest)
validate list B (300 reads) to hard drive utxoset
delete records(300)
add aged out unspent (write 3456 old trees)
close file

add new unspent outputs to ram utxoset(~6000 writes young forest)
purge lists 'ram' & 'hd'

as oppose to
Code:
open hard drive utxo file  
    for each tx of block (0 to ~3000)
       check input is in file
       del input
       add 2 outputs
    end loop
close file

what you find is allthough a block is 90% new 10% old utxo on average spent and 6000 new unspent
eventually in 18 months there might be 36% of that 90%(3456) still unspent that need to be shifted from ram to hard drive every block(shifted from one forest to the other)
so your still having to access the hard drive and open the file every block.

yes i can see that even though a hard drive does previously
  read 3000
  write: del 3000 add 6000
which becomes
  read 300
  write: del 300 add 3456

ram becomes
  read 2700
  read 3456
  write: del 2700 del: 3456: add: 6000

the amount of hard drive read reduction is 90%
the amount of hard drive write reduction 60%
the amount of ram read increase
the amount of ram write increase
but the amount of operations have increased from like 12000 to ~27000 operations
so its now 2.3x cpu utility to do this

trying to save hard drive ware (single forrest ware) but at the cost of more cpu burn. is not a great idea

.. i see other flaws. but i wont get into those just yet
full member
Activity: 228
Merit: 156
Quote
Ex.
This burn address
https://blockchair.com/bitcoin/address/1111111111111111111114oLvT2
contain alone more than 2¹⁷ UTXOs residing in the Merkle Tree, clearly moving them to a separate tree will yield a better performance
If this may not because they're written in OP_Return (honestly I'm not sure), this 2¹¹ r definitely going to
https://blockchair.com/bitcoin/address/1CounterpartyXXXXXXXXXXXXXXXUWLpVr

I think the improvement from just separating burned UTXOs, coinbase UTXOs to like 3 trees will be obvious at once whatever the Merkle structure is Utreexo forest (to 3 forests) or even a regular Merkle Tree into 3 trees to see money on the table as the expression says(money here means less proof sizes) so people can start hearing u. Then we go enhance more

Quote
Right now I can only think of coinbase scripts, OP_RETURN scripts and everything else so that makes three types.
-For ex. the normal 1:2 Transaction has different expected lifespan for each output as they said& explained in the quoted Comments I added in the original post
.
Quote
from what i can fathum from the topic creators idea (from many forum subcategories posts)
is instead of this one bloated database

have 2-3 databases.
I'm not saying 3 databases, the values database as it is, the hashes in the Merkle Tree r what I'm talking about. Instead of linking the UTXOs hashes to 1 tree, I'm suggesting say 3 trees (or maybe one weighted tree in another design, but this is the straight forward one). So that those with longer lifetimes, could be forever if burned, don't burden the tree of common faster spending ones who live for block or two (I claim this way the tree will have much less height, thus the proof/witness will contain less hash values)
.
For example the Utreexo team could add just 2 if statements to remove away the 2types (burned & coinbase), just as a test before building 2 more trees, to check the improvement in the proof length of the common group of UTXOs
legendary
Activity: 4424
Merit: 4794
First of all it is important that we actually have the categories of UTXOs named beforehand, which boils down to classifying them by script. Right now I can only think of coinbase scripts, OP_RETURN scripts and everything else so that makes three types.

My question is how will you optimize UTXO storage for a specific kind of script, e.g. making OP_RETURN UTXOs more compact? Or more specifically, where in the merkle forest will this try to place each of these three script types (and whether there are more types you have in mind to categorize with)


the standard bitcoin-core UTXO(chainstate db) is usually laid out like
EG
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
<------------------------------------------------------------------------------>
                                           |
                                         TXID
01 04 8358 00 816115944e077fe7c803cfa57f29b36bf87c1d35 8bb85e
<><><--><><--------------------------------------------------> <---->
  |  |      |      \                  |                                                       |
  |  |    value   |           address                                            blockheight
  |  code       tx type
  |              
version

from what i can fathum from the topic creators idea (from many forum subcategories posts)
is instead of this one bloated database

have 2-3 databases. mainly
1. utxosets over 80k blocks ago(older than 18months) block 0-607600
2. utxoset under 80k blocks ago(less than 18month old) block 607601-687601

where by its deems old hoarded coins older then 18months are less likely to move soon
and so not needed to by sifted through every time

his next point, adding in other topic idea of utreexo
is to organise the records.

i presume
blockheight that then branches off the transactions within the blockheight. and then branch off the utxo of the transaction

i am going to predict his next idea in a few weeks is to have 3 databases
whereby its still 2 databases of under/over 18month
but instead of having the txid. ver,code,type, value, address, blockheight
its simply
blockheight:address:index

where that index then seeks a third database for the txid ver,code, type, value

so that when checking if an input is actually a UTXO it just seeks if the address is in the blockheight
and if found then seeks the extra data in the 3rd database
where blockheight+index finds the corresponding branch for details of the utxo

..

in short database 1. is always open(in ram)
whereby with only holding 80k blocks its very lean
4byte blockheight* 80k
20byte address * ~3000 *80000
2byte index *3000 * 80000

the other 2 databases are stored on harddrive and are only opened when needed
thus reducing the amount of ram used by not storing all blockchain of utxos (EG 10m instead of 80mill utxo)
not using as many hard drive reads. thus less wear on hard drives

the problems with this:
utxo's can be formed and spent in the same block
though its more organised to find a tx by its blockheight its falling foul of errors of duplications and other issues
some people still re-use address and so identifying each utxo idependantly is crucial

as for other methods of treeing utxo together. well most uxto are independant of each other. you cant really cluster transactions together based on whatever social analysis topic creator tries.
even spends originating from exchanges. once distributed you cant really code if statements to know if one utxo is going to be hoarded for 2 years or spend in less time
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
First of all it is important that we actually have the categories of UTXOs named beforehand, which boils down to classifying them by script. Right now I can only think of coinbase scripts, OP_RETURN scripts and everything else so that makes three types.

My question is how will you optimize UTXO storage for a specific kind of script, e.g. making OP_RETURN UTXOs more compact? Or more specifically, where in the merkle forest will this try to place each of these three script types (and whether there are more types you have in mind to categorize with)
full member
Activity: 228
Merit: 156
1-First the main aim of this research is to reduce the proof/witness sizes used in verifying UTXOs either by making the Merkle structure more balanced (ie less proof path from root to the leaf UTXO), or by imposing some locality of reference in the way they're arranged (ie to make the group of UTXOs needed to be verified in one transaction having a lot of common nodes in their proofs, like for ex. wishing they all r adjacent failing in one subtree)
.
2-The idea of classifying UTXOs into different types is not purely new or have never been thought about before, Cardano have 4-different types in its UTXO-based mode( it has 2 modes)
https://t.co/uX70PGjPhd?amp=1
.
3-The intuition is also not quite strange or "weird" to Bitcoin UTXOs if u look at this Stanford Dec2015 report poster(the complete PDF available online), where u can see 3 histograms drawn.
https://i.stack.imgur.com/IwyO4.jpg
.
4-The median figure here 2017 shows almost 2groups the linear ascending line, and the adjacent to the X-axis group
https://i.stack.imgur.com/RNUAu.jpg
& Another paper in2017
https://i.stack.imgur.com/JFugn.jpg

.
5-My own observations of the very long lifespan of coinbase UTXOs (reaches 10⁵ in many cases, with the least min value=101) was the first motivation for me to check & dig more on the subject
https://m.facebook.com/story.php?story_fbid=1432124977141946&id=100010333725264

-Then follow more notes about the original Utreexo curve,
https://i.stack.imgur.com/HYNSi.jpg
different kinds of transactions.. reward-accumulate(called merge-mine in some paper), reward-distribute, burned UTXOs
https://i.stack.imgur.com/TrBYL.jpg
Ex.
This burn address
https://blockchair.com/bitcoin/address/1111111111111111111114oLvT2
contain alone more than 2¹⁷ UTXOs residing in the Merkle Tree, clearly moving them to a separate tree will yield a better performance

5-Then some comments of colleagues in the group here
https://bitcointalk.org/index.php?topic=5342691.new#new
Quote
Sometimes, depending on the amounts available, and the values of the outputs, it might be possible to make an educated guess as to what happened. Sometimes that educated guess will be right, but it's always possible that the guess is wrong.

Quote
P2WPKH ┌── P2PKH (different type, higher possibility of being someone else)
       └── P2WPKH (same type, higher possibility of being from same wallet)
»»»»»an Update is to handle dust UTXOS
https://bitcointalksearch.org/topic/are-those-non-btc-uses-of-utxos-still-exist-also-dust-ratio-5347545

.
6-All in all, I suggest analyzing all transaction types to put some rules (sequence of if statements/case/switch/.... whatever the underlying programming language) to handle each type separately
.
A draft pseudo code would be

//determine TX& UTXOs kind
Kind=Analyze(TX);
//for all outputs j
Ins(TX.out[j], Tree[kind]);
/*put the UTXOs in the appropriate tree*/
Tree[kind]. modified=TRUE;
/*this is to update this tree root in any cached proofs*/

//for all inputs i
Del(TX.inp);
/*the delete depends on creation kind*/

.
7-First thoughts of handling could be:
 -Different tree/forest for long living UTXOs so their longer lifes don't increase the height of the faster spent ones tree, or one tree with different weights (a weighted tree have longer paths for less frequently used leaves, here we could have longer path for longer stay)
-Try to guess burned UTXOs & put them in a separate tree.
-Also, try to put UTXOs that r likely to be spent together adjacent in the tree (like rewards of same person he will probably accumulate; miners of same pool from previous observation if possible, since he will probably distribute)
.
8-More ideas about the underlying Merkle structure itself, would be better to use ideas from formal data structure literature?
-Buffer Trees (a batch of del&ins is done altogether to enhance performance, isn't that what we do on a block/transaction batch), Van Emde Boas layout ( a threshold level "d" divides the tree to keep a worst case height, usually square root of no nodes)
https://youtu.be/V3omVLzI0WE
-Finger Trees
https://en.m.wikipedia.org/wiki/Finger_tree
https://youtu.be/xVka6z1hu-I
.
9-Finally whatever idea becomes all complete, we must check any security vulnerabilities or possible resulting attacks becoming easier.
.
-More of my thoughts & comments r here
https://mobile.twitter.com/ArafatShymaa/status/1403246321956438017
https://mobile.twitter.com/ArafatShymaa/status/1402465971147841537
Hoping for brainstorming or participants ....

Jump to: