Topic: Using compact indexes instead of hashes as identifiers.

legendary
Activity: 1176
Merit: 1134
A side note.

By combining each read-only bundle with balances for all the addresses, it becomes possible to validate any consecutive run of bundles by iterating through all the tx in each bundle and updating the balances.

Regardless of where the iteration is started, all paths arriving at the same place will produce the identical ledger.

Using the method described above, every ledger balance can be verified. Kind of like auditing quarterly/annual financials: once the books are closed, only historians need to worry about each individual tx.
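To make the replay idea concrete, here is a minimal sketch (not iguana code: the txentry/bundle layouts, the fixed address count, and the sample amounts are all made up for illustration). Each bundle's entries are applied in order to a balance table; however the range of bundles is split across iterations, the end result is the same set of balances.

Code:
#include <stdint.h>
#include <stdio.h>

#define NUM_ADDRS 4

struct txentry { uint32_t addrind; int64_t delta; };            /* one vin (-) or vout (+) */
struct bundle  { int numentries; const struct txentry *entries; };

/* apply every entry in a range of consecutive bundles to the balances */
void replay_bundles(int64_t balances[NUM_ADDRS],const struct bundle *bundles,int first,int last)
{
    for (int b = first; b <= last; b++)
        for (int i = 0; i < bundles[b].numentries; i++)
            balances[bundles[b].entries[i].addrind] += bundles[b].entries[i].delta;
}

int main(void)
{
    const struct txentry b0[] = { {0,5000}, {1,2500} };
    const struct txentry b1[] = { {0,-5000}, {2,4900} };        /* 100 satoshi fee */
    const struct bundle bundles[2] = { {2,b0}, {2,b1} };
    int64_t balances[NUM_ADDRS] = { 0 };
    replay_bundles(balances,bundles,0,1);                       /* any split of 0..1 gives the same result */
    for (int i = 0; i < NUM_ADDRS; i++)
        printf("addrind %d balance %lld\n",i,(long long)balances[i]);
    return 0;
}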

With the data so much more compact, the urgency is not as great, but at some point it will be nice to simply accept a set of starting balances for a given bundle along with unspents and sync from there.

I am hoping you know a way for the balances of a recent bundle to be trustable without having to replay all the bundles. Maybe using aggregated signatures or Pedersen commitments it is possible to prove that the balances of the latest bundle indeed match the sum of the changes from all the previous bundles? If that is possible, then the "only" thing that would be needed is a way to know that the contents of a bundle can be trusted.

Maybe just a set of hashes that are propagated would be enough for that, but I am sure there are more secure ways. A large amount of processing would be justifiable to create such a verifiable ledgerchain.

Thanks for all the advice!

James
legendary
Activity: 1176
Merit: 1134
Something we must fight against if we want Bitcoin to retain its utility in the future. No one uses a system of money with fungibility and privacy as bad as a bitcoin that has lost its properties.
Of course, and I don't believe anything I have done would cause things to get any worse. I am just trying to create a scalable framework based on bitcoin that can also be privacy enhanced.

Quote
We do not store the blockchain in a database in Bitcoin Core.
Ok, so storing the raw network data in a raw file isn't a database; my point was that almost all RPC goes through the DB, and I think it is fair to say that for most people the RPC path is what gets used.

Quote
txindex=1 is not very compatible with scalability; you end up with an ever-growing index. It makes some operations faster, but at a considerable cost.

In Bitcoin Core, reindex spends a fair amount of time rehashing things to check data integrity, because later it is assumed to be true. It could easily be made very fast-- but at the cost of additional indexes. Right now the marginal cost of a full node wallet that supports arbitrary key import is about 65GB and rapidly growing. (Pruned vs non-pruned plus txindex).

Certainly txindex=1 adds overhead, but with the method of creating immutable bundle files, each with internal hash tables to find everything and with all unspents to the same destination kept in a linked list, the overhead is contained within each bundle. I don't see why anything ever has to be verified again once it has been verified and put into a read-only file. For the paranoid, a hash of each bundle file could be computed and checked before using it.

The tracking of the status of unspents is the only global data that needs to be updated, and each entry is written only once.
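A rough sketch of what such a layout could look like (the struct names, field widths, and hash-table size are hypothetical, not the actual iguana structures): the bundle file is written once and only read afterwards, while the spend-status table is the only mutable global state, with a single write per unspent.

Code:
#include <stdint.h>

#define NUM_SLOTS (1 << 20)

struct unspent
{
    uint32_t txidind;        /* canonical tx index within this bundle */
    uint16_t vout;
    uint64_t value;          /* satoshis */
    uint32_t prevunspentind; /* previous unspent paying the same destination, 0 = end of list */
};

struct bundlefile
{
    uint32_t firstheight,numtx,numunspents;
    uint32_t listhead[NUM_SLOTS];   /* hash-table slot (derived from the rmd160) -> head of linked list */
    struct unspent unspents[];      /* never modified once the bundle file is written */
};

/* the only global, mutable state: one bit set once per unspent when it is consumed */
struct spendtable { uint8_t *spentbits; };

/* walk all unspents inside one read-only bundle that pay a given destination slot */
uint64_t bundle_sum_unspents(const struct bundlefile *bp,uint32_t slot)
{
    uint64_t total = 0;
    for (uint32_t ind = bp->listhead[slot]; ind != 0; ind = bp->unspents[ind].prevunspentind)
        total += bp->unspents[ind].value;
    return total;
}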

Over time, an operation might need to scan all bundles, but I just got benchmarks on a 1.4GHz i5: 2.5 milliseconds to scan 1900 bundles (for a coin using 500-block bundles). So it is not blazing fast, but it would be fast enough for most use cases and comparable to the RPC overhead, where things are serialized with system locks. And this is with the ability to find all utxo for any address. Also note that the parallel design makes multithreading ideal, and speedups using multiple cores would be trivial to add.

If even more performance is needed, each speed-boosted address can just keep things cached.

Without sigs, my data set for txindex=1 is about 25GB uncompressed and 15GB compressed. I am thinking of defaulting to not storing the sigs, since once a spend is validated and the bundle is marked as sigs-and-inputs verified, I don't see many cases where the sigs are needed anymore. Certainly not at the cost of doubling the total space.

Quote
If it's purely local, you could probably always number that way, so long as you had code to fix up the numbering during reorg.
What number of blocks is beyond common-sense reorg danger? I know very long reorgs are theoretically possible, but from what I have seen big reorgs are not common. I think that by setting a delay of N blocks before creating the most recent bundle, the odds of having to regenerate that bundle would be very low. Since it takes about a minute to regenerate, it isn't a giant disaster if it has to be redone, but it is best to set defaults so that it almost never happens.

Then, by keeping the full-sized form of the data for the most recent N blocks, I think the advantage of compactness is achieved without any negatives [with the caveat that any reorg past the N blocks requires regenerating that bundle and everything past it], and these are all purely local optimizations.
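As a sketch of this policy (the value of N here is an arbitrary placeholder, and the helper name is made up), the decision of when a bundle may be frozen into a read-only file reduces to a small calculation; everything the function does not cover stays in the full-sized form.

Code:
#include <stdint.h>

#define BUNDLE_SIZE 500   /* blocks per bundle, as in the 500-block bundles mentioned above */
#define REORG_DELAY 60    /* N: how far behind the tip a block must be before it gets bundled */

/* highest bundle index that is safe to freeze into a read-only file, or -1 if none yet */
int32_t last_finalizable_bundle(int32_t chaintip_height)
{
    int32_t safe_height = chaintip_height - REORG_DELAY;
    if (safe_height < BUNDLE_SIZE - 1)
        return -1;
    return (safe_height + 1) / BUNDLE_SIZE - 1;   /* only bundles ending at or below safe_height */
}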

I was wondering if it would be possible to get a network services bit? That would indicate a node is available for tx-based queries from other nodes. Since txindex=1 comes for free, I figure it would be a good thing to make available to other nodes, in addition to sharing hashes of the bundles. This would make things not purely local, but it would operate orthogonally to the existing peer protocol, just helping to bootstrap new nodes and lightweight nodes, and it would only affect other nodes that have the bit set. I saw the docs say to ask about such bits or just start using one; I'm not sure of the exact process to get a bit allocated.

Quote
There is an additional test harness built from BitcoinJ that simulates a bitcoin network and puts the node through its paces. It's called blocktester. Look at the Travis CI configuration in Bitcoin Core to get an idea of the tests that are automatically run on it.
Fantastic! This will help a lot. Thank you.

Quote
Standard warnings about the near impossibility of writing new code which is consensus compatible with other code apply-- if you don't find some new surprising behavior in the network that we don't have tests for, you're almost certainly not testing hard enough to have a chance at it. Smiley
I know. Unfortunately I am used to doing the impossible, and external events forced me onto this path in recent months. I started the iguana iteration last November.

The first production iteration won't be a mining-enabled version; my target audience (the mass market) doesn't really have many ASICs handy. By following the network consensus, it avoids having to recreate all the intricate consensus rules for all the variations in the installed base.

I can recommend that everyone write their own Bitcoin Core equivalent. It sure is a good way to learn how it all works Smiley

James
staff
Activity: 4284
Merit: 8808
[I split the thread, because we were getting away from talking about the compact aggregated signatures; into things which don't change the network behavior]

In fact they might be required to be public.
Something we must fight against if we want Bitcoin to retain its utility in the future. No one uses a system of money with fungibility and privacy as bad as a bitcoin that has lost its properties.

Quote
in many cases. I think the usage of a DB is the cause of most of the slowdown. Since the blockchain mostly never changes, there is no need for using ACID compliant DB operations for everything.
We do not store the blockchain in a database in Bitcoin Core.

Quote
With my design all nodes are doing txindex=1 and it takes very little time to "rescan" a new privkey as everything already has the tx in linked lists and hash tables.
txindex=1 is not very compatible with scalability; you end up with an ever-growing index. It makes some operations faster, but at a considerable cost.

In Bitcoin Core, reindex spends a fair amount of time rehashing things to check data integrity, because later it is assumed to be true. It could easily be made very fast-- but at the cost of additional indexes. Right now the marginal cost of a full node wallet that supports arbitrary key import is about 65GB and rapidly growing. (Pruned vs non-pruned plus txindex).

Quote
so I am not sure why you consider it bad software. I assume you don't consider using gzip a bad practice?
The "bad software" wasn't referring to local optimizations by any means.  What I mean is that someone who doesn't like privacy can release a wallet which forces its users to always reuse addresses and then justify it with "reuse is more efficient because things use the indexes"-- but even this would only apply if they were used 'on the wire'; if they're purely local, they're purely local.   The txindex in Bitcoin Core works internally with compact indexes in a similar way, in fact.

I'm super supportive of purely local optimizations.

Quote
(when beyond the practical chance of being reorged) as 32 bits.
If it's purely local, you could probably always number that way, so long as you had code to fix up the numbering during reorg.

Quote
I wanted to use the standard bitcoin RPC test suite, and I assume what is in the repo is the most complete set? Is there some bruteforce RPC tester that will exercise large parts of the RPC that can be used as a validation step for a new implementation of Bitcoin Core?
There is an additional test harness built from BitcoinJ that simulates a bitcoin network and puts the node through its paces. It's called blocktester. Look at the Travis CI configuration in Bitcoin Core to get an idea of the tests that are automatically run on it.

Standard warnings about the near impossibility of writing new code which is consensus compatible with other code apply-- if you don't find some new surprising behavior in the network that we don't have tests for, you're almost certainly not testing hard enough to have a chance at it. Smiley
legendary
Activity: 1176
Merit: 1134
An implementation idea would be to assume that SIGHASH_ALL is used for all these; it seems that other SIGHASH modes are rarely used, and I am not sure it makes sense to support them for radically new use cases.
Well, everything I described there is completely compatible with all the sighash types, so there is no real need to limit flexibility. Assuming the sighash code is separate and reused, it shouldn't increase implementation complexity.

Quote
Also, is there a reason that a unique number can't be used to identify each txid and even each output script (address)? To my thinking, once a block is past any chance of being reorganized, there is a canonical ordering of all blocks, and therefore of all tx and vins and vouts.

Since each spend currently requires the 32-byte txid and vout, mapping this to 4 or 6 bytes creates a lot of space savings. There is a lot of redundancy in the blockchain, with each txid potentially being duplicated once for each vout.
That could be done-- but there is no need to do so normatively. E.g. peers could agree to transmit data using these compressed indexes, while still hashing the original values. This has the advantage of having the IDs not change out from under already-authored transactions, and of making sure that offline devices don't need access to (or to trust) external index providers.

Quote
The other big redundancy is reused addresses, which are actually rmd160 hashes inside spend scripts. Using the canonical ordering of everything, each rmd160 hash would map to a 32-bit index, and with the vast majority of scripts being standard, a few more bytes can be encoded into a few bits. However, with the coming diaspora of p2sh script proliferation, it could be that address numbering won't be so effective in the future.
This would undermine the privacy/fungibility properties of bitcoin as a whole by incentivizing parties to reuse addresses. Bitcoin's privacy strongly depends on non-reuse, and additional pressure against it for the benefit of saving on the order of 12 bytes per txout doesn't sound like a win-- especially as it would be used to justify bad software, bad business practices, and bad public policy that force users to reuse. Access to those indexes would have to be handled quite carefully, since if you paid the wrong one the funds would be stolen.

Thanks for your thoughts!
I used to think the BTC blockchain's exponential growth was a big problem. Then I realized I could distill it to one quarter the size and also sync in parallel at bandwidth speeds. Now I don't worry about blockchain bloat so much.

I think we have to admit that a large part of the BTC blockchain has been deanonymized. And until new methods like CT are adopted, this will only get worse.

So rather than fight a losing battle, why not accept that there are people who don't care much about privacy and for whom convenience is more important. In fact they might be required to be public. The canonical encoding allows encoding the existing blockchain, and any future public blockchain, far more compactly than any other method, as it ends up as high-entropy compressed 32-bit numbers instead of a 32-byte txid + vout. The savings is much more than 12 bytes; it takes only 6 bytes to encode an unspent, so it is closer to 30 bytes.

A lot of bitcoin processing is slow for a variety of reasons. Being able to do computations via standard integers allows for an order of magnitude faster operations in many cases. I think the usage of a DB is the cause of most of the slowdown. Since the blockchain mostly never changes, there is no need for using ACID compliant DB operations for everything. With my design all nodes are doing txindex=1 and it takes very little time to "rescan" a new privkey as everything already has the tx in linked lists and hash tables.
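Here is the size arithmetic behind the roughly 30-byte figure, as a sketch (the exact field widths in iguana may differ; this just contrasts a full txid+vout reference with a 32-bit txidind plus 16-bit vout).

Code:
#include <stdint.h>
#include <stdio.h>

/* standard way to reference a spend: the full txid plus vout */
struct rawspend { uint8_t txid[32]; uint32_t vout; };

/* canonical-index form: the txid replaced by its 32-bit canonical number */
#pragma pack(push,1)
struct compactspend { uint32_t txidind; uint16_t vout; };
#pragma pack(pop)

int main(void)
{
    printf("raw %zu bytes, compact %zu bytes, saved %zu bytes per spend\n",
           sizeof(struct rawspend),sizeof(struct compactspend),
           sizeof(struct rawspend) - sizeof(struct compactspend));
    return 0;   /* prints: raw 36 bytes, compact 6 bytes, saved 30 bytes per spend */
}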

The assumption is that each node is maintaining the canonical indexes; otherwise trust is required. With that assumption, this indexing system can be viewed as a highly efficient lossless codec, so I am not sure why you consider it bad software. I assume you don't consider using gzip a bad practice? I try hard to preserve compatibility with the existing Bitcoin Core. What I describe does not change the network protocol at all; that is a separate post Smiley

I am not advocating changing the signed tx to be based on 32-bit numbers! All I recommend is to encode them internally (when beyond the practical chance of being reorged) as 32 bits. All nodes will end up with the same numbering. However, all external interactions continue with the fully expanded form, else it wouldn't be compatible at all. All bundles and subsets of bundles would have verifiable hashes, and this creates an efficient encoding for permanent archival storage, with arguably isomorphic behavior to the current raw blockchain inside a DB. The tx that are signed and verified are of course using the fully expanded form. However, if you just want to explore the blockchain, nobody really cares about the exact txid, just that funds went from A to B.
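A sketch of how that internal-only numbering could work (the table and helper names are made up, not the actual iguana API): the indexes are assigned purely from the canonical ordering of finalized blocks, so every node derives the same numbering independently, and they are translated back to full txids at every external boundary.

Code:
#include <stdint.h>
#include <string.h>

#define MAX_TXIDS 1000000   /* sized arbitrarily for the sketch */

/* txidind -> txid table, filled strictly in canonical (block, then tx) order */
struct txidmap
{
    uint32_t numtxids;
    uint8_t txids[MAX_TXIDS][32];
};

/* called while iterating finalized blocks in order; since every node sees the same
   canonical ordering, every node assigns the same index to the same txid */
uint32_t txidmap_add(struct txidmap *map,const uint8_t txid[32])
{
    memcpy(map->txids[map->numtxids],txid,32);
    return map->numtxids++;
}

/* expand a compact index back to the full txid for any external interaction
   (RPC, signing, network messages), which always use the expanded form */
const uint8_t *txidmap_expand(const struct txidmap *map,uint32_t txidind)
{
    return (txidind < map->numtxids) ? map->txids[txidind] : NULL;
}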

These are not just thoughts, but a description of a working system that syncs the entire blockchain and creates the above data structures in about half an hour if you have enough bandwidth and 8 cores. It is bandwidth limited, so on a slow home connection of 20mbps it takes about 6 hours for a full sync. I am currently debugging the unspents handling, but the rest is basically working and just needs validation. And this part is derived from MGW, which has been in service for a couple of years.

I wanted to use the standard bitcoin RPC test suite, and I assume what is in the repo is the most complete set? Is there some bruteforce RPC tester that will exercise large parts of the RPC that can be used as a validation step for a new implementation of Bitcoin Core?

James