Topic: Pruning and automatic checkpointing

legendary
Activity: 1176
Merit: 1134
February 19, 2016, 07:21:07 PM
#6
Reindexing exists exclusively because the local state may have become corrupted.  Trusting the state of a corrupted node is not what you really want to do in a reindex.

Specifically, prior errors in signature validation (nodes not updated for soft forks, or nodes run with an incompatible OpenSSL update) caused nodes both to accept signatures they should have rejected and to reject signatures they should have accepted. A reindex currently clears that state.

On a fast host a reindex takes me under three hours, so that puts an absolute upper bound on the improvement possible.

The bigger question is: why are you reindexing in the first place?

I think the general direction in Bitcoin Core is a complete removal of checkpoints or anything resembling them. Other fixes have now largely mooted their original utility, and they have reliably caused severe misunderstandings of the security model (including, unfortunately, in academic papers) which have been harmful far beyond the narrow advantages they provide.
Syncing the entire chain and building the data structures from scratch in 30 minutes feels like a pretty not-narrow advantage.

I used to feel that the blockchain size was a big problem, but now that I can stream-process the entire blockchain in parallel, it is not a problem at all. On a good 500 Mbps connection it resyncs over lunch; even on a typical 20 Mbps home connection it syncs in about 6 hours.

In order to achieve this I had to make everything run in parallel, and since it is all verified locally I didn't see any problems.
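
A minimal sketch of that kind of fan-out, assuming one worker thread per 2000-block bundle; process_bundle() is an illustrative placeholder, not code from iguana (build with: cc parallel.c -lpthread):
Code:
/* One worker per bundle, shown unbounded for brevity; a real client would
 * cap this with a worker pool. */
#include <pthread.h>
#include <stdio.h>

#define NUM_BUNDLES 8                 /* e.g. chain_height / 2000, truncated for the demo */

static void *process_bundle(void *arg)
{
    int bundle = *(int *)arg;
    /* fetch the bundle's 2000 blocks, verify them, write its read-only file */
    printf("bundle %d processed\n", bundle);
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_BUNDLES];
    int ids[NUM_BUNDLES];
    for (int i = 0; i < NUM_BUNDLES; i++) {
        ids[i] = i;
        pthread_create(&workers[i], NULL, process_bundle, &ids[i]);
    }
    for (int i = 0; i < NUM_BUNDLES; i++)
        pthread_join(workers[i], NULL);
    return 0;
}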

James
staff
Activity: 4284
Merit: 8808
February 19, 2016, 05:53:34 PM
#5
Reindexing exists exclusively because the local state may have become corrupted.  Trusting the state of a corrupted node is not what you really want to do in a reindex.

Specifically, prior errors in signature validation (nodes not updated for soft forks, or nodes run with an incompatible OpenSSL update) caused nodes both to accept signatures they should have rejected and to reject signatures they should have accepted. A reindex currently clears that state.

On a fast host a reindex takes me under three hours, so that puts an absolute upper bound on the improvement possible.

The bigger question is: why are you reindexing in the first place?

I think the general direction in Bitcoin Core is a complete removal of checkpoints or anything resembling them. Other fixes have now largely mooted their original utility, and they have reliably caused severe misunderstandings of the security model (including, unfortunately, in academic papers) which have been harmful far beyond the narrow advantages they provide.
legendary
Activity: 1176
Merit: 1134
February 19, 2016, 04:00:55 PM
#4
Also, my design doesn't require reindexing, as all the data is put directly into the bundle files with enough indexing information in them to perform the required operations. That is why the dataset grew to 25GB. If more speed is needed for queries, another layer of indexing can be added once all the bundle files arrive, but I am seeing decent performance for most queries without that layer for now.

With pruning, data is thrown away.  It just seems wasteful to re-validate in addition to having to re-download.
Certainly. Once validated, just set a flag locally that it was validated. If you are paranoid about external tampering, just save these validated flags in an encrypted file.
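
A minimal sketch of one way to do that: instead of full encryption, append an HMAC-SHA256 keyed with a local secret so any edit to the flag bytes is detected at load time. Purely illustrative; save_validated_flags() is not from iguana (build with: cc flags.c -lcrypto):
Code:
#include <stdio.h>
#include <stdbool.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/* Write one "validated" flag byte per bundle, followed by an HMAC over them. */
bool save_validated_flags(const char *path, const unsigned char *flags, size_t n,
                          const unsigned char *key, size_t keylen)
{
    unsigned char mac[32];                       /* SHA-256 output size */
    unsigned int maclen = 0;
    HMAC(EVP_sha256(), key, (int)keylen, flags, n, mac, &maclen);

    FILE *fp = fopen(path, "wb");
    if (!fp)
        return false;
    bool ok = fwrite(flags, 1, n, fp) == n && fwrite(mac, 1, maclen, fp) == maclen;
    fclose(fp);
    return ok;
}
Loading reverses the process: read the flag bytes, recompute the HMAC with the same key, and refuse to trust the flags if the stored and recomputed values differ.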

James
legendary
Activity: 1232
Merit: 1094
February 19, 2016, 03:46:19 PM
#3
Also, my design doesn't require reindexing, as all the data is put directly into the bundle files with enough indexing information in them to perform the required operations. That is why the dataset grew to 25GB. If more speed is needed for queries, another layer of indexing can be added once all the bundle files arrive, but I am seeing decent performance for most queries without that layer for now.

With pruning, data is thrown away.  It just seems wasteful to re-validate in addition to having to re-download.
legendary
Activity: 1176
Merit: 1134
February 19, 2016, 03:31:51 PM
#2
One of the disadvantages of pruning is that you have to re-download everything for re-indexing.

When downloading the blockchain, the reference client skips signature validation until it reaches the last checkpoint.  This greatly speeds up processing up to that point; after the checkpoint it slows down.

There is no loss in security by self-checkpointing.  When a block is validated, a checkpoint could be stored in the database.  This would be a "soft" checkpoint.  It would mean that signature validation doesn't have to happen until that point.

In 0.12, the last checkpointed block has a height of 295000.  That block is over 18 months old, so Core has to fully verify all blocks received in the last 18 months.

When downloading, Core fetches blocks 0 to 295000 without performing signature validation and then fully validates everything from 295000 to 399000.  If re-indexing is requested, it has to validate those 100k blocks a second time.  There is no security value in doing that.

Instead, the client could record the hashes of blocks that have already been validated.  If a block is more than 2016 blocks deep and its height is divisible by 5000, then soft-checkpoint the block.

It is possible that the new signature validation library resolves the problem.  That would make the problem moot.
In iguana I save both the first block hash and a hash over the 2000 block headers for each bundle of 2000 blocks. This allows the entire header chain to be verified as it comes in, and all the bundles of blocks in each header set to be loaded in parallel, with assurance that each will be the right set of blocks.
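
A minimal sketch of that bundle check, assuming the standard 80-byte header serialization and a single SHA-256 over the concatenated headers as the stored bundle digest; iguana's actual hashing may differ (build with: cc bundle_check.c -lcrypto):
Code:
#include <string.h>
#include <stdbool.h>
#include <openssl/sha.h>

#define HEADERS_PER_BUNDLE 2000
#define HEADER_SIZE        80

/* Verify a received bundle of 2000 serialized headers against the digest
 * saved for that bundle before handing its blocks to a parallel loader. */
bool bundle_headers_match(const unsigned char headers[HEADERS_PER_BUNDLE][HEADER_SIZE],
                          const unsigned char expected_digest[SHA256_DIGEST_LENGTH])
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(&headers[0][0], (size_t)HEADERS_PER_BUNDLE * HEADER_SIZE, digest);
    return memcmp(digest, expected_digest, SHA256_DIGEST_LENGTH) == 0;
}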

Because spends can reference still-unseen outputs, full signature validation can't be 100% done until all the bundles have been processed in the second pass (the first pass fetches the raw blocks). However, a lot of things can be verified during the second pass, and I create read-only files for each bundle.
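
A toy illustration of that deferral, using a simple in-memory "seen outputs" set (a stand-in, not iguana's data structures): an input whose previous output has not been indexed yet is pushed to the later signature pass.
Code:
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct outpoint { uint8_t txid[32]; uint32_t vout; };

/* Outputs already indexed from processed bundles (toy fixed-size set). */
static struct outpoint seen_outputs[4096];
static int num_seen;

static bool output_available(const struct outpoint *prev)
{
    for (int i = 0; i < num_seen; i++)
        if (seen_outputs[i].vout == prev->vout &&
            memcmp(seen_outputs[i].txid, prev->txid, 32) == 0)
            return true;
    return false;
}

/* Returns true if the input was checked now; false means "recheck it in the
 * final signature pass, once every bundle has arrived". */
static bool try_verify_input(const struct outpoint *prev)
{
    if (!output_available(prev))
        return false;
    /* ...script and signature verification against the resolved output... */
    return true;
}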

Since they are read-only files, the entire set of them can be put into a squashfs to reduce its size to about 15GB (probably 20GB once I get all the data into the bundle files). The read-only files include bloom-filter lookup tables, so by memory-mapping them you get an in-memory structure that can be used directly for queries, with no load time needed at startup. Another advantage of the read-only format is that once it is validated it doesn't change, so it doesn't need to be re-verified on each restart. [I should add some checks to make sure the files haven't changed, to prevent external tampering.]
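
A minimal sketch of the memory-mapping step, with a placeholder on-disk layout (not iguana's real bundle format):
Code:
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct bundle_header {        /* hypothetical fixed-size file prefix */
    uint32_t magic;
    uint32_t num_blocks;
    uint64_t bloom_offset;    /* byte offset of the bloom-filter table */
};

/* Map a finished, read-only bundle file so its lookup tables can be used
 * in place, with no load step at startup. */
const struct bundle_header *map_bundle(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                /* the mapping stays valid after the fd is closed */
    if (base == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return (const struct bundle_header *)base;
}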

Since the two passes only take 30 minutes, I think I will add a third pass for signature verification, to avoid having to do partial sig verifications and then resume, etc.

Also, my design doesn't require reindexing, as all the data is put directly into the bundle files with enough indexing information in them to perform the required operations. That is why the dataset grew to 25GB. If more speed is needed for queries, another layer of indexing can be added once all the bundle files arrive, but I am seeing decent performance for most queries without that layer for now.

James
legendary
Activity: 1232
Merit: 1094
February 19, 2016, 02:50:34 PM
#1
One of the disadvantages of pruning is that you have to re-download everything for re-indexing.

When downloading the blockchain, the reference client skips signature validation until it reaches the last checkpoint.  This greatly speeds up processing up to that point; after the checkpoint it slows down.

There is no loss in security by self-checkpointing.  When a block is validated, a checkpoint could be stored in the database.  This would be a "soft" checkpoint.  It would mean that signature validation doesn't have to happen until that point.

In 0.12, the last checkpointed block has a height of 295000.  That block is over 18 months old, so Core has to fully verify all blocks received in the last 18 months.

When downloading, Core fetches blocks 0 to 295000 without performing signature validation and then fully validates everything from 295000 to 399000.  If re-indexing is requested, it has to validate those 100k blocks a second time.  There is no security value in doing that.

Instead, the client could record the hashes of blocks that have already been validated.  If a block is more than 2016 blocks deep and its height is divisible by 5000, then soft-checkpoint the block.
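
A minimal sketch of that rule; the constants and helper are illustrative, not code from Bitcoin Core:
Code:
#include <stdio.h>
#include <stdbool.h>

#define SOFT_CHECKPOINT_DEPTH    2016   /* block must be at least this many blocks deep */
#define SOFT_CHECKPOINT_INTERVAL 5000   /* and its height must be divisible by this     */

static bool qualifies_as_soft_checkpoint(unsigned block_height, unsigned tip_height)
{
    if (tip_height < block_height + SOFT_CHECKPOINT_DEPTH)
        return false;                   /* fewer than 2016 blocks on top of it */
    return block_height % SOFT_CHECKPOINT_INTERVAL == 0;
}

int main(void)
{
    unsigned tip = 399000;
    unsigned heights[] = { 295000, 390000, 392000, 398500 };
    for (int i = 0; i < 4; i++)
        printf("height %u -> %s\n", heights[i],
               qualifies_as_soft_checkpoint(heights[i], tip) ? "soft checkpoint" : "skip");
    return 0;
}
With the tip at height 399000, blocks 295000 and 390000 qualify, 392000 fails the divisibility test, and 398500 is not yet deep enough.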

It is possible that the new signature validation library resolves the problem.  That would make the problem moot.