One of the disadvantages of pruning is that you have to re-download everything in order to re-index.
When downloading the blockchain, the reference client skips signature validation until it reaches the last checkpoint. This greatly speeds up processing up to that point; after the checkpoint, it slows down considerably because every signature has to be checked.
There is no loss of security in self-checkpointing. When a block is validated, a checkpoint could be stored in the database. This would be a "soft" checkpoint: it would mean that signature validation doesn't have to be repeated up to that point.
In 0.12, the last checkpointed block has a height of 295000. That block is over 18 months old, so Core has to verify every block received in the last 18 months.
When downloading, Core fetches blocks 0 to 295000 without performing signature validation and then fully validates everything from 295000 to 399000. If a re-index is requested, it has to validate those ~100k blocks a second time. There is no security value in doing that.
Instead, the client could record the hashes of blocks that have already been validated. If a block is more than 2016 blocks deep and its height is divisible by 5000, soft-checkpoint it.
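A rough sketch of that rule in C (the names and in-memory storage are illustrative only, not actual Bitcoin Core code; in practice the checkpoints would be persisted to the block database):

#include <stdint.h>
#include <string.h>

#define SOFT_CHECKPOINT_DEPTH    2016   /* block must be buried this deep */
#define SOFT_CHECKPOINT_INTERVAL 5000   /* checkpoint every 5000 blocks   */
#define MAX_SOFT_CHECKPOINTS     4096

/* Hypothetical sketch: a "soft" checkpoint is just the hash of a block that
   was already fully validated, remembered so validation does not have to be
   repeated on a reindex. */
struct soft_checkpoint { int32_t height; uint8_t hash[32]; };

static struct soft_checkpoint checkpoints[MAX_SOFT_CHECKPOINTS];
static int num_checkpoints;

void maybe_soft_checkpoint(int32_t block_height, int32_t tip_height,
                           const uint8_t block_hash[32])
{
    int buried  = (tip_height - block_height) > SOFT_CHECKPOINT_DEPTH;
    int on_grid = (block_height % SOFT_CHECKPOINT_INTERVAL) == 0;
    if (buried && on_grid && num_checkpoints < MAX_SOFT_CHECKPOINTS) {
        checkpoints[num_checkpoints].height = block_height;
        memcpy(checkpoints[num_checkpoints].hash, block_hash, 32);
        num_checkpoints++;   /* in practice this would be written to the DB */
    }
}

/* On reindex, signature checks can be skipped for any block at or below the
   highest recorded soft checkpoint. */
int can_skip_sig_checks(int32_t block_height)
{
    return num_checkpoints > 0 &&
           block_height <= checkpoints[num_checkpoints - 1].height;
}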
It is possible that the new signature validation library resolves this, which would make the problem moot.
In iguana I save both the first block hash and the hash of the 2000 blocks for each blockhdrs set, i.e. for every 2000 blocks. This allows verification of the entire set of headers as it comes in, and parallel loading of all the bundles of blocks in each set, with assurance that it will be the right set of blocks.
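A rough sketch of that per-bundle bookkeeping (structure and function names are illustrative, not iguana's actual code, and the hash function is a placeholder standing in for a real cryptographic hash):

#include <stdint.h>
#include <string.h>

#define BUNDLE_SIZE 2000   /* iguana groups blocks into bundles of 2000 */

typedef struct { uint8_t bytes[32]; } hash256_t;

/* One record per bundle: the first block hash plus a hash committing to the
   whole bundle, so a bundle fetched in parallel can be checked against what
   the headers promised. */
struct bundle_record {
    int32_t   first_height;      /* height of the bundle's first block */
    hash256_t first_blockhash;   /* hash of the first block */
    hash256_t bundle_hash;       /* hash over the bundle's 2000 block hashes */
};

/* Placeholder fold, NOT a real cryptographic hash; a real implementation
   would hash the 2000 block hashes in order with something like SHA-256. */
static hash256_t calc_bundle_hash(const hash256_t blockhashes[BUNDLE_SIZE])
{
    hash256_t out;
    uint64_t acc = 14695981039346656037ULL;   /* FNV-1a style accumulator */
    for (int i = 0; i < BUNDLE_SIZE; i++)
        for (int j = 0; j < 32; j++)
            acc = (acc ^ blockhashes[i].bytes[j]) * 1099511628211ULL;
    memset(&out, 0, sizeof(out));
    memcpy(out.bytes, &acc, sizeof(acc));
    return out;
}

/* Returns nonzero if a downloaded bundle matches its saved record. */
int bundle_matches(const struct bundle_record *rec,
                   const hash256_t blockhashes[BUNDLE_SIZE])
{
    hash256_t h;
    if (memcmp(&rec->first_blockhash, &blockhashes[0], sizeof(hash256_t)) != 0)
        return 0;
    h = calc_bundle_hash(blockhashes);
    return memcmp(&rec->bundle_hash, &h, sizeof(hash256_t)) == 0;
}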
Due to spends referencing still-unseen outputs, full signature validation can't be 100% done until all the bundles are processed in the second pass (the first pass fetches the raw blocks). However, a lot of things can be verified during the second pass, and I create read-only files for each bundle.
Since they are read-only files, the entire set of them can be put into a squashfs, reducing its size to about 15GB (probably 20GB once I get all the data into the bundle files). The read-only files include bloom filter lookup tables, so by memory mapping them you get an in-memory structure that can be used directly for queries, with no time needed at startup. Another advantage of the read-only format is that once it is validated, it doesn't change, so it doesn't need to be verified again on each restart. [I should add some checks to make sure the files haven't changed, to prevent external tampering.]
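A minimal sketch of what memory-mapping such a read-only bundle file looks like (the on-disk layout shown is illustrative, not iguana's actual format):

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative layout: a fixed header followed by a bloom-filter bit array
   that serves as the lookup table for queries. */
struct bundle_file_header {
    uint32_t magic;          /* file-type marker */
    int32_t  first_height;   /* first block height in this bundle */
    uint32_t bloom_bits;     /* size of the bloom filter in bits */
    /* bloom filter bytes and the rest of the bundle data follow */
};

/* Map a read-only bundle file into memory. Because the file never changes,
   the mapping can be used directly as the in-memory query structure, with no
   parsing or rebuilding at startup. */
const struct bundle_file_header *map_bundle(const char *path, size_t *len)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return (const struct bundle_file_header *)p;
}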
Since the two passes only take 30 minutes, I think I will add a third pass for signature verification, to avoid having to do partial sig verifications and then resume, etc.
Also, my design doesn't require reindexing, as all the data is put directly into the bundle files with enough indexing information in them to do the required operations. That is why the dataset grew to 25GB. If more speed is needed for queries, another layer of indexing can be added once all the bundle files arrive, but I am seeing decent performance for most queries without that layer for now.
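If such a layer were added, it could be as simple as one sorted table built after all bundles are present, for example (purely illustrative, not iguana code; the field names are hypothetical):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A possible extra indexing layer mapping a txid to the bundle and offset
   that hold it, so queries need not probe every bundle's bloom filter. */
struct txid_index_entry {
    uint8_t  txid[32];       /* transaction id (first field, used as key) */
    int32_t  bundle_id;      /* which 2000-block bundle file */
    uint32_t record_offset;  /* offset of the tx record inside that bundle */
};

struct txid_index {
    struct txid_index_entry *entries;  /* sorted by txid for binary search */
    size_t num_entries;
};

static int cmp_txid(const void *key, const void *elem)
{
    /* key is a 32-byte txid; elem starts with its 32-byte txid field */
    return memcmp(key, elem, 32);
}

/* Binary-search lookup; returns NULL if the txid is not in the index. */
const struct txid_index_entry *txid_index_find(const struct txid_index *idx,
                                               const uint8_t txid[32])
{
    return bsearch(txid, idx->entries, idx->num_entries,
                   sizeof(struct txid_index_entry), cmp_txid);
}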
James