Hi Bill,
I notice you haven't pushed any changes to the git repos recently. How are things coming?
From a brief look through your code, it doesn't seem like you have worked on creating the ledger snapshot from Bitcoin yet. That's actually what I'm going to be working on right now, with my bitcoin spinoff toolkit.
I was tempted to write the spinoff toolkit in OCaml (I much prefer it to C++) and maybe submit a pull request to you, but ultimately I wanted something that was easy to integrate with existing cryptocurrency codebases and something that more of the general public could contribute to. Of course, in reality, I'm unlikely to get contributions either way.
Well, I hope things are going well for you. Qeditas seems like a very exciting and daunting project.
The Qeditas code is in progress, but still unfinished. I decided to push my most recent commits to the remote repository this morning. Coincidentally, the thing I've been doing the past few days is creating the initial ledger (as a compact/Merkle tree) from the snapshot I made of the Bitcoin block chain last March.
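In case a picture helps: committing a snapshot to a single ledger hash is, at its simplest, just a Merkle root over the (address hash, balance) entries. The sketch below is a generic Python illustration only; it is not the Qeditas compact tree, and the leaf encoding is an assumption made up for the example.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def leaf(addr_hex: str, satoshis: int) -> bytes:
        # one leaf per snapshot entry: 20-byte address hash plus 8-byte balance
        return h(bytes.fromhex(addr_hex) + satoshis.to_bytes(8, "big"))

    def merkle_root(leaves):
        level = list(leaves)
        if not level:
            return h(b"")
        while len(level) > 1:
            if len(level) % 2 == 1:
                level.append(level[-1])      # duplicate the last node on odd levels
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    # usage: root = merkle_root(leaf(a, s) for a, s in snapshot_entries)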
You're correct that code to create a snapshot is not in any of the git repositories. I did write some OCaml code to do this (in cooperation with bitcoind), but most of the code is a mess. For example, there are references to specific directories and filenames where I saved intermediate results. It took roughly 6 weeks (from mid-February to the end of March of this year) to successfully construct the snapshot I wanted (through the first 350,000 blocks). The end result of the snapshot is here:
https://mega.co.nz/#F!0xhzUbTS!h-LeyiTbF9dXgusKCc_bfQ
It's split into p2pkh balances and p2sh balances. In each case a line gives the underlying 20-byte hash of the address in hex, followed by the balance in satoshis at the snapshot. There are also two small Python scripts people can use to check their balances by giving the corresponding Bitcoin address. Once I had the snapshot I wanted to move on to coding Qeditas itself, so I never spent more time on the code that generated the snapshot.
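In case it's useful, here is a rough Python sketch of that kind of balance check (this is not one of the scripts in that folder, and the one-entry-per-line "hex-hash balance" file format is my assumption): decode the Base58Check address to its 20-byte hash and look that hex string up in the appropriate snapshot file.

    import hashlib

    ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def addr_to_hash_hex(addr: str) -> str:
        # Base58Check decode: 1 version byte + 20-byte hash + 4-byte checksum
        n = 0
        for c in addr:
            n = n * 58 + ALPHABET.index(c)
        raw = n.to_bytes(25, "big")
        payload, checksum = raw[:-4], raw[-4:]
        if hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4] != checksum:
            raise ValueError("bad address checksum")
        return payload[1:].hex()              # drop the version byte, keep the 20-byte hash

    def lookup_balance(snapshot_file: str, addr: str) -> int:
        target = addr_to_hash_hex(addr)
        with open(snapshot_file) as f:
            for line in f:
                hash_hex, satoshis = line.split()
                if hash_hex.lower() == target:
                    return int(satoshis)
        return 0                              # address not present at the snapshot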
The biggest problem I faced was resource bounds. Whatever I tried would require either more RAM than I had, too much disk space, or too much time. I spent most of my time modifying my strategy for computing the snapshot to work around these bounds. A more reasonable strategy probably would have been to buy more capable hardware. Here are a few things I can say that might be helpful:
1. The blocks are all stored in .bitcoin/blocks/*.dat files, and one can read the relevant data from there without needing bitcoind (now bitcoin-cli, I suppose). However, this is risky: the files include orphan blocks, and I'm also unsure whether the blocks are guaranteed to be in the correct order. After realizing this, I used bitcoind with getblockhash to explicitly give me the hash of each of the 350,000 blocks in the main chain. With those hashes in hand, one can either read blocks from the *.dat files and skip the orphans, or use bitcoind getblock with each hash (a rough sketch of the RPC route follows this list). I tried both and am no longer sure which variant was used to make the final snapshot.
2. I went through all the txs in the main chain and computed that only the first 7 bytes of the 32-byte hash are needed to uniquely identify a tx. There are two exceptions, txids that appear twice because the corresponding 'twin' txs are byte-for-byte identical:
d5d27987d2a3dfc724e359870c6644b40e497bdc0589a033220fe15429d88599
e3bf3d07d4b0375638d5f1db5255fe07ba2c4cb067cd81b84ee974b6585fb468
These were duplicate coinbase txs: since coinbase inputs don't reference previous outputs, two coinbases could be identical and therefore share a txid. I chose to ignore the duplicate and only count each tx once. My understanding is that in Bitcoin the relevant txos can only be spent once in spite of being in the block chain twice.
The important point is that it was sufficient to use only 7 bytes to identify txs instead of 32. Since there have been many txs since the end of March, it's possible 7 bytes is no longer enough (a sketch after this list shows how one might recheck this).
3. My first attempt at creating a snapshot was to go forward through the blocks and store all utxos until they were spent. In principle, at the end I would have had all the utxos and hence the snapshot. This ran out of memory. As a workaround, I tried reading a certain number of blocks, saving the state (the utxos so far), and then restarting. This still wasn't good enough. Then I tried processing all the txs in reverse order, keeping in memory all the spent txos that I had not yet seen created. Each time I encountered a new txo, it was either already known to be spent in the future or I knew it should go into the snapshot. This also ran out of memory. At this point I would have to carefully review my notes to see exactly what I did, but it's at least clear that I saved some very large files named "spentxxx" and "unspentxxx", where xxx ranged from 000 to roughly 250 -- mirroring the names of the block*.dat files. I do roughly remember creating some of these files by processing the blocks forward and the others by processing the blocks backward, and then extracting the snapshot from a combination of that information. A bare-bones sketch of the reverse-scan idea also appears below.
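To make point 1 concrete, here is a rough Python sketch of the RPC route. getblockhash and getblock are standard bitcoind JSON-RPC calls; the URL, credentials, and block count are placeholders you would adjust.

    import requests

    RPC_URL = "http://127.0.0.1:8332"
    RPC_AUTH = ("rpcuser", "rpcpassword")     # placeholder credentials

    def rpc(method, *params):
        payload = {"jsonrpc": "1.0", "id": "snapshot", "method": method, "params": list(params)}
        reply = requests.post(RPC_URL, json=payload, auth=RPC_AUTH).json()
        if reply.get("error"):
            raise RuntimeError(reply["error"])
        return reply["result"]

    def main_chain_txids(n_blocks=350000):
        # getblockhash only returns main-chain hashes, so orphans are skipped for free
        for height in range(n_blocks):
            block_hash = rpc("getblockhash", height)
            block = rpc("getblock", block_hash)   # verbose form: JSON with a "tx" list of txids
            for txid in block["tx"]:
                yield txid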
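For point 2, re-checking whether a 7-byte prefix still identifies every tx uniquely only takes a single pass over the txids (for example, from main_chain_txids() above):

    def check_prefix_uniqueness(txids, prefix_bytes=7):
        seen = {}                              # hex prefix -> full txid
        collisions = []
        for txid in txids:
            key = txid[: 2 * prefix_bytes]     # 2 hex characters per byte
            if key not in seen:
                seen[key] = txid
            elif seen[key] != txid:            # two different txs sharing a prefix
                collisions.append((seen[key], txid))
            # seen[key] == txid is the twin-coinbase case: the same txid seen twice
        return collisions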
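And for point 3, the reverse-scan idea in its simplest all-in-memory form looks roughly like the following. The block and tx objects here are hypothetical (an iterator yielding blocks from the tip backwards, each tx carrying a txid, its inputs as spent outpoints, and its outputs); as I said, in practice this exact form ran out of memory and the real computation ended up split across the spent/unspent files.

    def snapshot_by_reverse_scan(blocks_newest_first):
        spent_later = set()                    # outpoints already seen spent in later blocks
        snapshot = {}                          # (txid, vout) -> (addr_hash_hex, satoshis)
        for block in blocks_newest_first:
            for tx in reversed(block.txs):
                for outpoint in tx.inputs:     # coinbase inputs would be skipped here
                    spent_later.add(outpoint)
                for vout, (addr_hash_hex, satoshis) in enumerate(tx.outputs):
                    outpoint = (tx.txid, vout)
                    if outpoint in spent_later:
                        spent_later.discard(outpoint)   # created here, spent later
                    else:
                        snapshot[outpoint] = (addr_hash_hex, satoshis)
        return snapshot                        # per-address balances are then summed from this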
I hope you can find a more robust solution. Creating a snapshot was a surprisingly difficult hurdle for creating a spinoff.