Pages:
Author

Topic: A bitcoin blockchain parser in a few (thousand) lines of C++ - page 4. (Read 16979 times)

sr. member
Activity: 280
Merit: 257
bluemeanie
I've been doing Block Chain processing as well in Java.

http://i.imagebanana.com/img/g11ev2ol/Selection_018email.png

this diagram shows the famous Hal Finney transaction (block #170), which spends an output from block #9.  This output is later spent in block #181, then #182, and then spent in both #183 and #221.

I'm getting my data from the Bitcoind REST APIs though, it's a bit slower, but once the primary ETL is done then you don't have to worry much about it.

the data set though is far too large for the typical end user of Bitcoin though.


............. another cool one.  looks like a miner collected a bunch of coins into one account and then spent it to another.

http://i.imagebanana.com/img/9e42925q/Selection_019.png
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Worrying about block validation is a whole new world you don't want to get into.  For now you can simply leverage the fact that if the block is in the blk*.dat files it is valid along with all the transactions in it.   As TierNolan said though, you'll probably have a few orphans in there, so you probably need your scanning code to either compute longest chain (which is a good exercise in itself) or just not assume that 100% of the blocks in those files are in the main chain.

For a first exercise, you don't need a full script evaluation engine.  Just program in enough to recognize common TxOut scripts and ignore the non - std scripts.  Although in terms of computing balances, it's kind of arbitrary... You can store every TxOut in a map and remove them as you see TxIns that spend them.  When you're done, you should have a map containing all unspent TxOuts on the network.

Also for your analysis of looking for dormant coins being spent, look up "bitcoin days destroyed"
legendary
Activity: 1246
Merit: 1002
I also do projects like this to learn.  I figure when someone says I don't understand something, a piece of from scratch working code is a powerful answer.

My next step would be to find the code snippets in the reference client that can do the parsing, and comprehend them with my new found knowledge, then write a utility that uses the reference client code when possible.  I always do that sort of thing, I usually get frustrated mid-way through.



legendary
Activity: 1232
Merit: 1094
I suppose the question is can I parse out the transactions without also having to do the full validation?  If I have to, then so be it, if not, that's fine too.

I don't think the reference client actually writes blocks to disk unless they have been validated.

One problem is that all blocks are written to the disk when validated, so you may have some blocks from orphan chains on the disk.  You would need to exclude them from your analysis.

The easiest way would be to just take the longest chain in terms of blocks without worrying about POW.

If you ignore the last 100 blocks in the main chain, then you should get a single clean chain without having to worry about orphans.
member
Activity: 82
Merit: 10
Thanks for the feedback, I agree this is the best way to learn about the blockchain.

By 'parser' I mean something that can scan every file in the block-chain on disk and return each block in a data structure suitable for further processing.

That is all my code snippet does, nothing more and nothing less.  It's a first starting point anyone would need to do if they wanted to crack the walnut of data inside the block-chain files.

The next step is to actually interpret the data in those blocks and convert them into specific transactions.  That requires actually parsing the input and output scripts.  To do that, I have to implement the script virtual machine.  Now, I could do that by borrowing someone else's source code, or I could just write it for myself as a learning exercise.

My personal goal is to extract all of the transaction data.  I don't want to write a full client, I don't want to submit or even validate transactions.  I suppose the question is can I parse out the transactions without also having to do the full validation?  If I have to, then so be it, if not, that's fine too.

My personal motivation in doing this is that I want a way to track the embedded 'value' in all outstanding bitcoin transactions by comparing the value of the coin in US dollars at the time it was last transferred.  I'm particular interested in large quantities of BTC which haven't moved for years but suddenly 'come to life'.

Perhaps someone has already written such an analysis tool, but I haven't seen one yet. 

At any rate, I'm mostly doing this to satisfy my own personal curiosity and if someone else finds the source code useful and/or educational, that's a bonus.

So, I guess my follow up question is this.  What part of the scripts represent the actual transaction; i.e. 'This amount from this BTC address to this BTC address.' Is that in both the input and output script?  Do I have to run the full cryptography to get that answer, or is that just part of the basic information when you parse the script?

Thanks,

John
legendary
Activity: 2053
Merit: 1356
aka tonikt
sorry, what do you mean by "blockchain parser"?
of course it doesn't need any external dependencies, since it doesn't even seem to bother about needing sha256 at any place.
but myself, I cannot imagine a bitcoin blockchain parser that would not need to use sha256
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
I just wrote a bitcoin blockchain parser in a tiny code snippet.  This is implemented as just two source files, one header and one CPP, without any external dependencies.  This code snippet does not use any memory allocation, templates, containers, STL, Boost, or anything more complicated than 'fopen, fread, fclose'.

I wrote this mostly as a learning tool so I could understand the blockchain format myself.

While writing it I ran into a couple of minor issues.  One, is that sometimes a blockchain data file will just run out of data, the rest of the file containing zeros.  I don't know if this is normal or expected, but I treat it as an end-of-file condition.  

The other is that some blocks contain less data than is indicated by the block length; meaning that after all of the transactions are read in, the file pointer has not advanced as far as the 'block length' which was indicated.  I am going to assume this is normal and expected?

This code snippet parses my copy of the blockchain (9.2gb) in roughly 95 seconds; which I figured was pretty good; though I don't know what I have to compare to.

At any rate, if anyone finds this code snippet useful or just wants to better understand the data layout of the bitcoin blockchain you can find it here:

http://codesuppository.blogspot.com/2013/07/a-bitcoin-blockchain-parser-as-single.html

Feedback, bugfixes, suggestions, all welcome.

Thanks,

John



Being someone who wrote a blockchain parser from scratch (which is what Armory does on every load/rescan), I can tell you that everything about that is normal (except for the second point below).   And 95 seconds is pretty good.  It depends what else you're doing while you're scanning.  In my case, my code scans in about 150-240 seconds depending on HDD caching, but it's also identifying all the transaction scripts in order to search for coins that are related to my wallet (so I additionally parse and identify individual scripts and then do a lookup in a set structure to see if it's relevant).

  • The zeros at the end of files are due to pre-allocation.  Since Bitcoin-Qt version 0.8, they now always allocate blk*.dat files in chunks of 16 MB to reduce disk thrashing.  i.e. if you keep appending to a file over and over again, the OS is forced to keep reallocating that space, or fragmenting it.  By preallocating with zero-padding, the number of disk reallocations are cut down dramatically
  • I'm not sure what would cause the numBytes field to be incorrect.  How often do you see it?  Do you have an example?

I think doing exactly what you did is one of the best ways to learn the ins and outs of Bitcoin.  If you want to do more, I recommend you pick some addresses and try adding code to watch for that address and accumulate its balance and spendable UTXOs.
member
Activity: 82
Merit: 10
I just wrote a bitcoin blockchain parser in a tiny code snippet.  This is implemented as just two source files, one header and one CPP, without any external dependencies.  This code snippet does not use any memory allocation, templates, containers, STL, Boost, or anything more complicated than 'fopen, fread, fclose'.

I wrote this mostly as a learning tool so I could understand the blockchain format myself.

While writing it I ran into a couple of minor issues.  One, is that sometimes a blockchain data file will just run out of data, the rest of the file containing zeros.  I don't know if this is normal or expected, but I treat it as an end-of-file condition.  

The other is that some blocks contain less data than is indicated by the block length; meaning that after all of the transactions are read in, the file pointer has not advanced as far as the 'block length' which was indicated.  I am going to assume this is normal and expected?

This code snippet parses my copy of the blockchain (9.2gb) in roughly 95 seconds; which I figured was pretty good; though I don't know what I have to compare to.

At any rate, if anyone finds this code snippet useful or just wants to better understand the data layout of the bitcoin blockchain you can find it here:

http://codesuppository.blogspot.com/2013/07/a-bitcoin-blockchain-parser-as-single.html

Feedback, bugfixes, suggestions, all welcome.

Thanks,

John
Pages:
Jump to: