Pages:
Author

Topic: A bitcoin blockchain parser in a few (thousand) lines of C++ (Read 16981 times)

legendary
Activity: 1260
Merit: 1019
can this parser be used to do an audit of bitcoin total coins?

How do we know that there are exactly the number of coins in the blockchain as have been mined since the beginning?
Very simple. Sum of all unspent yet outputs.
The second question: how many bitcoins were lost?
The answer: sum of all unspent outputs which are provable unspendable
legendary
Activity: 1639
Merit: 1006
can this parser be used to do an audit of bitcoin total coins?

How do we know that there are exactly the number of coins in the blockchain as have been mined since the beginning?
hero member
Activity: 708
Merit: 502

https://code.google.com/p/blockchain/source/checkout

recently was refactored and now works grerat! thank you!
hero member
Activity: 708
Merit: 502
Hi! your tool is definitely usefull! however since blockchain become huge (~30 GB) and more then 52 mln addresses and 200+ inputs outputs - tool is no longer working ( moreover it`s not possible to recompile it changeing just:

#define MAX_BITCOIN_ADDRESSES 48000000 // 40 million unique addresses.
#define MAX_TOTAL_TRANSACTIONS 90000000 // 40 million transactions.
#define MAX_TOTAL_INPUTS 268000000 // 200 million inputs.
#define MAX_TOTAL_OUTPUTS 268000000 // 200 million outputs
#define MAX_TOTAL_BLOCKS 600000      // 1/2 million blocks.

because of it leads to Warning   16   warning C4307: '*' : integral constant overflow

these are the highest numbers tool is still compile-able however
it won`t let you poarse blockchain after.

It would be much appreciated if you can recompile it to be compatible with current blockchain size! or tips on how to do it )))
Thank you.




good news - duatiugame  refactored code and it`s working again:
https://code.google.com/p/blockchain/source/checkout

#define MAX_BITCOIN_ADDRESSES 60000000 // 60 million unique addresses.
#define MAX_TOTAL_TRANSACTIONS 70000000 // 70 million transactions.
#define MAX_TOTAL_INPUTS 250000000 // 250 million inputs.
#define MAX_TOTAL_OUTPUTS 250000000 // 250 million outputs
#define MAX_TOTAL_BLOCKS 600000      // 600,000 blocks.

hero member
Activity: 708
Merit: 502

Can this be used to compute the bitcoin rich list (list of all addresses with balance greater than x)?

Yes.



I could not find a docs on how to do it - can you plz help? for example i want to get every address from blockchain and sort it in order of higher current balance.
Thank you!
legendary
Activity: 1596
Merit: 1100
hero member
Activity: 708
Merit: 502
I just wrote a bitcoin blockchain parser in a tiny code snippet.  This is implemented as just two source files, one header and one CPP, without any external dependencies.  This code snippet does not use any memory allocation, templates, containers, STL, Boost, or anything more complicated than 'fopen, fread, fclose'.

I wrote this mostly as a learning tool so I could understand the blockchain format myself.

While writing it I ran into a couple of minor issues.  One, is that sometimes a blockchain data file will just run out of data, the rest of the file containing zeros.  I don't know if this is normal or expected, but I treat it as an end-of-file condition.  

The other is that some blocks contain less data than is indicated by the block length; meaning that after all of the transactions are read in, the file pointer has not advanced as far as the 'block length' which was indicated.  I am going to assume this is normal and expected?

This code snippet parses my copy of the blockchain (9.2gb) in roughly 95 seconds; which I figured was pretty good; though I don't know what I have to compare to.

At any rate, if anyone finds this code snippet useful or just wants to better understand the data layout of the bitcoin blockchain you can find it here:

http://codesuppository.blogspot.com/2013/07/a-bitcoin-blockchain-parser-as-single.html

Feedback, bugfixes, suggestions, all welcome.

Thanks,
John

Hi! your tool is definitely usefull! however since blockchain become huge (~30 GB) and more then 52 mln addresses and 200+ inputs outputs - tool is no longer working ( moreover it`s not possible to recompile it changeing just:

#define MAX_BITCOIN_ADDRESSES 48000000 // 40 million unique addresses.
#define MAX_TOTAL_TRANSACTIONS 90000000 // 40 million transactions.
#define MAX_TOTAL_INPUTS 268000000 // 200 million inputs.
#define MAX_TOTAL_OUTPUTS 268000000 // 200 million outputs
#define MAX_TOTAL_BLOCKS 600000      // 1/2 million blocks.

because of it leads to Warning   16   warning C4307: '*' : integral constant overflow

these are the highest numbers tool is still compile-able however
it won`t let you poarse blockchain after.

It would be much appreciated if you can recompile it to be compatible with current blockchain size! or tips on how to do it )))
Thank you.

full member
Activity: 194
Merit: 100

Can this be used to compute the bitcoin rich list (list of all addresses with balance greater than x)?
member
Activity: 72
Merit: 10
is this topic alive? i'm trying to compile your software with gcc 4.5 on OS X Mavericks, but I get these errors:

Code:
$ gcc *.cpp -o blockchain.out
BlockChain.cpp:2619:11: warning: enumeration value 'AM_LAST' not handled in switch [-Wswitch]
        switch ( m )
                 ^
1 warning generated.
main.cpp:320:116: warning: format specifies type 'int' but the argument has type 'float' [-Wformat]
  ...%d bitcoin addresses with a balance of more than %d bitcoins.\r\n", tcount, mMinBalance );
                                                      ~~                         ^~~~~~~~~~~
                                                      %f
main.cpp:334:109: warning: format specifies type 'int' but the argument has type 'float' [-Wformat]
  ...oldest bitcoin addresses with a balance of more than %d bitcoins.\r\n", tcount, mMinBalance );
                                                          ~~                         ^~~~~~~~~~~
                                                          %f
main.cpp:348:124: warning: format specifies type 'int' but the argument has type 'float' [-Wformat]
  ...older than %d days with a balance of more than %d bitcoins.\r\n", zdays, mMinBalance );
                                                    ~~                        ^~~~~~~~~~~
                                                    %f
main.cpp:418:17: warning: enumeration value 'SR_LAST' not handled in switch [-Wswitch]
                                                        switch ( mStatResolution )
                                                                 ^
main.cpp:394:12: warning: enumeration values 'CM_NONE' and 'CM_EXIT' not handled in switch [-Wswitch]
                switch ( mMode )
                         ^
5 warnings generated.
Undefined symbols for architecture x86_64:
  "std::__1::__vector_base_common::__throw_length_error() const", referenced from:
      void std::__1::vector >::__push_back_slow_path(BlockChainAddresses::StatAddress const&) in BlockChainAddresses-837c2d.o
  "std::terminate()", referenced from:
      ___clang_call_terminate in BlockChain-69c2db.o
      ___clang_call_terminate in BlockChainAddresses-837c2d.o
      ___clang_call_terminate in main-b5bd23.o
  "vtable for __cxxabiv1::__class_type_info", referenced from:
      typeinfo for BlockChain in BlockChain-69c2db.o
      typeinfo for QuickSortPointers in BlockChain-69c2db.o
      typeinfo for BlockChainAddresses in BlockChainAddresses-837c2d.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "vtable for __cxxabiv1::__si_class_type_info", referenced from:
      typeinfo for BlockChainImpl in BlockChain-69c2db.o
      typeinfo for SortByAge in BlockChain-69c2db.o
      typeinfo for SortByBalance in BlockChain-69c2db.o
      typeinfo for BlockChainAddressesImpl in BlockChainAddresses-837c2d.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "operator delete[](void*)", referenced from:
      SimpleHash::init() in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      BitcoinTransactionFactory::printOldest(unsigned int, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::printTopBalances(unsigned int, float) in BlockChain-69c2db.o
      SimpleHash::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveStatistics(bool, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveAddressesOverTime() in BlockChain-69c2db.o
      ...
  "operator delete(void*)", referenced from:
      createBlockChain(char const*) in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      createBlockChainAddresses(char const*) in BlockChainAddresses-837c2d.o
      BlockChainAddressesImpl::~BlockChainAddressesImpl() in BlockChainAddresses-837c2d.o
      std::__1::__vector_base >::~__vector_base() in BlockChainAddresses-837c2d.o
      std::__1::__split_buffer&>::~__split_buffer() in BlockChainAddresses-837c2d.o
      BlockChainAddresses::~BlockChainAddresses() in BlockChainAddresses-837c2d.o
      ...
  "operator new[](unsigned long)", referenced from:
      BlockChainImpl::buildBlockChain() in BlockChain-69c2db.o
      SimpleHash::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::printOldest(unsigned int, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::printTopBalances(unsigned int, float) in BlockChain-69c2db.o
      SimpleHash::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveStatistics(bool, float) in BlockChain-69c2db.o
      BitcoinTransactionFactory::saveAddressesOverTime() in BlockChain-69c2db.o
      ...
  "operator new(unsigned long)", referenced from:
      createBlockChain(char const*) in BlockChain-69c2db.o
      createBlockChainAddresses(char const*) in BlockChainAddresses-837c2d.o
      std::__1::__split_buffer&>::__split_buffer(unsigned long, unsigned long, std::__1::allocator&) in BlockChainAddresses-837c2d.o
  "___cxa_begin_catch", referenced from:
      ___clang_call_terminate in BlockChain-69c2db.o
      ___clang_call_terminate in BlockChainAddresses-837c2d.o
      ___clang_call_terminate in main-b5bd23.o
  "___cxa_pure_virtual", referenced from:
      vtable for QuickSortPointers in BlockChain-69c2db.o
      vtable for BlockChain in BlockChain-69c2db.o
      vtable for BlockChainAddresses in BlockChainAddresses-837c2d.o
  "___gxx_personality_v0", referenced from:
      createBlockChain(char const*) in BlockChain-69c2db.o
      BlockChainImpl::BlockChainImpl(char const*) in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      SimpleHash::init() in BlockChain-69c2db.o
      BlockChainImpl::~BlockChainImpl() in BlockChain-69c2db.o
      SimpleHash::init() in BlockChain-69c2db.o
      BitcoinTransactionFactory::gatherStatistics(unsigned int, unsigned int, bool) in BlockChain-69c2db.o
      ...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
member
Activity: 82
Merit: 10
Yes, I use the STL a lot, and mostly for that reason, because it's 'standard'.  However, I also try to avoid using it if I don't have to.  I won't go and write my own container class when the same container class already exists in STL but, on the other hand, if I can get away without using a container class at all, that's fine too.  I really try to keep it out of public API's if at all possible.  It just makes the code harder to follow I believe; so I typically might use it for internal implementation code but keep the public facing API as simple as possible.

The problems with the STL abound though, and they have been discussed a great deal online.  At my professional work, we have an internal standard not to use the STL.  Instead we have our own customized templates which we have highly optimized.

Our code here has to run on the following platforms:

Win32/Win64
Linux (many flavors)
Wii
Wii-U
Xbox-360
Xbox-720
PS3
PS4
Android
IOS

It's pretty crazy having to target high performance physics code on all of those platforms at the same time.

We also have a requirement that we capture every single memory allocation ever performed and never use any exception handling anywhere in the code base.

Game engines have some pretty high performance requirements, and the STL and exception handling usually don't fit this pattern.

That said, for my open source work, I always use the STL if I need a container.

John
member
Activity: 82
Merit: 10
Me communicate not so good.

I haven't reimplemented *any* cryptographic algorithms.  All I did was copy the two cryptographic routines I needed, and maybe slightly refactored the interface to them.

I just want the code I need and not the code I don't need and in a form which will compile without dragging in dependencies on hundreds of other source files and libraries (i.e. STL, BOOST, networking code, etc.)

To repeat, I haven't reimplemented any cryptographic algorithm, I've just tried to organize the source so I can easily build and understand it.

As I said before, if SHA256 were part of the C99 standard and built into every C/C++ compiler, that would be fine. But, it's not.

And I get that you guys have no problem building things like 'opensll' just because you need access to one or two routines; but that's not really my coding style.  By pulling out just the two routines I need into a standalone code snippet I am not 'reimplementing' anything, I'm just cleaning up an existing implementation to make it more usable and accessible and easier to understand.

After running CLOC on openSSL it reports a total of 1,589 files; 946 C files, 169 PERL scripts, 262 C/C++ header files, 76 makes files, 13 assembly files, 61 shell scripts, and a total of 234,537 lines of C code and 69,175 lines of code in header files.

In contrast, the code for *just* SHA256 and RIPMD160 the only two hash routines needed to parse the blockchain (which was my project goal) are 750 lines of code.

If you want to add a quarter million line source code project to your little utility program just so you can access 750 lines of code, go for it.  But, as I said, that's not my style.

John

P.S. For what it's worth, cryptopp doesn't seem as bad.  It's 270 source files and 'only' 56,000 lines of source.
legendary
Activity: 1596
Merit: 1100
I'm a senior software engineer and I have been working on computer games and multi-platform development since 1979.

I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

There is little engineering upside to reimplementing highly complex cryptographic algorithms on your own, given the levels of engineering review and crypt-analysis of your own codebase versus an existing crypto lib.

It might be personally satisfying, but it makes little sense unless you are truly an expert crypto mathematician.



legendary
Activity: 1890
Merit: 1086
Ian Knowles - CIYAM Lead Developer
I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

I do agree with the "minimal" approach although the STL and exceptions are *standard* C++ (if your compiler doesn't support these then it is not a C++ compiler at all but probably one of those horrible non-standard EC++ compilers from the 90's).

BTW if you want a fairly minimal SHA256 (one source and one header) then there is one in the CIYAM project here: https://github.com/ciyam/ciyam/blob/master/src/sha256.cpp and here: https://github.com/ciyam/ciyam/blob/master/src/sha256.h.
legendary
Activity: 1246
Merit: 1002

I'm a senior software engineer and I have been working on computer games and multi-platform development since 1979.

I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

John


+1

Make it as simple as possible, but not simpler.
member
Activity: 82
Merit: 10
Let me reform the question using your analogy.

Let's say I'm working on a project and I need a log function.  Now, let's suppose that the log function (like SHA256) is not part of the C99 standard and built into the set of default libraries available in all compatible C++ compilers.

So, what if the only way to get access to a 'log' function is that I have to download some math library comprising 100,000 lines of source code with dependencies on streams, network code, a specific compiler, a complex build and configuration requirement, BOOST, the STL, exception handling, and god knows what else.

Because I needed access to 'log'.

Given the choice between dragging that into my code base or just copying only the 'log' function source code I would choose the much simpler solution.

People don't want to learn about algorithms by downloading 100,000 lines of source code libraries.

That's my position, and when I check with many of my colleagues in the computer game industry (where I come from) that tends to be our preference.

I'm a senior software engineer and I have been working on computer games and multi-platform development since 1979.

I just really prefer small code bases with minimal dependencies that are compiler and operating system agnostic.

John
member
Activity: 82
Merit: 10
Yeah, I know all about hashes. My issue isn't about any particular hash algorithm, my issue is having to download and build a massive library just to get access to the *one* I need.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Well, I'm not trying to write a full client, just something that can extract all of the transactions.

For that, as far as I can tell, I just need the two hash routines. In fact I think I'm getting pretty close to finishing this thing up.

Also, to be clear my coding website is largely educational. I'm writing this code to educate myself and releasing it open source to educate others.
Thanks,

John

Don't get caught up going down tangent paths.  Your issues with hashing sound much like someone who's never seen a logarithm before.  Reimplementing the log() function in C++ is not very useful nor educational, unless your goal is compiler optimizations or embedded software.  It's more important to understand the purpose and properties of the logarithm (hashing), than to distract those you are trying to educate with unnecessary details.  If I told you I didn't know what a logarithm but I needed it for my financial software, and I was working on implementing my own version of logarithm, you'd probably have the same reaction:  "go look up what a logarithm does and why it's important, and then just #include and focus on which properties of it are needed by the financial software and why."   

Hashing is the one of the most fundamental and important atoms of cryptographic systems.  Especially Bitcoin.  And implementing it from scratch is not likely to give you any insight into how it works or how it fits into Bitcoin.   What you need to understand is that it's a function that takes any input, completely mangles it, and produces a seemingly-random X-byte output.  It must always produce the same answer for identical input, but change any bit should look equivalent to producing a random value.   It is intentionally non-invertible (i.e. SHA256 can take in any size string and always spits out 32 bytes), and each output bit should have no correlation with any input bit. 

In the context of Bitcoin transactions, it's goal is to produce a unique 32-byte identifier for a transaction, regardless of how many kB it is.  If any byte of the transaction is changed, the hash will look like a different 32-byte random number.  In the context of headers, this "randomness" to create a sort of lottery for miners.  Since we know the hash function should produce essentially-uniform randomness, we know that for a given difficulty and hashrate on the network, it should take approx 10 min on average for someone to find a lucky hash.

Similarly, if you go any further with this, you're going to need an ECDSA library in addition to hashing.  While elliptic curves are mathematically neat, you would be best to not reimplement it yourself unless you are working on speed-optimizing it.  If you think publickey->address was hard, try privatekey->publickey! (no, don't try it, just get a library and understand the properties of it).



member
Activity: 82
Merit: 10
I did one more cleanup pass/revision before the end of today.

There is now a code snippet which does just base58 encoding.  It's based on the source code in 'cbitcoin' but I removed all memory allocation.

I also revised the BitcoinAddress snippet to go from public-key to bitcoin binary address, and from a bitcoin binary address to ASCII and from ASCII back to binary (checked).

Finally, my original snippet folds in all of this code to parse the bulk of the blockchain.  I still haven't yet assembled all of the individual transactions, I would do that the next time I find some time to spend on this effort.

All that said, I think these individual snippets, which are pretty well documented, will hopefully prove useful educational tools for future developers who won't have to go through the same trouble I went through figuring out how all of these pieces fit together.

Like a previous poster said, I prefer looking at source code as documentation than a Wiki page; so hopefully others will find this useful too.

http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html
Pages:
Jump to: