Pages:
Author

Topic: A bitcoin blockchain parser in a few (thousand) lines of C++ - page 2. (Read 16979 times)

member
Activity: 82
Merit: 10
Well, I'm not trying to write a full client, just something that can extract all of the transactions.

For that, as far as I can tell, I just need the two hash routines. In fact I think I'm getting pretty close to finishing this thing up.

Also, to be clear my coding website is largely educational. I'm writing this code to educate myself and releasing it open source to educate others.
Thanks,

John
Thanks, I prefer source too.

On further thought I have decided I'm going to retitle this post as a block-chain parser in a couple thousand lines of code; but it will still be just one CPP file when I'm done.

Actually it was kind of hard, it took me hours just to gather together the code to do SHA-256, RPMD-160, and base58cheked without dragging in massive libraries and dependencies.

I'm going to make a new Google code project called 'bitcoin-snippets' which will contain just the source to do each one of these specific operations.

Thanks,

John

Don't forget this is a crypto currency.  There is no way around having a general purpose crypto library available if you're going to do any coding in Bitcoin.  Just like you have standard headers in C++ for various data structures and algorithms, you are going to need these crypto algorithms available.   You're wasting your time reimplementing it, because it's all actually very simple, standard crypto operations (just combined in a creative way).  Once you have those standard operations available and you understand what they do, much of this will be much simpler.  Hashing is the bread and butter of Bitcoin, just get the libraries and do it.

And no one ever said bitcoin programming was easy, but you're taking the right size steps and you will get there with some patience.  What you are doing is exactly how I started two years ago and I think it's a fantastic way to learn.  You just have to appreciate that there's a lot to learn and you're going to spend a ton of time confused and digging for clarity.   That's part of the fun :-)


legendary
Activity: 2053
Merit: 1356
aka tonikt
It's true.
Depends how far you want to go, you would eventually also need ECDSA - which is much bigger piece of code than the hashing funcs and the big numbers..
So if you really care to have a small code, you will eventually prefer to link it against an external crypto lib, instead of turnign your block parser into a crypto lib.
Probably openssl is best for your purposes, since this is the most common library having everything that a blockchain parser needs.
And you can just download a built lib, for any platform/compiler you need.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
Thanks, I prefer source too.

On further thought I have decided I'm going to retitle this post as a block-chain parser in a couple thousand lines of code; but it will still be just one CPP file when I'm done.

Actually it was kind of hard, it took me hours just to gather together the code to do SHA-256, RPMD-160, and base58cheked without dragging in massive libraries and dependencies.

I'm going to make a new Google code project called 'bitcoin-snippets' which will contain just the source to do each one of these specific operations.

Thanks,

John

Don't forget this is a crypto currency.  There is no way around having a general purpose crypto library available if you're going to do any coding in Bitcoin.  Just like you have standard headers in C++ for various data structures and algorithms, you are going to need these crypto algorithms available.   You're wasting your time reimplementing it, because it's all actually very simple, standard crypto operations (just combined in a creative way).  Once you have those standard operations available and you understand what they do, much of this will be much simpler.  Hashing is the bread and butter of Bitcoin, just get the libraries and do it.

And no one ever said bitcoin programming was easy, but you're taking the right size steps and you will get there with some patience.  What you are doing is exactly how I started two years ago and I think it's a fantastic way to learn.  You just have to appreciate that there's a lot to learn and you're going to spend a ton of time confused and digging for clarity.   That's part of the fun :-)

member
Activity: 82
Merit: 10
Thanks, I prefer source too.

On further thought I have decided I'm going to retitle this post as a block-chain parser in a couple thousand lines of code; but it will still be just one CPP file when I'm done.

Actually it was kind of hard, it took me hours just to gather together the code to do SHA-256, RPMD-160, and base58cheked without dragging in massive libraries and dependencies.

I'm going to make a new Google code project called 'bitcoin-snippets' which will contain just the source to do each one of these specific operations.

Thanks,

John
legendary
Activity: 2053
Merit: 1356
aka tonikt
Good job.
It wasn't that hard after all, was it?

In many cases you will also want to decode the address from txout, while you do not know the public key yet.
I said before that it isn't really possible to represent each possible out script as an address, but these days most (like 99%) of the output scripts follow a standard pattern, that is:
Code:
0x76 0xa9 0x14 <20-bytes-of-addr's-RIMED> 0x88 0xAC

So from the 20 bytes you can already figure out address where the money is being sent to, yet without having its public key (that only comes later, if this output is being spent).

I once made a tool that is able to decode a raw tx data and dump the output addresses.
It needs openssl to link, but maybe you will find the code itself educational - some people prefer reading source code, over reading a wiki. Smiley
Code:
/*
gcc bctrans.c -o bctrans.exe -lcrypto -I /local/ssl/include -L /local/ssl/lib
*/

#include
#include
#include
#include
#include
#include


static unsigned char addr_version = 0x00;
static FILE *f = NULL;

BIGNUM *bn58, dv, mo;
BN_CTX *ctx;


#define SHA256(p,l,o) { \
SHA256_CTX shactx; \
SHA256_Init(&shactx); \
SHA256_Update(&shactx, (p), (l)); \
SHA256_Final((o), &shactx); }


void readfile(unsigned char *p, int len) {
if (!f) {
int c, i;
char s[3];
while (len>0) {
for (i=0;i<2;) {
c = getchar();
if (c==EOF) {
fprintf(stderr, "File too short\n");
exit(1);
}
c = tolower(c);
if (c<='9' && c>='0' || c<='f' && c>='a')  s[i++] = (char)c;
}
s[2] = 0;
sscanf(s, "%x", &c);
*p = (unsigned char)c;
p++;
len--;
}
} else {
if (fread(p, 1, len, f)!=len) {
fprintf(stderr, "File too short\n");
fclose(f);
exit(1);
}
}
}

unsigned long long getle(unsigned char *p, int bytes) {
unsigned long long res=0;
while (bytes--) {
res|= ((unsigned long long)(p[bytes]))<<(8*bytes);
}
return res;
}

unsigned long long getvl() {
unsigned char b[8];
readfile(b, 1);
switch (*b) {
case 0xfd:
readfile(b, 2);
return getle(b, 2);
case 0xfe:
readfile(b, 4);
return getle(b, 4);
case 0xff:
readfile(b, 8);
return getle(b, 8);
}
return *b;
}


void prhash(unsigned char *p, unsigned int l) {
while (l--) printf("%02x", p[l]);
}


void hexdump(unsigned char *p, unsigned int l) {
while (l--) printf("%02x", *p++);
}


void printbtcaddr(unsigned char *p) {
static const char *chrs = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";
unsigned char mod;
char out[64];
int i=0, j=0;
BIGNUM *bn = BN_bin2bn(p, 25, NULL);
while (!BN_is_zero(bn)) {
BN_div(&dv, &mo, bn, bn58, ctx);
if (BN_bn2bin(&mo, &mod)==0)  mod = 0;
out[i++] = chrs[mod];
BN_copy(bn, &dv);
}
BN_free(bn);
while ((*p)==0) {
putchar('1');
p++;
}
while (i--)  putchar(out[i]);
}


int main(int argc, char * argv[]) {
static unsigned char buf[0x10000];
unsigned long long i, sl, txcnt, v;
unsigned va, vb;
int x;
long fpos;
char *fname = NULL;

for (x=1; x if (strcmp(argv[x], "-t")==0) {
addr_version = 0x6f; // testnet
} else {
fname = argv[x];
}
}

if (!fname) {
printf("Enter transactions hexdump data:\n");
} else {
f = fopen(fname, "rb");
if (!f) {
fprintf(stderr, "File %s not found\n", fname);
return 1;
}
}

readfile(buf, 4);
printf("Version: %llu\n", getle(buf, 4));

txcnt = getvl();
printf("TX IN cnt: %llu\n", txcnt);
for (i=0; i readfile(buf, 36);
sl = getvl();

printf("%6d) : ", (int)i);
prhash(buf, 32);
printf(" Idx=%2lld  sl=%lld", getle(buf+32, 4), sl);
readfile(buf, sl);
readfile(buf, 4);

printf(" seq=%x\n", (unsigned)getle(buf, 4));
}

txcnt = getvl();
printf("TX OUT cnt: %llu\n", txcnt);

ctx = BN_CTX_new();
bn58 = BN_bin2bn("\x3a", 1, NULL);
BN_init(&dv);
BN_init(&mo);

for (i=0; i readfile(buf, 8);
sl = getvl();
v = getle(buf, 8);
va = (unsigned)(v/100000000LL);
vb = (unsigned)(v%100000000LL);
printf("%6d) : %7u.%08u BTC", (int)i, va, vb, sl);
readfile(buf, sl);
if (sl!=25 || memcmp(buf, "\x76\xa9\x14", 3) || buf[23]!=0x88 || buf[24]!=0xac) {
printf("  WARNING! Unexpected SIG_SCRIPT:\n"); hexdump(buf, sl);
} else {
unsigned char dat[25];
unsigned char sha[SHA256_DIGEST_LENGTH];
unsigned char sha2[SHA256_DIGEST_LENGTH];
dat[0] = addr_version; // version
memcpy(dat+1, buf+3, 20);
SHA256(dat, 21, sha);
SHA256(sha, SHA256_DIGEST_LENGTH, sha2);
//printf("  chsum:%02x%02x%02x%02x", sha2[0], sha2[1], sha2[2], sha2[3]);
memcpy(dat+21, sha2, 4);
printf("  to address "); printbtcaddr(dat);
}
putchar('\n');
}

BN_free(bn58);
BN_CTX_free(ctx);

readfile(buf, 4);
printf("Lock Time: %llu\n", getle(buf, 4));

if (f) {
fpos = ftell(f);
fseek(f, 0, SEEK_END);
if (fpos!=ftell(f)) {
printf("WARNING!!! File too long. Only %ld bytes expected (%ld too many)\n",
fpos, ftell(f)-fpos);
} else {
printf("File size checked OK - %ld bytes\n", fpos);
}
fclose(f);
}

return 0;
}
member
Activity: 82
Merit: 10
Wow, that was shockingly complicated.

For anyone in the future who ever has to/wants to convert a public key signature (65 byte binary embedded in most bitcoin scripts) to the ASCII representation of a bitcoin address, I have written a reference implementation that does this in a single CPP file.

I still plan to do some code cleanup on this, but it does work.

Basically, think of this as a reference implementation of what is described on this page https://en.bitcoin.it/wiki/Technical_background_of_version_1_Bitcoin_addresses

Here is the header file with quite a bit of corresponding documentation.

https://code.google.com/p/blockchain/source/browse/trunk/BitcoinAddress.h

Here is the implementation.  I still plan to do some code cleanup on this, but the main point is that you can go from the 65 byte public-key signature to the bitcoin ASCII address with no external dependencies to any other code.  

https://code.google.com/p/blockchain/source/browse/trunk/BitcoinAddress.cpp

I hope someone else finds this useful as it took me quite a bit of work to pull the necessary code to do this from multiple sources.

As I said, I still plan to do some cleanup on this code; mostly cleaning up the 'bignum' methods which are a little sloppy right now; it does completely unnecessary memory allocations and I want to remove all of that.  The hash routines are pretty clean already and, hopefully, work well on lots of platforms.

I've only built this code with Microsoft Visual Studio.  If there are compile errors with GCC, please let me know and I will revise it accordingly.

Seriously, if you dump this one CPP/H into a project and it fails to compile, I would like to know.

Thanks,

John
legendary
Activity: 2053
Merit: 1356
aka tonikt
Interesting, so how come on the block-explorer website it shows public-key information for all of the blocks?
Sorry, I don't understand this question.
member
Activity: 82
Merit: 10
Interesting, so how come on the block-explorer website it shows public-key information for all of the blocks?
legendary
Activity: 2053
Merit: 1356
aka tonikt
printing the address from txout is not something that you can really do in a reliable and ultimate way.
the outputs scripts can be anything. most of them follow the usual patters, but they don't have to really follow any limited number of patterns - in reality there is no such thing as output address of a bitcoin transaction, there is only an output script - if you can hexdump this one, for some people it would be as good as printing some address.
legendary
Activity: 1232
Merit: 1094
That would let us get back to hashing without forking the chain.  I think a chain fork is a very big step.

Hmm, this is kind of off-topic for the thread.  In the end, a non-fork option would mean that you need to match the POW done.

I was just getting ready to make it print out the public-key in ASCII when I found it that doing that is a giant pain in the ass.

You could print out in hex, as a start.

Quote
You need to have access to a 'big-number' library.  I refactored the version that is in 'cbitcoin' but it's still not producing a valid ASCII string for a given binary input address.  I'm not sure why, but it's late in the day and that's enough for now.

It is always worth checking that you have the endian-ness right.  Also, there are version numbers too.
member
Activity: 82
Merit: 10
For those of you interested in seeing the progress of this project; I just uploaded my current work.  The report of what's changed and what this update contains is located at this link on my personal coding blog.  

http://codesuppository.blogspot.com/2013/07/work-in-progress-on-bitcoin-blockchain.html

This afternoon I started focusing on getting my 'parser' to display the same data which shows up on the block-explorer website; and I made good progress on that.

I was just getting ready to make it print out the public-key in ASCII when I found it that doing that is a giant pain in the ass.

You need to have access to a 'big-number' library.  I refactored the version that is in 'cbitcoin' but it's still not producing a valid ASCII string for a given binary input address.  I'm not sure why, but it's late in the day and that's enough for now.

In keeping with my personal coding style/preference all of the code is in 'snippet' form.

That means it's got a single header file which presents a minimal public API and then a single CPP with the implementation of that API.  There are no external dependencies.

So far, I've got the main BlockChain parser in snippet form, also hashes for SHA256 and RIPEMD160, and a snippet that is supposed to convert a bitcoin address from ASCII to binary and from binary to ASCII, but it's not yet fully working.  I have the start of a bitcoin scripting engine; but it doesn't parse much yet.

Thanks for all the help, I don't know when I'll find time to keep working on this, but so far it's been fun.

John
legendary
Activity: 1246
Merit: 1002

If you are changing the client anyway, then you could just force a difficulty change. 

That is what some coins do after they lose lots of hashing power.

That would let us get back to hashing without forking the chain.  I think a chain fork is a very big step.

legendary
Activity: 1232
Merit: 1094
I don't know if he traded them or not.  If he did, I would want to have the miner's block payout to whatever address has coins derived from his.  That will take some analysis, which is how I ended up in this thread.  I have mixed feelings about not replacing his coins.

I also wanted to choose what times-tamps to use on the forked blocks, instead of "now."  I haven't read the client's source.

If you are changing the client anyway, then you could just force a difficulty change. 

That is what some coins do after they lose lots of hashing power.
legendary
Activity: 1246
Merit: 1002
I thought it might be interesting to do some experiments on that block chain, to go back to the next to last block at difficulty 8,000 and produce a fork.  I would attempt to repair the chain to the extent possible.  It should be possible to mine new blocks that honored previous payouts (or possibley not, in the case of the grief-miner)

That is a chain fork, since the new "official" chain would have lower POW.  If you are modifying the client, you could just put a blacklist check for the first high difficulty block.

Presumably, the miner traded his coins before departing?

It is a curse of alt coins that hashing is so variable.

I don't know if he traded them or not.  If he did, I would want to have the miner's block payout to whatever address has coins derived from his.  That will take some analysis, which is how I ended up in this thread.  I have mixed feelings about not replacing his coins.

I also wanted to choose what times-tamps to use on the forked blocks, instead of "now."  I haven't read the client's source.
legendary
Activity: 1232
Merit: 1094
I thought it might be interesting to do some experiments on that block chain, to go back to the next to last block at difficulty 8,000 and produce a fork.  I would attempt to repair the chain to the extent possible.  It should be possible to mine new blocks that honored previous payouts (or possibley not, in the case of the grief-miner)

That is a chain fork, since the new "official" chain would have lower POW.  If you are modifying the client, you could just put a blacklist check for the first high difficulty block.

Presumably, the miner traded his coins before departing?

It is a curse of alt coins that hashing is so variable.
legendary
Activity: 1246
Merit: 1002

Simple version:
It is possible at any point in time there are multiple conflicting blockchains.  The "longest"* chain is considered the consensus record.
* longest in this instance doesn't mean just highest block height (as that can be spoofed) it means the chain with highest aggregate difficulty.


I want to produce just such a spool in the bytecoin blockchain.  Someone mined blocks at the rate of 1 every minute or two at difficulty 8,000, and then quit mining when the difficulty climbed to 32,000.  Since that transition to higher difficulty on about May 5 (two months) only about 250 blocks have been mined.  This past week, 1 single block was mined.

I thought it might be interesting to do some experiments on that block chain, to go back to the next to last block at difficulty 8,000 and produce a fork.  I would attempt to repair the chain to the extent possible.  It should be possible to mine new blocks that honored previous payouts (or possibley not, in the case of the grief-miner)

donator
Activity: 1218
Merit: 1079
Gerald Davis
Older blocks are never rewritten but it is possible there are 2 or more competing chains.

Example the last block is block height 123

One miner solves a block with height height of 124 (it references the block hash 123 as the prior blockhash).  We will call this block 124a.
At roughly the same time a different miner solves a different block with a block height 124.  We will call this block 124b.

At this point the network is not in consensus. Some nodes believe block 124a is the next "valid" block and some believe 124b is.

Eventually some node will solve block 125.  If the block is built on the 124a chain then block 124b becomes orphaned. If it is built on the 124b chain then block 124a is orphaned.

If your node had received block 124a before it became orphaned it will be contained in your block history.  When block 124b orphans block 124a it will also be contained in your block history.  You can detect this because both blocks will have the block hash for block 123 as the prior block hash.

In your block parser you can built a simplified system to remove orphans by working backwards.  bitcoind will tell you the hash of the last block of the longest chain.  Working from that block you can locate the prior block hash field in the header and then locate the corresponding block, and work all the way back to the genesis blocks.  Any other blocks are orphans and for the purpose of parsing the historical blockchain can be pruned.


Simple version:
It is possible at any point in time there are multiple conflicting blockchains.  The "longest"* chain is considered the consensus record.
* longest in this instance doesn't mean just highest block height (as that can be spoofed) it means the chain with highest aggregate difficulty.


member
Activity: 82
Merit: 10
Ok, I believe I am unblocked now on finishing up my parser/transaction tool.  The piece that I was missing was the transaction hash.

Each input refers to a 'transactionHash' however there is no transaction hash stored in the blockchain .dat files.

After some google searching I discovered that the transactionHash is computed by running SHA256 on the raw transaction input buffer.  Through the magic of SHA256 no two transactions are presumed to ever produce identical hashes.  I don't need to know what 'block' the transaction is in, as the transaction data is a complete standalone dataload unto itself, unique in the world of bitcoin.

This was the missing piece of the puzzle for me because I didn't realize that I had to compute this hash when I parsed the input file, at least, if I ever had any hope of binding inputs to the correct transactions.

A couple of general questions.  Are transactions always written to the last block in the block chain?  In other words, old blocks never get re-written?  I would assume this is the case otherwise the blockchain would become fragmented and need garbage collection, etc.

Thanks,

John
legendary
Activity: 1232
Merit: 1094
Thanks, I looked at that before.  At first it seems close, but it has dependencies on openssl, some google hash map template code, and only builds and runs on unix.  I'm working on windows personally.

You need to be able to do sha256, if you want to work out hashes.  You could probably find code for just that.

OpenSSL is also used for all the other crypt stuff, but you don't need that.

You also need the base58 encoder to convert public keys (after hashing with RIPE-160) into bitcoin text addresses.
member
Activity: 82
Merit: 10
Thanks, I looked at that before.  At first it seems close, but it has dependencies on openssl, some google hash map template code, and only builds and runs on unix.  I'm working on windows personally.

I suspect I'm really close to having this all figured out.  It's not that I don't get the basic concepts of the transactions etc. I'm just unclear how some of these hashes are computed and how they relate to one another.

John
Pages:
Jump to: