Pages:
Author

Topic: A bitcoin blockchain parser in a few (thousand) lines of C++ - page 3. (Read 16960 times)

legendary
Activity: 2128
Merit: 1073
Does anyone know of a C/C++ program someone else has already written that does what I'm trying to do here; just walk the block-chain and track transactions?  All of these code bases have sooo...many dependencies that you dive into crytographic hell trying to decipher them.
Yes, user znort987 did just that. But then he couldn't stand the pressure, deleted the key posts and quit participating in the forum.

His thread starts here:

https://bitcointalksearch.org/topic/--88584

And the source code is still available:

https://github.com/znort987/blockparser

You may want to check his thread and his post to understand a bit of what had happened and avoid any possibility of future stress or nervous breakdown.
donator
Activity: 1218
Merit: 1079
Gerald Davis
Yeah, I'm starting to no longer have fun doing this.  The relationship between the inputs and outputs are about as clear as mud.  Well, let me be clear, I get that each output needs an input, but hell if I know how to figure out which outputs refer to which inputs.  Just a bit ago, I wanted to figure out how to convert the binary representation of a public-key to ASCII and even that it had a log of dependencies; even the 'no library dependency version' depended on some network thing called 'htonl' and on an encryption routine as well.  That's a lot of code drug in just to convert some binary to ASCII.

They don't.  That might be why you are confused.

The only relationship between inputs and outputs in a tx is is that the sum of the inputs must be equal to or greater than the sum of the outputs.  The tx fee is the difference between the sum of the inputs and the sum of the outputs.

So some rules to get you started.

1) The inputs of all txs (except coinbase = generation) are the output of prior transactions.
2) There is a 1:1 relationship between the INPUT of a given transaction and the output of a PRIOR transaction.
3) In a given transaction there is no relationship between the inputs and outputs.
4) For a tx to be valid the sum of the inputs must be equal to or greater than the sum of the outputs.
5) The difference between inputs and outputs is the tx fee.
6) Since a tx can have n inputs and m outputs we refer to a specific output by both the tx hash AND the index.



Quote
but converting that raw date into paired input/output transactions seems to be a bit of a mystery to me so far.

It will forever remain a mystery no matter how many libraries you use until you shake that flawed model.


Lets look at a random transaction.

http://blockexplorer.com/tx/a6d9c176ecb041c2184327b8375981127f3632758a7a8e61b041343efc3bcb6e

In raw form
Quote
{
  "hash":"a6d9c176ecb041c2184327b8375981127f3632758a7a8e61b041343efc3bcb6e",
  "ver":1,
  "vin_sz":1,
  "vout_sz":2,
  "lock_time":0,
  "size":257,
  "in":[
    {
      "prev_out":{
        "hash":"b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325",
        "n":0
      },
      "scriptSig":"304402200e3d4711092794574e9b2be11728cc7e44a63525613f75ebc71375f0a6dd080d02202ef 1123328b3ecddddb0bed77960adccac5bbe317dfb0ce149eeee76498c19b101 04a36b5d3b4caa05aec80752f2e2805e4401fbdbe21be1011dc60c358c5fc4d3bedd1e03161fb4b 3a021c3764da57fee0d73570f3570f1b3dd92a1b06aae968846"
    }
  ],
  "out":[
    {
      "value":"300.00000000",
      "scriptPubKey":"OP_DUP OP_HASH160 0331e5256416bc11ecf9088091f8424819553a10 OP_EQUALVERIFY OP_CHECKSIG"
    },
    {
      "value":"699.99950000",
      "scriptPubKey":"OP_DUP OP_HASH160 4186719d739ae983d8c75a0cb82958e94b7ae81e OP_EQUALVERIFY OP_CHECKSIG"
    }
  ]
}


Now the tx hash is a6d9c176ecb041c2184327b8375981127f3632758a7a8e61b041343efc3bcb6e
The # of inputs is 1 (vin_sz).
The # of outputs is 2 (vout_sz).

The single input of the transaction is referenced here:  
Quote
"in":[
    {
      "prev_out":{
        "hash":"b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325",
        "n":0
      },
      "scriptSig":"304402200e3d4711092794574e9b2be11728cc7e44a63525613f75ebc71375f0a6dd080d02202ef 1123328b3ecddddb0bed77960adccac5bbe317dfb0ce149eeee76498c19b101 04a36b5d3b4caa05aec80752f2e2805e4401fbdbe21be1011dc60c358c5fc4d3bedd1e03161fb4b 3a021c3764da57fee0d73570f3570f1b3dd92a1b06aae968846"
    }

The INPUT for this transaction is the OUTPUT of a PREVIOUS transaction.
Specifically it is the tx with hash b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325.  However a tx can have multiple outputs so how do we know which one of those otuptuts this input refers to?  Simple the tx output index is provided.

Inputs of transactions are outputs of prior transactions.

In this case the input we are looking for is is specifically the output #0 of transaction hash b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325

So lets look at that transaction.  The output #0 of this transaction is 1,000 BTC.
http://blockexplorer.com/tx/b5045e7daad205d1a204b544414af74fe66b67052838851514146eae5423e325

So back to our current transaction (hash ending e325).  There are only 1 input.  The value of the transaction is the sum of the inputs.  In this case a single input of 1,000 BTC.

Now looking at the output section we can see we the two outputs
 "out":[
    {
      "value":"300.00000000",
      "scriptPubKey":"OP_DUP OP_HASH160 0331e5256416bc11ecf9088091f8424819553a10 OP_EQUALVERIFY OP_CHECKSIG"
    },
    {
      "value":"699.99950000",
      "scriptPubKey":"OP_DUP OP_HASH160 4186719d739ae983d8c75a0cb82958e94b7ae81e OP_EQUALVERIFY OP_CHECKSIG"
    }

Output#0 is 300 BTC and output #1 is 699.9995 BTC.
The sum of the outputs is 999.9995 BTC
The sum of the inputs is 1,000.0000 BTC
The tx fee to miners is the difference or 0.0005 BTC.

If you want to decode further which addresses a transaction is sent to well you will need more code.

Lets look at the second output.
Quote
OP_DUP OP_HASH160 4186719d739ae983d8c75a0cb82958e94b7ae81e OP_EQUALVERIFY OP_CHECKSIG

So what is 4186719d739ae983d8c75a0cb82958e94b7ae81e?  It is the RIPEMD160 hash of the public key.  However Bitcoin addresses include a checksum and are converted to Base58.

BTW the public key hash 4186719d739ae983d8c75a0cb82958e94b7ae81e is Bitcoin address 16yTynjmSe5bsRGykDaaCL5bm2pxiEfcqP.

legendary
Activity: 2053
Merit: 1356
aka tonikt
Oh, don't be a pussy.
This is exactly where the fun starts.

It all can be done - maybe easier than you imagine ATM.
As someone had said here, you can just keep the unspent outputs in memory, in maps. Hashmaps are standard in C++.
And sha256 function is a simple code that you can include in your sources.
You don't really need any dependencies, if you like writing code Smiley

And no, output does not need an input - only input needs an output.
So you only need to keep track of unspent outputs, in order to be able to 'spend' them by processing further inputs in the block chain.
It's not a really complex problem to solve.
member
Activity: 82
Merit: 10
Yeah, I'm starting to no longer have fun doing this.  The relationship between the inputs and outputs are about as clear as mud.  Well, let me be clear, I get that each output needs an input, but hell if I know how to figure out which outputs refer to which inputs.  Just a bit ago, I wanted to figure out how to convert the binary representation of a public-key to ASCII and even that it had a log of dependencies; even the 'no library dependency version' depended on some network thing called 'htonl' and on an encryption routine as well.  That's a lot of code drug in just to convert some binary to ASCII.

I still do not believe that this needs to be that difficult, but everyone seems to start by including massive libraries like CrypoPP and boost, networking code, etc. so it's not much fun trying to dig through this stuff.

I like to write code with as few dependencies as humanly possible and often in what I call 'snippet' form; one header file and one CPP which demonstrates how to do just a single thing.  But I can't find anything like that, instead it's these massive libraries which build on top of each other in layers so that breaking it out into 'snippet' form is a major task.

I also think that the code snippet I started with; which reads each block into memory, is valuable and useful, but converting that raw date into paired input/output transactions seems to be a bit of a mystery to me so far.

[*Edit*]

I may not give up just yet.  I have some time this weekend and I finally was able to get BitcoinArmory 'cppForSwig' to build and run on my machine parse the bitcoin blockchain on my hard-drive.  I had to change a few things to get it to work; but it is working now. Hopefully I will be able to unravel the mystery of inputs-outputs and transactions by stepping through this code.

It does use CryptoPP but perhaps I can strip out just the (hopefully) handful of routines which are actually used.

John
legendary
Activity: 2053
Merit: 1356
aka tonikt
so here is where you come to the point where you understand that you need to build and keep a database of unspent txouts - a big one.
and also you will need sha256, each time before putting a record into it Smiley

EDIT: let my know how many lines of code will your parser have, at the end Smiley
legendary
Activity: 1232
Merit: 1094
In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?

No, you run the input script from the spending transaction and then the output script from the source transaction.

If transaction 123456789 input 4 references transaction 987654321, output 7, then those are the 2 scripts you have to use together.

Also, if they are standard transactions, you don't even need to bother, the address will be in a constant location.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
>>You run the script for the input and then run the script for the output.

Ok, I had previously asked in this thread how the input and output scripts related to each other and I did not get a clear answer.


In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?

Thanks,

John

In a given transaction they are unrelated.   For each TxIn you look at its OutPoint and go fetch the TxOut it references from a previous transaction in the block database.  You use that TxOut.  The TxOut in this Tx will be used as inputs in a future transaction (when the recipient spends the money)
legendary
Activity: 2053
Merit: 1356
aka tonikt
In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?
You take output script from a different transaction - the one pointed by txid+vout
member
Activity: 82
Merit: 10
>>You run the script for the input and then run the script for the output.

Ok, I had previously asked in this thread how the input and output scripts related to each other and I did not get a clear answer.


In the transaction header there are (n1) input scripts and (n2) output scripts; how do I know which input script to run on the virtual-machine prior to running a particular output script?

Thanks,

John
legendary
Activity: 1232
Merit: 1094
I guess I'm still a little confused on that point.  I started writing a script parser, but the scripts I try to run against it are all kind of incomplete.  The input scripts generally just push data on the stack but have no actual instructions to 'do' anything.

The output scripts tend to execute an instruction such as 'OP_CHECKSIG' which requires that two items are on the stack, but when I get here there is usually only one item on the stack; and I'm on clear on what was supposed to have been executed prior to it.

That is intentional.

Each transaction has half of the script.

You run the script for the input and then run the script for the output.

The standard address scripts are:

Output: OP_DUP OP_HASH160 OP_EQUALVERIFY OP_CHECKSIG
Input: scriptSig:

The input is run and then the output.

So, the output says "Anyone who can setup the stack so that this script verifies can spend this coin".

The full instuctions are

OP_DUP: Copy the top of the stack
OP_HASH160: Apply RIPE-160 to the top of the stack
: Push that value onto the stack
OP_EQUALVERIFY: Make sure the top 2 values on the stack are the same (and pop both)
OP_CHECKSIG: Check the signature

If you run the input, it pushes 2 values onto the stack



This sets up a successful run for the output

OP_DUP:
OP_HASH160: RIPE-160()
: RIPE-160()
OP_EQUAL_VERIFY:   (Also checked RIPE-160() was equal to )
OP_CHECKSIG: checks that sig is a valid signature using the public key

Quote
I'm not really understanding I guess how the input scripts match up to the output scripts and how current state of the stack is retained; I mean do you like run an input script; which pushes some stuff on the stack for the output script, or vice versa?  And, if so, how do you know which input scripts match which output scripts?

Right, each is half.

Quote
Regarding my other question about the 'public-key' part of bitcoin addresses, I found that it's using this 'Elliptic Curve Digital Signature Algorithm' thing.  What I can't find is a reference to how one converts the binary data for the key into the ASCII representation; this would be useful for debugging.

Addresses are encoded using Base 58 encoding.  Normal keys etc are normally shown using hex encoding.
legendary
Activity: 2053
Merit: 1356
aka tonikt
I guess I'm still a little confused on that point.  I started writing a script parser, but the scripts I try to run against it are all kind of incomplete.  The input scripts generally just push data on the stack but have no actual instructions to 'do' anything.
That's what the output scripts are.
The spending scripts pull this data from the stack and verify against signatures.

What is put on the stack is the public key that you need to convert into 25 bytes of a "version+hash+chksum" - then you base58 encode it and that is the address.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
You shouldn't be surprised there's no timestamps on transactions - that's the whole point of Bitcoin which is for a whole network of thousands of independent nodes to all agree on the ordering of transactions.   In essence, a transaction doesn't even really exist until it is included in a block.  Therefore, the block timestamp is when the TX became "real".  Everyone has different clocks and receives the tx in different orders and times.  The point of Bitcoin is for this network with no central authority to agree on which ones actually happened.

As for C++ code, go look at the Armory code base which had code that does exactly what you're looking for in the  cppForSWIG directory.  BlockUtils cppForSWIG.   Should be a function called parseEntireBlockchain of something like that.
member
Activity: 82
Merit: 10
Ok, thanks, that helps quite a bit.

I'm really surprised that transactions don't have a time stamp.  You mean there is no way to know 'when' a transaction occurred?

So, it sounds like I don't need to parse the scripts to know the transactions but if I want to know the source/destination addresses of the public keys I *do* have to parse the scripts?

I guess I'm still a little confused on that point.  I started writing a script parser, but the scripts I try to run against it are all kind of incomplete.  The input scripts generally just push data on the stack but have no actual instructions to 'do' anything.

The output scripts tend to execute an instruction such as 'OP_CHECKSIG' which requires that two items are on the stack, but when I get here there is usually only one item on the stack; and I'm on clear on what was supposed to have been executed prior to it.

I'm not really understanding I guess how the input scripts match up to the output scripts and how current state of the stack is retained; I mean do you like run an input script; which pushes some stuff on the stack for the output script, or vice versa?  And, if so, how do you know which input scripts match which output scripts?

For example, the block #1 output script pushes 65 bytes of data on the stack and then tries to execute OP_CHECKSIG which immediately fails in my VM and the reference code for bitcoin-qt because it requires there be at least two arguments on the stack.  

Regarding my other question about the 'public-key' part of bitcoin addresses, I found that it's using this 'Elliptic Curve Digital Signature Algorithm' thing.  What I can't find is a reference to how one converts the binary data for the key into the ASCII representation; this would be useful for debugging.

Sorry for the annoying questions but I will try to make up for it by posting some articles and open-source code snippets if I ever get this transaction analysis tool working.

Does anyone know of a C/C++ program someone else has already written that does what I'm trying to do here; just walk the block-chain and track transactions?  All of these code bases have sooo...many dependencies that you dive into crytographic hell trying to decipher them.

Thanks,

John
sr. member
Activity: 260
Merit: 250
This is pretty cool! Thanks for doing this. I'll check it out over the weekend.
legendary
Activity: 1246
Merit: 1002
Ok, I think I'm making progress understanding this; a bit.

(5) On BlockChain info, if you just look at Block #1 https://blockchain.info/block/00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048  It specifies both a hash and a previous block hash.  I'm a little confused on that point; there is no 'previous' block to block #1.  Also, there is no 'hash' field on block-chain data structure; only a previous hash field.


Block #0 exists.  Your link has a field for "Previous Block" on the right hand side.
legendary
Activity: 1232
Merit: 1094
(1) Can you confirm that it is true that a program which only wants to navigate the transaction history of the blockchain does not actually have to parse either the input or the output scripts?  This is true, correct?  Those scripts are only used for validating transactions but since I'm reading the blockchain off of my hard drive, they represent all properly validated transactions. Correct?

(2) All of the information needed to understand the blockchain (i.e. see the same kind of information shown on blockchain.info) can be found simply by parsing the blockchain data files (blkNNNNN.dat) files?

Right

Quote
(3) When looking at the block data; there is a timestamp for the block; but I don't see a timestamp for individual transactions.  Is this correct?

Transactions don't have timestamps.  They are "locked" to a time when they are included in a block.  If they had a timestamp, then they would only work for a particular block, but miners decide which blocks to place transactions into. 

They do have a "locktime".  Transactions cannot be added into a block until that time has arrived (or that block number if < 50000000 or so).

Quote
(4) When processing an input it specifies an input transaction hash and a transaction index number; how does one go from that hash/index and figure out a specific output transaction in a previous block?

You need to create a map of hash(transaction) -> transaction (possibly file pointer).

You have to hash the entire transaction as it appears in the file.  That is sha256(sha256(transaction)).  Sometimes people get the big-endian/little-endian the wrong way around.

Quote
(5) On BlockChain info, if you just look at Block #1 https://blockchain.info/block/00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048  It specifies both a hash and a previous block hash.  I'm a little confused on that point; there is no 'previous' block to block #1.  Also, there is no 'hash' field on block-chain data structure; only a previous hash field.

Yes there is, blocks start from block 0, there is no previous to block zero though. 

However, the genesis block is presumably not in the file, since it is hard-coded.

Quote
(6) On the same block #1 page, it shows an output address '12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX'.  Where does that value come from?  Was it in the script?  I don't see it anywhere in the data structures?  Also, this address isn't a hex value; is there some published algorithm that converts a binary hash into this ASCII string format?

They calculate that from the script.  There are a few standard transactions.
member
Activity: 82
Merit: 10
Ok, I think I'm making progress understanding this; a bit.

I now understand that each block as a certain number of transactions and each transaction has a certain number of inputs and outputs (potentially representing where btc is coming from and where it is going to).

Could you guys help clarify some things in regards to simply parsing the raw blockchain data to get a meaningful transaction history out?

(1) Can you confirm that it is true that a program which only wants to navigate the transaction history of the blockchain does not actually have to parse either the input or the output scripts?  This is true, correct?  Those scripts are only used for validating transactions but since I'm reading the blockchain off of my hard drive, they represent all properly validated transactions. Correct?

(2) All of the information needed to understand the blockchain (i.e. see the same kind of information shown on blockchain.info) can be found simply by parsing the blockchain data files (blkNNNNN.dat) files?

(3) When looking at the block data; there is a timestamp for the block; but I don't see a timestamp for individual transactions.  Is this correct?

(4) When processing an input it specifies an input transaction hash and a transaction index number; how does one go from that hash/index and figure out a specific output transaction in a previous block?

(5) On BlockChain info, if you just look at Block #1 https://blockchain.info/block/00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048  It specifies both a hash and a previous block hash.  I'm a little confused on that point; there is no 'previous' block to block #1.  Also, there is no 'hash' field on block-chain data structure; only a previous hash field.

(6) On the same block #1 page, it shows an output address '12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX'.  Where does that value come from?  Was it in the script?  I don't see it anywhere in the data structures?  Also, this address isn't a hex value; is there some published algorithm that converts a binary hash into this ASCII string format?

Thanks,

John
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
The TxIns and TxOuts are independent.  Think of the network as containing a bunch of floating [unspent coins].  Each of those unspent TxOuts is like a bill (like $20 bill).  The TxIn is simply a reference to a particular bill with a signature proving you are authorized to spend it.  If you want to send someone 10 BTC, you need to collect enough bills (say, a 4 BTC bill/TxOut, a 2 BTC bill/TxOut, and a 7 BTC bill/TxOut).  10BTC output, and one output sending the remainder back to yourself (3 BTC).  

You can think of transactions as eating unspent TxOuts, and producing out new TxOuts (and you are authorizing the spending of the old TxOuts by the signatures in the TxIns). A transaction takes TxOuts of some ownership, and converts them into TxOuts of a new ownership.  You can split them or merge them in any quantity, as long as you have at least one TxIn (corresponding to a previous TxOut being eaten) and one new TxOut.

Here's a random picture I made a long time ago that attempts to illustrate it.  



The dotted lines for the TxIns illustrates that they are really kind of "transient" data.  They're simply for the network to see that you were authorized to move those coins, but in the far future, the old spent TxOuts, and the TxIns that moved them can all be pruned off.
legendary
Activity: 1232
Merit: 1094
There are (n1) input scripts and (n2) output scripts.

Right

Quote
Does n1 always equal n2?

No

Quote
Does each input match up with each output?

This works for certain kinds of multi-sig.  You can say that your signature is conditional on a the matching output being unchanged (but you don't care about any of the other outputs).

Quote
In other words, in what order are the scripts executed and do they share the same stack?

They are all independent.  However, some of the hashing steps for signatures blank include/exclude some outputs.
member
Activity: 82
Merit: 10
Does anyone know of a better explanation of how script processing works than I have found online?

Or could someone post an pseudo-code style description here?

My question is this.

There are (n1) input scripts and (n2) output scripts.

Does n1 always equal n2?

Does each input match up with each output?

In other words, in what order are the scripts executed and do they share the same stack?

For example, does input one script push values on the stack which are then popped by output script one?

It's not entirely clear to me must from reading the online documentation the order of execution and how the stack is shared between scripts.

Thanks,

John

Pages:
Jump to: