Author

Topic: How do I crawl through the block chain? (Read 15775 times)

member
Activity: 82
Merit: 10
January 08, 2014, 04:22:35 PM
#12
FWIW, I just released a binary Windows console app (64 bit) that will let you parse the whole blockchain and do queries against it.  Since it doesn't write the data to a database, instead, it holds the whole thing in memory, you will need a beefy machine to run it as it allocates about 10gb of memory to parse the whole blockchain.

http://codesuppository.blogspot.com/2014/01/a-command-line-interface-for-blockchain.html
newbie
Activity: 16
Merit: 0
January 08, 2014, 02:56:26 AM
#11
I want to anwer questions like "how many transaction with lower input value than 1btc have ever been completed after 1.1.2013" and similar.

You basically want to parse the blockchain into a database like MySQL/Postgres so that you can answer that question by executing a query against it.

For example, I parsed it into a database called Datomic and that query looks like this:

https://i.imgur.com/tQcuIN8.png

The blocks are stored in the blkXXXXX.dat files found in the ~/.bitcoin directory.

Each blkdat file is a sequence of block data where each block can be parsed like this:

  • Magic bytes: uint32-le
  • Byte-count of the block: uint32-le
  • Block: BlockCodec

The file may be padded at the end with null-bytes. You also need to know that blkdat files are not a validated blockchain, just the blocks that bitcoind received over the network. The easiest thing to do is just add all blocks to the database and assume that the longest chain of `latestBlock->prevBlock->prevBlock->...->genesisBlock` is the one you want to query.

Database schema looks something like this:

- Block has many Txns
- Block has one PreviousBlock
- Txn has many TxOuts
- Txn has many TxIns

The pseudocode for it all is pretty simple and parsing is straight forward.

Word of warning: I went so far as to get a local blockexplorer working (https://github.com/danneu/chaingun), but of course I wandered off into the dragon lair of reimplementing blockchain validation and ran out of free time. It became evident that I was in fact climbing a large mountain when I passed by Jon Krakauer's corpse frozen in the snow. At which point I realized I wasn't climbing a mountain at all but slowly sliding down one slippery slope. And when I reached the bottom and turned back around, I realized it wasn't even a slope but instead a heap of my own yak shavings.

I have'nt met many people that use Datomic. And now I think about it the blockchain is actually quite datomic like. (none stateful , append only). How are you finding datomic as a data store?

I have always been interested in these new data stores, but postgresql/mysql gives you so much flexibility, powerful, rich querying syntax. And for blockchain data (only a few dozens inserts per minutes), it performs extremely well. My question is: what advantage do you feel datomic has over something like postgresql?

It is purely out of my curiosity for datomic
newbie
Activity: 32
Merit: 0
January 07, 2014, 02:39:59 PM
#10
I want to anwer questions like "how many transaction with lower input value than 1btc have ever been completed after 1.1.2013" and similar.

You basically want to parse the blockchain into a database like MySQL/Postgres so that you can answer that question by executing a query against it.

For example, I parsed it into a database called Datomic and that query looks like this:

https://i.imgur.com/tQcuIN8.png

The blocks are stored in the blkXXXXX.dat files found in the ~/.bitcoin directory.

Each blkdat file is a sequence of block data where each block can be parsed like this:

  • Magic bytes: uint32-le
  • Byte-count of the block: uint32-le
  • Block: BlockCodec

The file may be padded at the end with null-bytes. You also need to know that blkdat files are not a validated blockchain, just the blocks that bitcoind received over the network. The easiest thing to do is just add all blocks to the database and assume that the longest chain of `latestBlock->prevBlock->prevBlock->...->genesisBlock` is the one you want to query.

Database schema looks something like this:

- Block has many Txns
- Block has one PreviousBlock
- Txn has many TxOuts
- Txn has many TxIns

The pseudocode for it all is pretty simple and parsing is straight forward.

Word of warning: I went so far as to get a local blockexplorer working (https://github.com/danneu/chaingun), but of course I wandered off into the dragon lair of reimplementing blockchain validation and ran out of free time. It became evident that I was in fact climbing a large mountain when I passed by Jon Krakauer's corpse frozen in the snow. At which point I realized I wasn't climbing a mountain at all but slowly sliding down one slippery slope. And when I reached the bottom and turned back around, I realized it wasn't even a slope but instead a heap of my own yak shavings.
newbie
Activity: 14
Merit: 0
January 07, 2014, 12:02:42 PM
#9
This is awesome stuff Smiley
sr. member
Activity: 294
Merit: 260
January 06, 2014, 06:10:54 PM
#8
http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

I am also writing a thorough piece of documentation for this exact topic right now.  You can read what I have so far here, I created a detailed graphic.  I don't expect to be completely finished with the article until tomorrow.

http://codesuppository.blogspot.com/2014/01/how-to-parse-bitcoin-blockchain.html

Thanks for these!
legendary
Activity: 1526
Merit: 1134
January 06, 2014, 01:37:41 PM
#7
You could do that with bitcoinj pretty easily in Java or any other language that runs on the JVM. Set things up with a custom BlockChainListener, point it at either the p2p network or a bitcoind node you control, and your callback can receive every transaction in the chain. Typically people load it into a database so they can do queries later.
hero member
Activity: 492
Merit: 503
January 06, 2014, 01:31:25 PM
#6
John's documentation is awesome. I have been following it and building my own parser

Ha! This stuff is infectious isn't it?
hero member
Activity: 492
Merit: 503
January 06, 2014, 01:30:32 PM
#5
Interesting timing for me! About a week ago I decided I was finally going to start learning about all the code underneath the bitcoin protocol and software. Except I wasn't going to actually examine any code (I don't know much C++), I was going to read lots of technical articles and write my own code from the bottom up. Also I was going to do it in Python. Which I had not yet learned. Why? "Because it's there".

So anyway a week later I've read the Satoshi paper, the protocol-rules and protocol-spec docs on the wiki, that very helpful '285 bytes that changed the world', and this: http://www.michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/ . Plus I've learned some python. And here's what I've got to show for it:

Code:
#!/usr/bin/python
from time import ctime

def behex(x):
   "Converts the given little-endian slice of bytes into big-endian hex digits"
   return x[::-1].encode('hex')

def varint(x):
   """Converts the bytes-slice to a list: first number is the number of bytes"""
   """the varint takes up. Second number is the actual int value. Recommend"""
   """only passing 9 bytes to this function, but extras will be ignored"""
   a = int(x[0].encode('hex'), 16)
   if a < 253: return (1, a)
   elif a == 253: return (3, int(behex(x[1:3]), 16))
   elif a == 254: return (5, int(behex(x[1:5]), 16))
   elif a == 255: return (9, int(behex(x[1:9]), 16))

class TrInput:
   """A single input in a single transaction. We create it from a slice"""
"""of bytes, but do NOT make assumptions about where it ends. We"""
"""infer that from script length, and further bytes are ignored."""
def __init__(self, data):
self.hash = data[0:32] ; self.index = data[32:36]
(off1, scrlen) = varint(data[36:45])
off1 = off1 + 36 ; off2 = off1 + scrlen ; self.size = off2 + 4
self.script = data[off1:off2] ; self.seqnum = data[off2:self.size]

class TrOutput:
"""A single output in a single transaction. We create it from a slice"""
"""of bytes, but do NOT make assumptions about where it ends. We """
"""infer that from script length, and further bytes are ignored."""
def __init__(self, data):
self.satval = data[0:8] ; (offset, scrlen) = varint(data[8:17])
offset = offset + 8 ; self.size = offset + scrlen
self.script = data[offset:self.size]

class Transaction:
"""Data is a slice of bytes, should start with version"""
"""but can end anywhere, excess bytes are ignored."""
def __init__(self, data):
self.version = int(behex(data[0:4]), 16)
(off1, ninps) = varint(data[4:13]) ; off1 = off1 + 4 ; self.inp = []
for i in range(ninps):
self.inp.append(TrInput(data[off1:]))
off1 = off1 + self.inp[i].size
(off2, nouts) = varint(data[off1:off1+9])
off2 = off2 + off1 ; self.out = []
for i in range(nouts):
self.out.append(TrOutput(data[off2:]))
off2 = off2 + self.out[i].size
self.size = off2 + 4 ; self.locktime = behex(data[off2:self.size])

class Block:
def __init__(self, blkdata):
self.size = len(blkdata) ; self.version = int(behex(blkdata[0:4]), 16)
self.prevhash = blkdata[4:36] ; self.merkle = blkdata[36:68]
self.tstamp = int(behex(blkdata[68:72]), 16)
self.target = behex(blkdata[72:76])
self.nonce = behex(blkdata[76:80])
(offset, self.ntrans) = varint(blkdata[80:89])
offset = offset + 80 ; self.trans = []
for i in range(self.ntrans):
self.trans.append(Transaction(blkdata[offset:]))
offset = offset + self.trans[i].size

def checkmagicbytes(fname):
"Reads 4 bytes from current position in file, checks they're 'magic'"
x = fname.read(4) # why can't I ';' these two lines? (and next two)
if len(x) == 0: return 0 # ???
y = int(behex(x), 16)
if y != 0xD9B4BEF9: return 0
return 1

def target2diff(t):
"Converts the target in 4-byte hexstring form into difficulty (float)"
# MAY not agree with diff given in block explorer for high diffs - check!
i1 = int(t, 16) ; i2 = float(2**(248 - i1/(2**21)))/(i1 % (2**24))
return i2 * 65535/65536 # I THINK that fudge factor fixes things...

bn = 0 ; fn = 0 ; chain = []
while True:
fname = "/home/chris/.bitcoin/blocks/blk{:0=5}.dat".format(fn)
try: bcfile = open(fname,"rb")
except IOError: break
while checkmagicbytes(bcfile): # issue about two failure modes!
bkdata = bcfile.read(int(behex(bcfile.read(4)),16))
chain.append(Block(bkdata)) ; bn = bn + 1
if bn > 100000: break # just for memory purposes for now
fn = fn + 1 ; bcfile.close()
if bn > 100000: break # ditto

bn = bn - 1 ; fn = fn - 1 ; print "Chain ends: block", bn, "file", fn
opt = int(raw_input("Enter a block number:"))
while opt <= bn:
print "Version:", chain[opt].version
print "Prev hash:", behex(chain[opt].prevhash)
print "Merkle root:", behex(chain[opt].merkle)
print "Timestamp:", ctime(chain[opt].tstamp)
print "Difficulty:", target2diff(chain[opt].target)
print "Nonce:", chain[opt].nonce
trs = chain[opt].trans ; n = chain[opt].ntrans
print "# of transactions:", n, "size:", chain[opt].size
for t in range(n):
print "Transaction", t, "Version", trs[t].version,
print "Locktime", trs[t].locktime ; ni = len(trs[t].inp)
# input and output data still needs to be presented readably...
for i in range(ni):
inp = trs[t].inp[i]
print "Input:", i
print "Hash:", inp.hash
print "Index:", inp.index
print "Script:", inp.script
print "Seqnum:", inp.seqnum
no = len(trs[t].out)
for o in range(no):
out = trs[t].out[i]
print "Output:", o
print "Value:", out.satval
print "Script:", out.script
opt = int(raw_input("Enter another block number:"))


Feel free to use it, and any constructive comments/help/tips most welcome. It doesn't do too much. You need the blockchain as downloaded by the official client, it'll read the first 100000 blocks (or more if you remove the brakes, but then be careful about your memory filling up). Then you can enter a block height number and see what's in it. The next things on my to-do list are (a) understand the scripts in the inputs and outputs, (b) shove SHA256 stuff in so I can compute block hashes, merkle roots, transaction ids etc, and (c) use all that to validate the blockchain (at least the strictly linear version - no forks for me yet). Then I guess (d) - work out the most compact way of storing the relevant info from the first n blocks (where, say, n=current block height - a few hundred) in memory. I think they call this 'pruning'.

And I've only NOW started reading John Ratcliff's 'code suppository'  Grin. It looks good!
newbie
Activity: 16
Merit: 0
January 06, 2014, 12:45:47 AM
#4
http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

I am also writing a thorough piece of documentation for this exact topic right now.  You can read what I have so far here, I created a detailed graphic.  I don't expect to be completely finished with the article until tomorrow.

http://codesuppository.blogspot.com/2014/01/how-to-parse-bitcoin-blockchain.html

John's documentation is awesome. I have been following it and building my own parser
sr. member
Activity: 280
Merit: 257
bluemeanie
January 04, 2014, 06:41:19 PM
#3
you can try my Bitcoin Graph DB importer.  From there you can do data mining and other sorts of operations.

https://github.com/BlueMeanie/bitcoingraphdb

-bm
member
Activity: 82
Merit: 10
January 04, 2014, 04:35:08 PM
#2
http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

I am also writing a thorough piece of documentation for this exact topic right now.  You can read what I have so far here, I created a detailed graphic.  I don't expect to be completely finished with the article until tomorrow.

http://codesuppository.blogspot.com/2014/01/how-to-parse-bitcoin-blockchain.html
hero member
Activity: 546
Merit: 500
hm
January 04, 2014, 12:07:15 PM
#1
Hey guys,

I have a little time and like to program for fun a little bit. So I thought I want to crawl the block chain I have stored on my computer. I have knowledge about Java and asked myself, how the transactions are stored and how can I enter them. Maybe with c++ too learn it, but I prefer Java. I want to anwer questions like "how many transaction with lower input value than 1btc have ever been completed after 1.1.2013" and similar.

Can anyone help me? Maybe I can present some weird facts in a few days or I am just not depenend on others to answer such questions.

Links or guides would be great, if they exist.

thanks

Jump to: