How do I crawl through the block chain?

jratcliff63367

member

Activity: 82

Merit: 10

FWIW, I just released a binary Windows console app (64 bit) that will let you parse the whole blockchain and do queries against it. Since it doesn't write the data to a database, instead, it holds the whole thing in memory, you will need a beefy machine to run it as it allocates about 10gb of memory to parse the whole blockchain.

http://codesuppository.blogspot.com/2014/01/a-command-line-interface-for-blockchain.html

sidazhang

newbie

Activity: 16

Merit: 0

Quote from: danneu on January 07, 2014, 02:39:59 PM

Quote from: porcupine87 on January 04, 2014, 12:07:15 PM

I want to anwer questions like "how many transaction with lower input value than 1btc have ever been completed after 1.1.2013" and similar.

You basically want to parse the blockchain into a database like MySQL/Postgres so that you can answer that question by executing a query against it.

For example, I parsed it into a database called Datomic and that query looks like this:

https://i.imgur.com/tQcuIN8.png

The blocks are stored in the blkXXXXX.dat files found in the ~/.bitcoin directory.

Each blkdat file is a sequence of block data where each block can be parsed like this:

Magic bytes: uint32-le
Byte-count of the block: uint32-le
Block: BlockCodec

The file may be padded at the end with null-bytes. You also need to know that blkdat files are not a validated blockchain, just the blocks that bitcoind received over the network. The easiest thing to do is just add all blocks to the database and assume that the longest chain of `latestBlock->prevBlock->prevBlock->...->genesisBlock` is the one you want to query.

Database schema looks something like this:

- Block has many Txns
- Block has one PreviousBlock
- Txn has many TxOuts
- Txn has many TxIns

The pseudocode for it all is pretty simple and parsing is straight forward.

Word of warning: I went so far as to get a local blockexplorer working (https://github.com/danneu/chaingun), but of course I wandered off into the dragon lair of reimplementing blockchain validation and ran out of free time. It became evident that I was in fact climbing a large mountain when I passed by Jon Krakauer's corpse frozen in the snow. At which point I realized I wasn't climbing a mountain at all but slowly sliding down one slippery slope. And when I reached the bottom and turned back around, I realized it wasn't even a slope but instead a heap of my own yak shavings.

I have'nt met many people that use Datomic. And now I think about it the blockchain is actually quite datomic like. (none stateful , append only). How are you finding datomic as a data store?

I have always been interested in these new data stores, but postgresql/mysql gives you so much flexibility, powerful, rich querying syntax. And for blockchain data (only a few dozens inserts per minutes), it performs extremely well. My question is: what advantage do you feel datomic has over something like postgresql?

It is purely out of my curiosity for datomic

danneu

newbie

Activity: 32

Merit: 0

Quote from: porcupine87 on January 04, 2014, 12:07:15 PM

I want to anwer questions like "how many transaction with lower input value than 1btc have ever been completed after 1.1.2013" and similar.

You basically want to parse the blockchain into a database like MySQL/Postgres so that you can answer that question by executing a query against it.

For example, I parsed it into a database called Datomic and that query looks like this:

https://i.imgur.com/tQcuIN8.png

The blocks are stored in the blkXXXXX.dat files found in the ~/.bitcoin directory.

Each blkdat file is a sequence of block data where each block can be parsed like this:

Magic bytes: uint32-le
Byte-count of the block: uint32-le
Block: BlockCodec

The file may be padded at the end with null-bytes. You also need to know that blkdat files are not a validated blockchain, just the blocks that bitcoind received over the network. The easiest thing to do is just add all blocks to the database and assume that the longest chain of `latestBlock->prevBlock->prevBlock->...->genesisBlock` is the one you want to query.

Database schema looks something like this:

- Block has many Txns
- Block has one PreviousBlock
- Txn has many TxOuts
- Txn has many TxIns

The pseudocode for it all is pretty simple and parsing is straight forward.

Word of warning: I went so far as to get a local blockexplorer working (https://github.com/danneu/chaingun), but of course I wandered off into the dragon lair of reimplementing blockchain validation and ran out of free time. It became evident that I was in fact climbing a large mountain when I passed by Jon Krakauer's corpse frozen in the snow. At which point I realized I wasn't climbing a mountain at all but slowly sliding down one slippery slope. And when I reached the bottom and turned back around, I realized it wasn't even a slope but instead a heap of my own yak shavings.

toxigenicpoem

newbie

Activity: 14

Merit: 0

This is awesome stuff

antanst

sr. member

Activity: 294

Merit: 260

Quote from: jratcliff63367 on January 04, 2014, 04:35:08 PM

http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

I am also writing a thorough piece of documentation for this exact topic right now. You can read what I have so far here, I created a detailed graphic. I don't expect to be completely finished with the article until tomorrow.

http://codesuppository.blogspot.com/2014/01/how-to-parse-bitcoin-blockchain.html

Thanks for these!

Mike Hearn

legendary

Activity: 1526

Merit: 1134

You could do that with bitcoinj pretty easily in Java or any other language that runs on the JVM. Set things up with a custom BlockChainListener, point it at either the p2p network or a bitcoind node you control, and your callback can receive every transaction in the chain. Typically people load it into a database so they can do queries later.

Nancarrow

hero member

Activity: 492

Merit: 503

Quote from: sidazhang on January 06, 2014, 12:45:47 AM

John's documentation is awesome. I have been following it and building my own parser

Ha! This stuff is infectious isn't it?

Nancarrow

hero member

Activity: 492

Merit: 503

Interesting timing for me! About a week ago I decided I was finally going to start learning about all the code underneath the bitcoin protocol and software. Except I wasn't going to actually examine any code (I don't know much C++), I was going to read lots of technical articles and write my own code from the bottom up. Also I was going to do it in Python. Which I had not yet learned. Why? "Because it's there".

So anyway a week later I've read the Satoshi paper, the protocol-rules and protocol-spec docs on the wiki, that very helpful '285 bytes that changed the world', and this: http://www.michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/ . Plus I've learned some python. And here's what I've got to show for it:

Code:

#!/usr/bin/python
from time import ctime

def behex(x):
   "Converts the given little-endian slice of bytes into big-endian hex digits"
   return x[::-1].encode('hex')

def varint(x):
   """Converts the bytes-slice to a list: first number is the number of bytes"""
   """the varint takes up. Second number is the actual int value. Recommend"""
   """only passing 9 bytes to this function, but extras will be ignored"""
   a = int(x[0].encode('hex'), 16)
   if a < 253: return (1, a)
   elif a == 253: return (3, int(behex(x[1:3]), 16))
   elif a == 254: return (5, int(behex(x[1:5]), 16))
   elif a == 255: return (9, int(behex(x[1:9]), 16))

class TrInput:
   """A single input in a single transaction. We create it from a slice"""
	"""of bytes, but do NOT make assumptions about where it ends. We"""
	"""infer that from script length, and further bytes are ignored."""
	def __init__(self, data):
		self.hash = data[0:32] ; self.index = data[32:36]
		(off1, scrlen) = varint(data[36:45])
		off1 = off1 + 36 ; off2 = off1 + scrlen ; self.size = off2 + 4
		self.script = data[off1:off2] ; self.seqnum = data[off2:self.size]

class TrOutput:
	"""A single output in a single transaction. We create it from a slice"""
	"""of bytes, but do NOT make assumptions about where it ends. We """
	"""infer that from script length, and further bytes are ignored."""
	def __init__(self, data):
		self.satval = data[0:8] ; (offset, scrlen) = varint(data[8:17])
		offset = offset + 8 ;	self.size = offset + scrlen
		self.script = data[offset:self.size]

class Transaction:
	"""Data is a slice of bytes, should start with version"""
	"""but can end anywhere, excess bytes are ignored."""
	def __init__(self, data):
		self.version = int(behex(data[0:4]), 16)
		(off1, ninps) = varint(data[4:13]) ; off1 = off1 + 4 ; self.inp = []
		for i in range(ninps):
			self.inp.append(TrInput(data[off1:]))
			off1 = off1 + self.inp[i].size
		(off2, nouts) = varint(data[off1:off1+9])
		off2 = off2 + off1 ; self.out = []
		for i in range(nouts):
			self.out.append(TrOutput(data[off2:]))
			off2 = off2 + self.out[i].size
		self.size = off2 + 4 ; self.locktime = behex(data[off2:self.size])

class Block:
	def __init__(self, blkdata):
		self.size = len(blkdata) ;	self.version = int(behex(blkdata[0:4]), 16)
		self.prevhash = blkdata[4:36] ; self.merkle = blkdata[36:68]
		self.tstamp = int(behex(blkdata[68:72]), 16)
		self.target = behex(blkdata[72:76])
		self.nonce = behex(blkdata[76:80])
		(offset, self.ntrans) = varint(blkdata[80:89])
		offset = offset + 80 ; self.trans = []
		for i in range(self.ntrans):
			self.trans.append(Transaction(blkdata[offset:]))
			offset = offset + self.trans[i].size

def checkmagicbytes(fname):
	"Reads 4 bytes from current position in file, checks they're 'magic'"
	x = fname.read(4) # why can't I ';' these two lines? (and next two) 
	if len(x) == 0: return 0 # ???
	y = int(behex(x), 16)
	if y != 0xD9B4BEF9: return 0
	return 1

def target2diff(t):
	"Converts the target in 4-byte hexstring form into difficulty (float)"
	# MAY not agree with diff given in block explorer for high diffs - check!
	i1 = int(t, 16) ; i2 = float(2**(248 - i1/(2**21)))/(i1 % (2**24))
	return i2 * 65535/65536	# I THINK that fudge factor fixes things...
	
bn = 0 ; fn = 0 ; chain = []
while True:
	fname = "/home/chris/.bitcoin/blocks/blk{:0=5}.dat".format(fn)
	try: bcfile = open(fname,"rb")
	except IOError: break
	while checkmagicbytes(bcfile): # issue about two failure modes!
		bkdata = bcfile.read(int(behex(bcfile.read(4)),16))
		chain.append(Block(bkdata)) ; bn = bn + 1
		if bn > 100000: break # just for memory purposes for now
	fn = fn + 1 ;	bcfile.close()
	if bn > 100000: break # ditto

bn = bn - 1 ; fn = fn - 1 ; print "Chain ends: block", bn, "file", fn
opt = int(raw_input("Enter a block number:"))
while opt <= bn:
	print "Version:", chain[opt].version
	print "Prev hash:", behex(chain[opt].prevhash)
	print "Merkle root:", behex(chain[opt].merkle)
	print "Timestamp:", ctime(chain[opt].tstamp)
	print "Difficulty:", target2diff(chain[opt].target)
	print "Nonce:", chain[opt].nonce
	trs = chain[opt].trans ; n = chain[opt].ntrans
	print "# of transactions:", n, "size:", chain[opt].size
	for t in range(n):
		print "Transaction", t, "Version", trs[t].version,
		print "Locktime", trs[t].locktime ; ni = len(trs[t].inp)
		# input and output data still needs to be presented readably...
		for i in range(ni):
			inp = trs[t].inp[i]
			print "Input:", i
			print "Hash:", inp.hash
			print "Index:", inp.index
			print "Script:", inp.script
			print "Seqnum:", inp.seqnum
		no = len(trs[t].out)
		for o in range(no):
			out = trs[t].out[i]
			print "Output:", o
			print "Value:", out.satval
			print "Script:", out.script
	opt = int(raw_input("Enter another block number:"))

Feel free to use it, and any constructive comments/help/tips most welcome. It doesn't do too much. You need the blockchain as downloaded by the official client, it'll read the first 100000 blocks (or more if you remove the brakes, but then be careful about your memory filling up). Then you can enter a block height number and see what's in it. The next things on my to-do list are (a) understand the scripts in the inputs and outputs, (b) shove SHA256 stuff in so I can compute block hashes, merkle roots, transaction ids etc, and (c) use all that to validate the blockchain (at least the strictly linear version - no forks for me yet). Then I guess (d) - work out the most compact way of storing the relevant info from the first n blocks (where, say, n=current block height - a few hundred) in memory. I think they call this 'pruning'.

And I've only NOW started reading John Ratcliff's 'code suppository' Grin

. It looks good!

sidazhang

newbie

Activity: 16

Merit: 0

Quote from: jratcliff63367 on January 04, 2014, 04:35:08 PM

http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

I am also writing a thorough piece of documentation for this exact topic right now. You can read what I have so far here, I created a detailed graphic. I don't expect to be completely finished with the article until tomorrow.

http://codesuppository.blogspot.com/2014/01/how-to-parse-bitcoin-blockchain.html

John's documentation is awesome. I have been following it and building my own parser

bluemeanie1

sr. member

Activity: 280

Merit: 257

bluemeanie

you can try my Bitcoin Graph DB importer. From there you can do data mining and other sorts of operations.

https://github.com/BlueMeanie/bitcoingraphdb

-bm

jratcliff63367

member

Activity: 82

Merit: 10

http://codesuppository.blogspot.com/2013/07/bitcoin-code-snippets.html

I am also writing a thorough piece of documentation for this exact topic right now. You can read what I have so far here, I created a detailed graphic. I don't expect to be completely finished with the article until tomorrow.

http://codesuppository.blogspot.com/2014/01/how-to-parse-bitcoin-blockchain.html

porcupine87

hero member

Activity: 546

Merit: 500

hm

Hey guys,

I have a little time and like to program for fun a little bit. So I thought I want to crawl the block chain I have stored on my computer. I have knowledge about Java and asked myself, how the transactions are stored and how can I enter them. Maybe with c++ too learn it, but I prefer Java. I want to anwer questions like "how many transaction with lower input value than 1btc have ever been completed after 1.1.2013" and similar.

Can anyone help me? Maybe I can present some weird facts in a few days or I am just not depenend on others to answer such questions.

Links or guides would be great, if they exist.

thanks

Topic: How do I crawl through the block chain? (Read 15786 times)