Author

Topic: Blockchain reorganization process logic. rev*.dat format and byte map. (Read 253 times)

newbie
Activity: 3
Merit: 10
https://i.stack.imgur.com/aVI1a.png

NOTES:

Heigh and Version fields
Fields height and version are needed to identify Outputs if they are last spent in the Transaction outputs. In older revxx.dat files, only if Output was last unspent in a Transaction it's hight will be there together with it’s Transaction’s Version. If it's a regular Output then the height field will be 0 (00 HEX in Varint format ) and there will be no Version field

In newer revxx.dat files height is written for every Output and the Version field is 0 (00 in HEX) to be backward compatible. I did not check the revxx.dat files myself if it’s so, so please test this note first.

Calculation of Height from the Field Value:
To compress the Amount of Statoshis this field uses the following logic:
If the amount is 0, output 0
First, divide the amount (in base units) by the largest power of 10 possible; call the exponent e (e is max 9)
if e < 9, the last digit of the resulting number cannot be 0; store it as d, and drop it (divide by 10)
   call the result n
  output 1 + 10*(9*n + d - 1) + e
 if e==9, we only know the resulting number is not zero, so output 1 + 10*(n - 1) + 9
(this is decodable, as d is in [1-9] and e is in [0-9])
Then store it as a VARINT


nValue field(Amount)

To compress the Amount of Statoshis this field uses the following logic:

If the amount is 0, output 0

First, divide the amount (in base units) by the largest power of 10 possible; call the exponent e (e is max 9)

if e < 9, the last digit of the resulting number cannot be 0; store it as d, and drop it (divide by 10)

call the result n

output 1 + 10\*(9\*n + d - 1) + e

if e==9, we only know the resulting number is not zero, so output 1 + 10\*(n - 1) + 9

(this is decodable, as d is in [1-9] and e is in [0-9])

Then store it as a VARINT


Calculation of nValue field(Amount):

To decode this amount go through the operation backward:

First, decode it from VARINT. Then use the following procedure(taken directly from the Bitcoin core implementation):

uint64\_t DecompressAmount(uint64\_t x)

{

// x = 0 OR x = 1+10\*(9\*n + d - 1) + e OR x = 1+10\*(n - 1) + 9

if (x == 0)

return 0;

x--;

// x = 10\*(9\*n + d - 1) + e

int e = x % 10;

x /= 10;

uint64\_t n = 0;

if (e < 9) {

// x = 9\*n + d - 1

int d = (x % 9) + 1;

x /= 9;

// x = n

n = x\*10 + d;

} else {

n = x+1;

}

while (e) {

n \*= 10;

e--;

}

return n;

}

ScriptPubKey size and SriptPubKey fields

Current Bitcoin serialization is using a compressed format to store common types of scripts:

- Pay to Pubkey Hash (P2PKH) is encoded to – 20 bytes
- Pay to Script Hash (P2SH) is encoded to – 20 bytes
- Pay to Pubkey (P2PK) is encoded to – 32 bytes

Other script types are serialized using the general rule shown in the table:

- scriptPubKey size
- scriptPubKey data

Where script size is a VARINT field of len(scriptPubKey data) + number of special cases, where the number of special cases is int(6), then goes the regular scriptPubKey data.

The compression algorithm is removing all OP\_CODE values from the common types of scripts and leaves only the Key, KeyHash, or ScriptHash intact. Then it is using the field to indicate the type of script in the next field.

https://i.stack.imgur.com/NfiEs.png

P2Pk cases: 0x02, 0x03 identify compressed Public Key and show how which  Y coordinate to use to reconstruct the full key.  0x04, 0x05  identify uncompressed Public Key, but the value in the is still in a compressed format and, and the size value shows which Y coordinate to use.

For other script types be sure to keep in mind that is VARINT(len(scriptPubKey size) + int(6))

A small addition to the script compression, while you be looking through the Bitcoin core compress.cpp/.h files you may find that during compression either 20 bytes or 33/65 bytes are being manipulated that are not identified as any opcodes.
That’s because of in bitcoin scripting if a field(4-bit hex value) is smaller than OP_PUSHDATA1 = 0x4c=76 is interpreted as an instruction to push the following number of bytes (the exact value of this field) onto the stack – this is called an immediate push.

As I’m not that good in modern C++, it was much easier  for me to follow the compression algorithm when it was done in Golang:
https://github.com/btcsuite/btcd/blob/master/blockchain/compress.go

and in VarInt logic in Python:
https://github.com/christianb93/bitcoin/blob/1241778dc51bd39022611916f90f1bfaf07ee331/btc/serialize.py#L54

Greg did a really really nice explanation of Var Int, that I found so easy to follow. Thank you.
https://learnmeabitcoin.com/guide/varint

Yes, everything is in the code of original BitcoinCore implementation, I think maintainers and developers are doing epic job to keep it nice and clear. It’s because I was lacking some C++ skills I could not understand it properly from the beginning and had to spend some time on it. I just hope my notes will help someone to speed their learning process. Good luck.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
It looks like in the rev*.dat files there is the message header, followed by the block undo data and then a 256-bit checksum of the block undo data.

I admit I never ran Bitcoin Core but I am a C++ developer so I kinda get what it's trying to do.

We already know that blk*.dat are the bitcoin blocks, and the rev*.dat are the block undo data, for each block I'm guessing.

From the stack exchange question you linked, the rev*.dat file should contain an array of CTxUndo records. Each record pairs with a transaction in the block. But an entry corresponding to the coinbase transaction is not included in the rev*.dat files.

Each of these records is actually an array of inputs (Coin, or whatever is replacing CTxInUndo). This is what looks like is dumped to the file in place of each input. Replace the word "UTXO" with "input" and I think you'll get the big picture.

Quote
    varint: 2*height (+1 if it was a coinbase output): the height of the block that created the spent UTXO
    varint: creating transaction's version [only when height > 0]
    CompressedScript: spent UTXO's scriptPubKey
    CompressedAmount: spent UTXO's nValue
newbie
Activity: 3
Merit: 10

Thank you for providing help.
Judging from this commit https://github.com/bitcoin/bitcoin/commit/cb2c7fdac2dc74368ed24ae4717ed72178956b92, it looks like the way the TxInUndoSerializer class, which I assume replaces CTxInUndo ...

In that commit message (...956b92), the proposer says that
Quote
The earlier CTxInUndo class now holds the same information as the Coin
class. Instead of duplicating functionality, replace CTxInUndo with a
serialization adapter for Coin.
So it was changed to provide additional functionality for Coin class just as you described. Thank you for providing explanations, I need to follow the code for them

Quote
Although I'm unsure whether this is the data that is dumped to rev*.dat files.

You are correct, look: https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L1592

fileout << blockundo;

the naming of the functions sometimes needs the context though - UndoWriteToDisk  Cheesy . The undo block is being appended to the file rev.dat file.

Okay, this needs more effort to understand than I thought.

Thank you


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Judging from this commit https://github.com/bitcoin/bitcoin/commit/cb2c7fdac2dc74368ed24ae4717ed72178956b92, it looks like the way the TxInUndoSerializer class, which I assume replaces CTxInUndo, is serialized to a file is that it left-shifts the height, which is 0 if it's not the last unspent, by 1, then ORs it with the coinbase flag at the least significant bit, then writes a dummy version number of 0, then it does those same bitwise operations for the Coin member it has and writes that, followed by some more zeros serialized if the height in that Coin isn't zero (for compatibility reasons), and then it serializes the Coin object followed by the CTxOut object the Coin class contains.

Although I'm unsure whether this is the data that is dumped to rev*.dat files.
newbie
Activity: 3
Merit: 10
I’m currently learning how bitcoin operates by studying the BitcoinCore code and several other bitcoin projects, most of them parsers,  and came to a topic of blockchain data storage of blk*.dat and rev*.dat.

While blk*.dat format is more or less documented and the process is described really good (https://learnmeabitcoin.com/guide/blkdat) most of the resources are lacking descriptions of rev*.dat file format and how the process of handling blockchain reorganization (block undo) works. Some Parsers (https://github.com/citp/BlockSci) have rules not to parse new blocks before a certain amount of confirmations so they do not need to deal with the issues of reorganization.

Right now I’m very interested in the logic  of blockchain reorganization and found some descriptions:
https://en.bitcoin.it/wiki/Bitcoin_Core_0.11_(ch_2):_Data_Storage#Raw_undo_data_.28rev.2A.dat.29
And an awesome post reply by Pieter Wuille:
https://bitcoin.stackexchange.com/questions/57978/file-format-rev-dat
I even tried to visualize that answer in a picture:
https://prnt.sc/s4ile4

The problem is that all those explanations are dated because of the update (https://github.com/bitcoin/bitcoin/pull/10195) that replaced CTxInUndo class with Coin class (https://github.com/bitcoin/bitcoin/commit/cb2c7fdac2dc74368ed24ae4717ed72178956b92 )

I know that code is the best documentation, but I still find it hard to read and understand, so I’m asking for help.

  • How does the logic of reorganization work with the rev*.dat and blk*.dat files exactly?
  • Can you walk me through the code, please?
  • What is the format of the rev*.dat files?
  • Can you explain it with a byte map(byte length for each piece) for a block record in it, please?

Apart from learning how everything works, I’m also interested in these topics because they say that it’s possible to get blockstasts faster using rev*.dat files (https://github.com/bitcoin/bitcoin/pull/14802)

I know we have a similar topic (https://bitcointalksearch.org/topic/what-does-undo-data-used-to-do-how-it-works-214898) but it is not very informative.

Jump to: