Bit-error in Block 108009, Tx 23 ? - page 2.

maaku

legendary

Activity: 905

Merit: 1014

Here's a paper on the frequency of random bit errors:

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

If you are doing anything involving transactions of real-world value and you are not using ECC and RAID, you're doing it wrong. If you're handling transactions of extremely high value and you are not using redundant computation (asking multiple machines to generate/process transactions, and comparing results), you're doing it wrong.

The database systems that handle transactions for stock exchanges, air travel, health devices, etc. have redundant storage, redundant memory, and redundant processing. Google "HP NonStop" for a good commercial example.
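To make the "redundant computation" idea concrete, here is a minimal sketch of comparing results from several workers and accepting only a majority answer. Everything in it is an assumption for illustration: `majority_result` is a hypothetical helper, and the "workers" are plain callables standing in for separate machines.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_result(workers: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run every worker and return the value a strict majority agrees on.

    Hypothetical helper: each worker stands in for an independent machine
    computing the same result. Raises RuntimeError if no value has a majority.
    """
    results = [worker() for worker in workers]
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority, results disagree: %r" % (results,))
    return value


# Three "machines" compute the same field; one suffers a single-bit error.
print(majority_result([lambda: 0, lambda: 1024, lambda: 0]))
```

With only two workers a disagreement can be detected but not resolved, which is why the sketch raises instead of guessing.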

etotheipi

legendary

Activity: 1428

Merit: 1093

Core Armory Developer

Quote from: maaku on September 20, 2011, 03:04:56 PM

If you're this thorough about everything in life, I'm surprised you haven't encountered such an error before. Random one-bit errors do happen--I work with big data on consumer hardware and encounter this sort of thing about once a year or so. (That's why ECC memory and RAID exist.) With networks the situation is even worse, because there are some combinations of hardware routers and firmware versions that cause checksum fields to be overwritten rather than verified--and the chance of a transmission error over a network is much larger.

Well, when it comes to Bitcoin, hashing, cryptography, and millions of dollars, I don't have much of a choice but to be extremely thorough. There is no "partial credit," and the cost of mishandling large transactions is not worth the time saved by taking shortcuts. I want to get this stuff right from the start.

This problem is the kind of thing that happens so rarely that it may not be handled elegantly in the current client. I'm actually more concerned with the fact that I've made an assumption about blk0001.dat that, if it doesn't hold, will botch my tx scanning code. If it's an extreme boundary condition, then perhaps I can just work on detecting it and redownload/rescan the chain when it occurs. It's annoying, but it happens infrequently enough to not cause major problems.

maaku

legendary

Activity: 905

Merit: 1014

If you're this thorough about everything in life, I'm surprised you haven't encountered such an error before. Random one-bit errors do happen--I work with big data on consumer hardware and encounter this sort of thing about once a year or so. (That's why ECC memory and RAID exist.) With networks the situation is even worse, because there are some combinations of hardware routers and firmware versions that cause checksum fields to be overwritten rather than verified--and the chance of a transmission error over a network is much larger.

That said, there's no reason for the client to re-verify the transaction until it is actually used. I would hope that it would then discover the error and request the correct version from another node.

etotheipi

legendary

Activity: 1428

Merit: 1093

Core Armory Developer

So if the error occurred in a block with fewer than 2500 confirmations, the client would find it and correct it in-place? What if the new block ends up being a different size than the original block on disk?

Last week, I was trying to decide if I could make the assumption that I will always be able to read the blocks in the file order of blk0001.dat and know that all TxIns read will reference a TxOut that I've seen before. If the block data is corrected by appending to the end of the file, this assumption fails. But it also seems like it would be "difficult" to correct the block in place if it's a different size than the original, erroneous block data.

Until I spent four hours looking for this bit-flip, I had actually ruled out any possibility of the blocks being out of order. If I can't make this assumption, tx scanning gets more complicated, since TxIns don't always identify the redeeming address. You have to look at the TxOut each one references to know for sure. If the correct TxOut comes after the TxIn, then you have no way to tell if it's yours on a single pass. That might require two scans of the block chain to find all your transactions...
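For what it's worth, the two-scan idea can be sketched roughly like this. It is only an illustration under simplifying assumptions: the `Tx` record, its field names, and `scan_wallet` are hypothetical, and transactions are assumed to be already parsed into (txid, inputs, outputs) rather than read from blk0001.dat.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple

OutPoint = Tuple[str, int]  # (txid of the funding tx, output index)


class Tx(NamedTuple):
    """Hypothetical, already-parsed transaction record."""
    txid: str
    inputs: List[OutPoint]            # outpoints this tx spends
    outputs: List[Tuple[str, int]]    # (address, value) pairs


def scan_wallet(txs: Iterable[Tx], my_addresses: Set[str]) -> Dict[OutPoint, int]:
    """Return unspent outputs paying my_addresses, independent of tx order."""
    txs = list(txs)

    # Pass 1: record every TxOut that pays one of our addresses.
    mine: Dict[OutPoint, int] = {}
    for tx in txs:
        for index, (address, value) in enumerate(tx.outputs):
            if address in my_addresses:
                mine[(tx.txid, index)] = value

    # Pass 2: every TxIn can now be resolved, even if its TxOut came later in the file.
    spent = {op for tx in txs for op in tx.inputs if op in mine}
    return {op: value for op, value in mine.items() if op not in spent}


funding = Tx("aa", [], [("mine", 50), ("other", 10)])
spending = Tx("bb", [("aa", 0)], [("mine", 20), ("other", 30)])

# Same answer whether the funding tx is stored before or after the spending tx.
print(scan_wallet([funding, spending], {"mine"}))
print(scan_wallet([spending, funding], {"mine"}))
```

The cost is holding the outpoint map in memory (or reading the file twice), which is the complication a strict file-order assumption would have avoided.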

theymos

administrator

Activity: 5222

Merit: 13032

I believe that the error will be detected and fixed if you run Bitcoin with -checkblocks. Bitcoin only checks the last 2500 blocks by default.

etotheipi

legendary

Activity: 1428

Merit: 1093

Core Armory Developer

My question is: what is the client supposed to do when this happens? It has the correct block headers, but doesn't read the correct tx. So does it go to the network and request the tx list for that block again? If so, then wouldn't the replacement be appended to the end of the file, effectively making the blk0001.dat file contain out-of-order block data?

I've never experienced a bit error like this in any application on any computer, ever (or at least never noticed). It seems like one of those boundary conditions where I bet the client isn't prepared to handle this...
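Detecting the mismatch itself is the easy part. Below is a minimal sketch of recomputing a Bitcoin-style merkle root (double SHA-256, last hash duplicated on odd levels) from the raw transactions and comparing it with the root from the header. The function names are my own, hashes are kept in internal byte order, and this says nothing about what the reference client actually does when the check fails.

```python
import hashlib
from typing import List


def dsha256(data: bytes) -> bytes:
    """Double SHA-256, as used for Bitcoin tx and block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root(tx_hashes: List[bytes]) -> bytes:
    """Merkle root over tx hashes given in internal byte order."""
    if not tx_hashes:
        raise ValueError("a block has at least one transaction")
    level = list(tx_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: pair the last hash with itself
        level = [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def block_is_intact(header_merkle_root: bytes, raw_txs: List[bytes]) -> bool:
    """True if the stored transactions still hash to the root in the header."""
    return merkle_root([dsha256(tx) for tx in raw_txs]) == header_merkle_root


# Toy data standing in for serialized transactions.
txs = [b"tx-one", b"tx-two", b"tx-three"]
root = merkle_root([dsha256(tx) for tx in txs])

corrupted = bytearray(txs[1])
corrupted[0] ^= 0x04  # flip a single bit
print(block_is_intact(root, txs))
print(block_is_intact(root, [txs[0], bytes(corrupted), txs[2]]))
```

A scanner could run this per block and flag any block whose stored transactions no longer match the header, then decide separately how to re-fetch it.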

maaku

legendary

Activity: 905

Merit: 1014

Bit atrophy? A stray cosmic ray?

etotheipi

legendary

Activity: 1428

Merit: 1093

Core Armory Developer

While testing my blockchain scanning tools, I have been able to match Block Explorer on every single block header hash and tx hash except for a single transaction. I get the correct answer for 1.5 million transactions, and the wrong answer for exactly 1. I have spent a lot of time investigating this and I believe I've narrowed it down to a single bit error!?! How on earth would this happen? Shouldn't the client choke on this? Does anyone else have this?

It is in Block 108009, in transaction #23. In the blk0001.dat file, the tx starts at byte 77,582,676 and is 258 bytes long. More specifically, the bytes in question should be a sequence of four zero bytes starting at byte 77,582,930 (the last four bytes of the tx).

Here is the raw transaction as it appears in my blockchain:

Code:

01000000 01a359fe 89ccaf10 07cb285a f28ae745 548fa4e6 b8424eab 70f6bab9 
fae180fa af000000 008b4830 45022058 386a1553 2b610495 db44a8a6 3647d03f 
4ec3ed62 89cfd016 2cb634a5 14514302 2100f7f1 f795dd0e 955aa398 f5e01397 
0ffa7bc4 ef9296b7 c20e6738 525dda2b 3e150141 04dcf7a3 14525ad5 9e749990 
cddd7e7d e0dd4fca 7b77d5f1 7eead167 f0f51856 27a8354b 83a50384 495f37bb 
1463bf6d 11052392 ff6003aa f230035c 4dea8b3e b2ffffff ff02408f e1080000 
00001976 a914c515 97215026 9a7761be ea7a4ecd c750de5f 87d888ac c0ccd208 
01000000 1976a914 e5484641 d51e83d6 7882329f 7cc4c723 69c8db13 88ac0004 0000

The last 4 bytes (8 characters) represent nLockTime. This transaction appears to have an nLockTime of 1024 (bytes 00 04 00 00, read as a little-endian integer, i.e. 0x00000400)!?! BlockExplorer says it should be 0 (as expected). This couldn't be malicious, because nLockTime is disabled and an nLockTime of 1024 in block 108009 is pointless. This has been driving me crazy, and now I am baffled that my client has let me get away with this. What is the client supposed to do if it reads a tx from blk0001.dat that doesn't match the merkle root in the header?
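As a sanity check on the numbers, here is a small sketch that decodes nLockTime from the raw bytes quoted above and counts how many bits differ from the expected value of 0. The helper names are mine. The repaired txid it prints is meant to be compared against a block explorer by hand; I am not asserting what that value is.

```python
import hashlib
import struct

# Raw transaction exactly as quoted above.
RAW_TX_HEX = """
01000000 01a359fe 89ccaf10 07cb285a f28ae745 548fa4e6 b8424eab 70f6bab9
fae180fa af000000 008b4830 45022058 386a1553 2b610495 db44a8a6 3647d03f
4ec3ed62 89cfd016 2cb634a5 14514302 2100f7f1 f795dd0e 955aa398 f5e01397
0ffa7bc4 ef9296b7 c20e6738 525dda2b 3e150141 04dcf7a3 14525ad5 9e749990
cddd7e7d e0dd4fca 7b77d5f1 7eead167 f0f51856 27a8354b 83a50384 495f37bb
1463bf6d 11052392 ff6003aa f230035c 4dea8b3e b2ffffff ff02408f e1080000
00001976 a914c515 97215026 9a7761be ea7a4ecd c750de5f 87d888ac c0ccd208
01000000 1976a914 e5484641 d51e83d6 7882329f 7cc4c723 69c8db13 88ac0004 0000
"""


def decode_locktime(raw_tx: bytes) -> int:
    """nLockTime is the final 4 bytes of the serialized tx, little-endian uint32."""
    return struct.unpack("<I", raw_tx[-4:])[0]


def differing_bits(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def txid(raw_tx: bytes) -> str:
    """Double SHA-256 of the tx, byte-reversed into the usual display order."""
    return hashlib.sha256(hashlib.sha256(raw_tx).digest()).digest()[::-1].hex()


raw_tx = bytes.fromhex("".join(RAW_TX_HEX.split()))
locktime = decode_locktime(raw_tx)
print(len(raw_tx), locktime, differing_bits(locktime, 0))  # 258 1024 1

# Put the expected nLockTime of 0 back and hash the result for manual comparison.
repaired = raw_tx[:-4] + struct.pack("<I", 0)
print(txid(repaired))
```

An nLockTime of 1024 is bit 10 set and nothing else, which is consistent with a single flipped bit rather than a deliberately crafted value.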
