
Topic: Bit-error in Block 108009, Tx 23 ? (Read 3649 times)

full member
Activity: 140
Merit: 100
September 22, 2011, 12:22:49 AM
#28
would putting your hardware inside a properly grounded faraday cage reduce the number of errors? alon

Purchasing a server from a respected company instead of a desktop computer from e-machines would go a long way toward reducing the number of errors. Bit rot is a common problem on desktop computers, particularly if they are overclocked or you are pushing the memory to its limit. Just because a machine 'seems' stable doesn't guarantee that random memory errors aren't occurring.
legendary
Activity: 905
Merit: 1011
September 21, 2011, 09:29:27 PM
#27
No, that would stop neither cosmic rays (too much energy) nor atmospheric neutrons (no charge). No simple solution, I'm afraid Sad
sr. member
Activity: 350
Merit: 251
September 21, 2011, 07:33:45 PM
#26
would putting your hardware inside a properly grounded faraday cage reduce the number of errors? alon
sr. member
Activity: 322
Merit: 251
September 21, 2011, 01:11:44 AM
#25

If this were a hardware error, even a 1-in-a-trillion one, I would expect his machine to be unstable, locking up constantly, his file system growing errors on its own and constantly needing to be fsck'd/repaired/etc.  Software errors can be responsible for errors that occur with literally any magnitude of frequency, from one in a zillion all the way to constantly.


I think maaku's experience and mine differ from yours. I had a similar belief during my first 15 years of working in the computer industry.  I did a lot of things with a lot of data and never encountered an obvious internal bit error.

However, in the last ten years I have worked more closely with hardware, and I've seen enough errors to believe that hardware is a perfectly reasonable possible source of error in this case. The recent DEFCON paper on bitsquatting awakened a lot of people to the potential for hardware bit errors. The bandwidth of various transmission types, external and internal, and storage capacity have increased so dramatically that bit errors are popping up in places where they were previously so rare as to be ignorable.

I have personally diagnosed a repeatable bit error which flipped a particular bit of the destination address of a DMA transfer on a PCI bus. It only occurred once every several hours on a heavily loaded network device. It caused one connection to be dropped and there were no other obvious ill effects.

Error correction technology is a slam dunk solution for these types of problems, but it has not yet been applied to all the systems that need it.

Here's the paper this man is referring to: http://good.net/dl/k4r3lj/DEFCON19/DEFCON-19-Dinaburg-Bit-Squatting.pdf

As a geek I thoroughly endorse this conversation. Smiley
legendary
Activity: 905
Merit: 1011
September 21, 2011, 12:57:49 AM
#24
by the same logic, the block chain is full of H0E's too.
That made me smile  Cheesy

EDIT: It is possible to lose money, if the bitflip happens in the receiving address hash of the TxOut in the window between when the script is composed and when the hash of the entire transaction is computed. But it *is* exceedingly unlikely to occur, and could be checked for in the client anyway before broadcasting the tx. If I were running an exchange, it'd be the least of my concerns.
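
A minimal sketch of that pre-broadcast check, in Python: rebuild the expected pay-to-pubkey-hash script from the hash160 you meant to pay and compare it byte-for-byte with the script actually placed in the TxOut. The names intended_hash160 and txout_script are hypothetical placeholders, not the client's actual API.

Code:
# Hypothetical sketch: re-derive the expected P2PKH output script and compare
# it to the one in the transaction before broadcasting. A mismatch would catch
# a bit flip in the address hash between script composition and tx hashing.
def p2pkh_script(hash160: bytes) -> bytes:
    # OP_DUP OP_HASH160 <20-byte push> <hash160> OP_EQUALVERIFY OP_CHECKSIG
    return b"\x76\xa9\x14" + hash160 + b"\x88\xac"

def safe_to_broadcast(intended_hash160: bytes, txout_script: bytes) -> bool:
    return txout_script == p2pkh_script(intended_hash160)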
member
Activity: 64
Merit: 140
September 21, 2011, 12:40:34 AM
#23
I seriously doubt that anyone would lose money as a result of this kind of problem but investing in a light background continuous validation process and some sort of reporting to the end user like a message box that provides instructions for reporting the presence of corrupted blocks...

.. might be a good idea.  (sorry forgot to finish that sentence)
member
Activity: 64
Merit: 140
September 21, 2011, 12:34:57 AM
#22
I've seen enough errors to believe that hardware is a perfectly reasonable possible source of error in this case.

I am fully in agreement that hardware could possibly be the cause.  Perhaps what I'm really trying to say could be summarized like this.

  • Because this software is in beta, it too could be the cause; in fact, this is very likely for that reason alone.  I have had it crash for no reason at all (usually complaining about its own database files after an improper shutdown).  This could be happening to everybody and we may just not know it.
  • RAID is not a solution to the specific conjecture offered.  If this were a hardware error under identical circumstances, RAID would have given no benefit, just because of what RAID is and isn't.
  • Software issues that could contribute to this include the following: misuse of stray pointers, accessing freed memory, threading-related issues, buffer overruns.  Or, it could be hardware.
  • A potential tool to help rule out software issues might be to distribute this blockchain verification code and have others run it.  I'd run it.  Who knows.  Maybe my copy of the block chain will have a similar kind of corruption in a different block.  If it did, then there's likely a software gremlin lurking.

The key is the nature of the corruption, which was described as a bit error. Honestly, I did not read the initial post closely enough, and thought it was a bit flipped in a hash value, but rereading I see the error is a single set bit in a four-byte field that should be all zeros.  That could have been an _overwrite_ of a 32-bit memory location, since zero is an extremely common value and single-bit values are fairly common. There is no clear connection between the two.

A hash bit error, in contrast, would increase my suspicion of hardware. Otherwise, software must have read the value, flipped a specific bit (or ORed it against a mask), and written it back. The crypto stuff might do that, but other types of software error are unlikely to produce that effect. But that debate can be tabled until hash bit errors or other obvious bit errors are reported.

I seriously doubt that anyone would lose money as a result of this kind of problem but investing in a light background continuous validation process and some sort of reporting to the end user like a message box that provides instructions for reporting the presence of corrupted blocks. If that reveals a significant problem (beta testing would probably be enough if that is the case) then the remedy would be an error correction method to replace corrupted blocks from the network (and more instrumentation to try to isolate any software causes).
vip
Activity: 1386
Merit: 1136
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
September 21, 2011, 12:04:06 AM
#21
Random observation from perusing the raw block chain file: for the number of times it contains messages directed to Luke-Jr stating that "god does not exist", from the view of a hex editor, the sequence "G0D" (in ascii) sure appears a god-awful lot in the block chain...

(it seems to occur each place a ScriptSig starts out with the bytes 0x30 0x44)...

by the same logic, the block chain is full of H0E's too.
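
That pattern falls straight out of the encoding: 0x47 is the opcode that pushes 71 bytes (a 70-byte DER signature plus the sighash byte), and a signature of that length starts with the DER header 0x30 0x44, which spells "G0D" in ASCII. Likewise 72-byte pushes start 0x48 0x30 0x45, i.e. "H0E". A rough Python sketch to count them, assuming a local blk0001.dat:

Code:
# Count how often the ASCII byte sequences "G0D" (0x47 0x30 0x44) and
# "H0E" (0x48 0x30 0x45) appear in the raw block file.
with open("blk0001.dat", "rb") as f:
    data = f.read()
print(data.count(b"G0D"), "occurrences of G0D (71-byte signature pushes)")
print(data.count(b"H0E"), "occurrences of H0E (72-byte signature pushes)")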
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
September 20, 2011, 11:11:39 PM
#20
It would take a little bit longer, but it seems it could be checked very reliably through a simple header+merkle root scan.  The blk0001.dat file has the following form:

Code:
MagicBytes(4) |  NumBytesInBlock(4) | Header (80) | NumTx(var_int) | Tx1 | Tx2 | ... | TxN |
MagicBytes(4) |  NumBytesInBlock(4) | Header (80) | NumTx(var_int) | Tx1 | Tx2 | ... | TxN |
...

If there's a bit error in a header, it won't have the leading zeros.  If there's a bit error in a transaction, the merkle root won't match up.  We know the magic bytes in advance, and if the numBytes or numTx values are off, our parser will crash.  With an efficient algorithm for scanning the blockchain, the entire check could be done in less than a minute.
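
A rough sketch of that scan in Python, assuming the blk0001.dat layout shown above. It verifies only the header proof-of-work against the compact target encoded in nBits; a full check would also parse every transaction and recompute the merkle root, as described.

Code:
# Header-level scan of blk0001.dat (record layout as described above).
# Prints a warning for any header whose double-SHA256 does not satisfy
# its own nBits target.
import hashlib
import struct

MAGIC = bytes.fromhex("f9beb4d9")   # main-network magic bytes

def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def scan(path: str = "blk0001.dat") -> None:
    with open(path, "rb") as f:
        index = 0
        while True:
            magic = f.read(4)
            if len(magic) < 4:
                break                                    # clean end of file
            if magic != MAGIC:
                print("bad magic bytes at block", index)
                break
            (num_bytes,) = struct.unpack("<I", f.read(4))
            block = f.read(num_bytes)                    # header + all txs
            header = block[:80]
            nbits = struct.unpack("<I", header[72:76])[0]
            exponent, mantissa = nbits >> 24, nbits & 0xFFFFFF
            target = mantissa << (8 * (exponent - 3))
            if int.from_bytes(double_sha256(header), "little") > target:
                print("header check failed at block", index)
            index += 1

if __name__ == "__main__":
    scan()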
vip
Activity: 1386
Merit: 1136
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
September 20, 2011, 10:54:56 PM
#19
Someone who downloads the blockchain from scratch will get a blk0001.dat file that contains only blocks on the main chain.  However, after the initial download, if you leave your client running, you will invariably pick up the occasional invalid block which will get stored in the blk0001.dat file (because you don't know it's going to be invalid until it's already been added to your file).  This means that your calculated hash will only match between two people if they both just downloaded the blockchain.

Given that, this might still be of some marginal use, I suppose... but if you downloaded only a month ago and there's a difference before byte 200,000,000, such as this one at offset 77,000,000, then it might be worth a further look.  Do you know how old the block chain file you are using is?  My blockchain on this particular machine most likely started downloading on 6/24/2011.

Code:
root@ubuntuZK:~/.bitcoin# head -c 100000000 blk0001.dat | sha256sum
75be1a1d816c11466cdf795cdf52d838fdeaa75b5b585fb1b9846d0f840b5322  -
root@ubuntuZK:~/.bitcoin# head -c 200000000 blk0001.dat | sha256sum
e01158d4b962fa0aa99f16565505f9653cfd6662540f4df44c6cbe1e28b87b6b  -
root@ubuntuZK:~/.bitcoin# head -c 300000000 blk0001.dat | sha256sum
9dfa7d83cf98cd25008c67a549e3baf08916c28289e3e702b5a0347f3d71d2a4  -
root@ubuntuZK:~/.bitcoin# head -c 400000000 blk0001.dat | sha256sum
11131852dcd16fe60a7e28b122732f89af923fa39bac41695abe220f7cc61961  -
root@ubuntuZK:~/.bitcoin# head -c 500000000 blk0001.dat | sha256sum
3e335d6e1720236512d7676f53b1d821d1a2f023a3b61b0ddae4d68cb0542ec2  -
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
September 20, 2011, 10:50:12 PM
#18
Someone who downloads the blockchain from scratch will get a blk0001.dat file that contains only blocks on the main chain.  However, after the initial download, if you leave your client running, you will invariably pick up the occasional invalid block which will get stored in the blk0001.dat file (because you don't know it's going to be invalid until it's already been added to your file).  This means that your calculated hash will only match between two people if they both just downloaded the blockchain.

administrator
Activity: 5166
Merit: 12850
September 20, 2011, 10:44:45 PM
#17
EDIT: Here might be a good way for any volunteers to quickly rule out this same thing.

If I do:
Code:
head -c 610000000 blk0001.dat | sha256sum

Then we should all get the same results, right?  (my blk0001.dat is currently 617037299 bytes long)
Mine is 808717dfd1a8af65b60243c4278aff43454fb8df3b3dc597df67265acf851642.

This is not reliable because orphan blocks are also stored there. My hash is different.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
September 20, 2011, 10:39:38 PM
#16

If this were a hardware error, even a 1-in-a-trillion one, I would expect his machine to be unstable, locking up constantly, his file system growing errors on its own and constantly needing to be fsck'd/repaired/etc.  Software errors can be responsible for errors that occur with literally any magnitude of frequency, from one in a zillion all the way to constantly.


Error correction technology is a slam dunk solution for these types of problems, but it has not yet been applied to all the systems that need it.

Luckily, in the context of bitcoin, this isn't so critical.  There's redundancy everywhere, since there are hashes for everything, and a million other nodes checking your solution before it's accepted.  The only real risk is getting your client dorked up, submitting what you think is a valid transaction and being confused when it's not accepted by the network.  I don't think millions of dollars will be lost in Bitcoins due to this, but certainly, a client that can't recover from a bit error in the blockchain file could cause all sorts of problems for a high-volume user/company that would lose money from downtime.

I have never appreciated the value of ECC RAM, because I've done computation on computers for 10 years without ever noticing an explicit error. Still, I can see how certain applications need guarantees against such errors. Regular RAM errors are measured in errors/day, whereas ECC RAM errors are measured in errors/century.
vip
Activity: 1386
Merit: 1136
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
September 20, 2011, 10:29:52 PM
#15
I've seen enough errors to believe that hardware is a perfectly reasonable possible source of error in this case.

I am fully in agreement that hardware could possibly be the cause.  Perhaps what I'm really trying to say could be summarized like this.

  • Because this software is in beta, it too could be the cause; in fact, this is very likely for that reason alone.  I have had it crash for no reason at all (usually complaining about its own database files after an improper shutdown).  This could be happening to everybody and we may just not know it.
  • RAID is not a solution to the specific conjecture offered.  If this were a hardware error under identical circumstances, RAID would have given no benefit, just because of what RAID is and isn't.
  • Software issues that could contribute to this include the following: misuse of stray pointers, accessing freed memory, threading-related issues, buffer overruns.  Or, it could be hardware.
  • A potential tool to help rule out software issues might be to distribute this blockchain verification code and have others run it.  I'd run it.  Who knows.  Maybe my copy of the block chain will have a similar kind of corruption in a different block.  If it did, then there's likely a software gremlin lurking.

EDIT: Here might be a good way for any volunteers to quickly rule out this same thing.

If I do:
Code:
head -c 610000000 blk0001.dat | sha256sum

Then we should all get the same results, right?  (my blk0001.dat is currently 617037299 bytes long)
Mine is 808717dfd1a8af65b60243c4278aff43454fb8df3b3dc597df67265acf851642.

member
Activity: 64
Merit: 140
September 20, 2011, 09:45:18 PM
#14

If this were a hardware error, even a 1-in-a-trillion one, I would expect his machine to be unstable, locking up constantly, his file system growing errors on its own and constantly needing to be fsck'd/repaired/etc.  Software errors can be responsible for errors that occur with literally any magnitude of frequency, from one in a zillion all the way to constantly.


I think maaku's experience and mine differ from yours. I had a similar belief during my first 15 years of working in the computer industry.  I did a lot of things with a lot of data and never encountered an obvious internal bit error.

However, in the last ten years I have worked more closely with hardware, and I've seen enough errors to believe that hardware is a perfectly reasonable possible source of error in this case. The recent DEFCON paper on bitsquatting awakened a lot of people to the potential for hardware bit errors. The bandwidth of various transmission types, external and internal, and storage capacity have increased so dramatically that bit errors are popping up in places where they were previously so rare as to be ignorable.
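
For anyone unfamiliar with the idea: the paper registers domain names that are one flipped bit away from popular ones and waits for machines with memory errors to connect to them. A hedged Python sketch of just the enumeration step (the domain and the filtering rules here are illustrative only):

Code:
# List printable domain names that differ from a given one by a single
# flipped bit: the candidates a bitsquatter would register.
def bit_neighbors(domain: str):
    raw = domain.encode("ascii")
    out = []
    for i in range(len(raw)):
        for bit in range(8):
            flipped = raw[:i] + bytes([raw[i] ^ (1 << bit)]) + raw[i + 1:]
            candidate = flipped.decode("ascii", errors="ignore")
            if (len(candidate) == len(domain)
                    and candidate != domain
                    and candidate.replace(".", "").replace("-", "").isalnum()):
                out.append(candidate)
    return out

print(bit_neighbors("example.com")[:5])   # a few one-bit neighbours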

I have personally diagnosed a repeatable bit error which flipped a particular bit of the destination address of a DMA transfer on a PCI bus. It only occurred once every several hours on a heavily loaded network device. It caused one connection to be dropped and there were no other obvious ill effects.

Error correction technology is a slam dunk solution for these types of problems, but it has not yet been applied to all the systems that need it.
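
As a toy illustration of what error correction buys, here is a hedged Hamming(7,4) sketch in Python that corrects any single flipped bit in a 4-bit value. Real ECC memory uses a wider single-error-correct, double-error-detect code over 64-bit words, but the principle is the same.

Code:
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits; any single
# flipped bit in the 7-bit codeword can be located and corrected.
def hamming74_encode(nibble: int) -> int:
    d = [(nibble >> i) & 1 for i in range(4)]            # data bits d1..d4
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]          # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(codeword: int) -> int:
    bits = [(codeword >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)                # 1-based error position
    if syndrome:
        bits[syndrome - 1] ^= 1                          # correct the flipped bit
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

codeword = hamming74_encode(0b1011)
corrupted = codeword ^ (1 << 5)                          # flip one bit "in transit"
assert hamming74_decode(corrupted) == 0b1011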

administrator
Activity: 5166
Merit: 12850
September 20, 2011, 09:34:48 PM
#13
So if the error occurred in a block with fewer than 2500 confirmations, the client would find it and correct it in-place?  What if the new block ends up being a different size than the original block on disk?

Bitcoin will fix it by invalidating the broken block and all later blocks. Then it will redownload them all. I'd guess that a copy of the broken blocks and all later blocks would be appended, though I am not very familiar with Bitcoin's database. Try it and see what happens.

This kind of error is not very important. If corruption causes you to make a wrong network decision, then your client will detect that an invalid chain is longer than your chain and your client will go into safe mode.
vip
Activity: 1386
Merit: 1136
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
September 20, 2011, 09:01:49 PM
#12
@casascius: RAID with any kind of redundancy (i.e., not RAID-0 or JBOD) *does* compare reads and notify of an error and/or fill in the gaps if possible.

That is true if a drive is reporting failure, but isn't true if a drive is simply returning wrong data and reporting success (assuming that actually happened), which would be consistent with a "bit failure" scenario that ignores the built-in coding redundancy present in all modern hard drives.

The most the RAID system would see in the event of a "wrong read" is incorrect parity/Q values (RAID 5/6) or an inconsistent mirror (RAID 1).  The most it could do is log the discrepancy advising the operator to do a full-disk verify, start one on its own, and/or recalculate the parity/Q values (RAID 5/6) or choose one version of the mirror to overwrite the other, to eliminate the inconsistent-state condition of the volume.

None of the common RAID setups (0,1,5,6,10,50,60) is equipped by design to validate or correct the output of a hard drive that is reporting successful reads but returning incorrect data.  The RAID system relies on the failure reports of non-functioning drives (or the observation that one or more is missing) to know which drive(s) it needs to work around.
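
Reducing the mirror case to a toy Python sketch (purely illustrative): comparing the two copies of a RAID-1 stripe detects a silent wrong read, but with both drives reporting success the controller has nothing to arbitrate with.

Code:
# RAID-1 verify, stripped to its essence: a mismatch between the two copies
# is detectable, but there is no way to tell which copy is the correct one.
def mirror_verify(copy_a: bytes, copy_b: bytes):
    if copy_a == copy_b:
        return copy_a        # consistent; nothing to decide
    return None              # inconsistent; detected, but not correctable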

If this were a software error, you'd see errors a lot more often than 1-in-a-trillion+.

There are only 15 million transactions per the OP, so it is not clear to me what the 1-in-a-trillion figure is based on.  This transaction could easily be the result of someone experimenting with the network.  If this were a hardware error, even a 1-in-a-trillion one, I would expect his machine to be unstable, locking up constantly, his file system growing errors on its own and constantly needing to be fsck'd/repaired/etc.  Software errors can be responsible for errors that occur with literally any magnitude of frequency, from one in a zillion all the way to constantly.

legendary
Activity: 905
Merit: 1011
September 20, 2011, 08:16:20 PM
#11
@casascius: RAID with any kind of redundancy (i.e., not RAID-0 or JBOD) *does* compare reads and notify of an error and/or fill in the gaps if possible.

If this were a software error, you'd see errors a lot more often than 1-in-a-trillion+.

@iamzill: why would Philip K. Dick's estate sue you? that's exactly how high-reliability systems have worked since the invention of the computer.
vip
Activity: 1386
Merit: 1136
The Casascius 1oz 10BTC Silver Round (w/ Gold B)
September 20, 2011, 07:07:19 PM
#10
I think it is extremely unlikely that this is a hardware-related bit error, compared to the probability that this still-in-beta software actually wrote the file as it is.

Hard drives already use ECC and Reed-Solomon (or similar) encoding internally, even when they're not in a RAID array.  That said, RAID offers no protection against the possibility of a hard drive responding to reads with "success" but the wrong data - something hard drives generally don't do.  RAID only steps in to fill in the gaps when drives report failures and/or an inability to read sectors.

In my view, the odds that the software abused a pointer and overwrote the data in question, or has a bug and isn't behaving like it's supposed to, is about a kazillion times greater than the odds that a stray blip of radiation hit the DRAM in just the right spot and set the bit for the brief interval between receipt of this block and it being flushed to disk.
sr. member
Activity: 677
Merit: 250
September 20, 2011, 06:52:43 PM
#9
How about running 3 computers, each with its own client and blk0001.dat and a copy of your scanner. When an error occurs, take a democratic vote.  If the error rate is X, then this system will reduce the error rate to roughly X*X.
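
A hedged sketch of that vote in Python, with the result strings standing in for whatever per-block value (a hash, say) the three scanners report; none of this is an existing tool.

Code:
# Triple-modular-redundancy style vote: accept a value only if at least
# two of the three independent scanners agree on it.
from collections import Counter

def majority_vote(results):
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None

print(majority_vote(["abc123", "abc123", "def456"]))   # -> abc123
print(majority_vote(["abc123", "def456", "0f19aa"]))   # -> None (no majority)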

I hope Philip K. Dick's estate doesn't sue me for this post  Grin