Topic: bitcoin core 0.11.1 - System error while flushing: Database corrupted (Read 2554 times)

member
Activity: 78
Merit: 10
I couldn't let this go, and I finally resolved the problem. The file corruption went away once I formatted the dedicated SSD stick that hosts the block chain (~/bitcoin) with the Flash-Friendly File System (f2fs) instead of the Linux ext4 file system. Since then the Bitcoin node on the Raspberry Pi 2 B has been stable: it has been running for several days, averaging 32 connections, with bitcoin-qt memory usage ranging between 30% and 50%. On this setup, ext4 couldn't handle the load. I also switched to Ubuntu MATE v15.10 (Wily) and Bitcoin Core v0.11.2. I used a 1 MB swap file and it's currently about 30% used (I may bump that to 4 GB). I'm a happy camper with a stable, functioning ~$150 RPi Bitcoin full node. Woohoo!
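
For anyone who wants to confirm the same two things on their own box (which file system the data directory sits on, and how much swap is in use), here is a minimal sketch. The function names are made up for illustration, and it assumes GNU stat and a Linux /proc/meminfo. Note that GNU stat reports ext4 as "ext2/ext3".

```shell
# Print the type of the file system that holds a directory (GNU stat).
fs_type() {         # usage: fs_type <dir>
  stat -f -c %T "$1"
}

# Print swap usage as a whole-number percentage.
# Reads /proc/meminfo unless another meminfo-style file is given.
swap_used_pct() {   # usage: swap_used_pct [meminfo_file]
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END { if (total > 0) printf "%d\n", (total - free) * 100 / total
             else print 0 }' "${1:-/proc/meminfo}"
}
```

Running "fs_type /media/pi/BTC" on the mount point should print f2fs after the reformat, and "swap_used_pct" with no argument prints the current swap usage.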

-Ondart
member
Activity: 78
Merit: 10
2112, I totally agree with you. It's a science experiment to attempt to build a stable full bitcoin node for about $150, with room to double the size of the block chain on disk, and have it reliable and on 24/7. The allure of $20 a year in electricity costs is a factor. However, if I can run these SSD read/write tests for days on end without error on the Pi, and given that the CPU with bitcoind running was averaging 25-50% with memory stable and the page file rarely used, I would argue that this is not a hardware problem. It may be something else, and this may still work. I certainly could be wrong. I've got it running now with the looping md5 hash script and intend to run it for a few days.

If the SSD hardware proves to be failing, then your suggestion of using a disk with its own power supply is a good one. On the flip side, however, I think this System76 unit would be the way to go. I've been eyeballing it for a while now. It might be a good candidate for a never-touched Cold Storage Wallet or a Bitcoin Full Node.

https://system76.com/desktops/meerkat

-Ondart

legendary
Activity: 2128
Merit: 1073
Quote
What other logs do you recommend I monitor for hardware errors?
On Raspberry Pi there's absolutely nothing in it that does error detection. This is a piece of hardware that is marginal in every respect, including minimalist quality assurance of the shipped products.

Think of RPi as RAIC: Redundant Array of Inexpensive Computers. You really need several of them (at least 2) running in parallel and constantly compare results.

The simplest thing to do in your case is to replace the USB SSD with a USB-connected spinning disk drive that has a separate power supply. It will take a little longer to synchronize initially, but it will rule out a weak USB power supply as the cause.

The other way is to hook up your USB SSD through a separately powered USB hub.
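
On the question of what to watch: the kernel log is where USB resets and block-layer I/O errors usually show up if the drive is browning out. A small filter along these lines can help. It is only a sketch: the function name is made up, and the pattern list is illustrative rather than exhaustive, so the exact message wording may differ between kernel versions.

```shell
# Keep only log lines that hint at storage or USB power trouble.
# Reads log text on stdin and prints the matching lines.
suspect_lines() {
  grep -i -E 'i/o error|reset [a-z-]+ usb device|under-voltage|ext4-fs error'
}
```

Typical use would be "dmesg | suspect_lines", or the same filter applied to the syslog file that is already being tailed.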

Anyway, you really need to realize that if you think of running a financial application on a computer without ECC RAM, you are in a state of sin. It is a road to nowhere, except as a way to learn troubleshooting and fault localization skills.

member
Activity: 78
Merit: 10
Quote
recipe for triggering errors in disk controllers

Hey 2112, thanks for the idea of heavily exercising my SSD. I didn't get as sophisticated as your concept of using the output as input to generate a never-ending process that would fill the disk. But I did use your idea to take the existing blocks on the SSD and MD5-hash each one into its own file in a test folder on the same SSD. This is effectively reading, processing and writing out to the same SSD. It should run for a long time and, I would think, show me whether there is a hardware problem. I am not sure where I will go next if this produces no errors. I am tailing the syslog and the user.log. What other logs do you recommend I monitor for hardware errors?

Here is my script for reference:

Code:
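#!/bin/bash
# Endless read/hash/write loop to exercise the SSD; stop it with Ctrl-C.

cd /media/pi/BTC/.bitcoin/blocks || exit 1
output=/media/pi/BTC/test
n=1
while true
do
 # One output directory per pass over the block files.
 mkdir -p "$output/dir$n"
 for f in *
 do
   # Skip subdirectories such as index/; md5sum only works on files.
   [ -f "$f" ] || continue
   echo "$f"
   md5sum "$f" | cut -d ' ' -f 1 > "$output/dir$n/$f.md5"
   cat "$output/dir$n/$f.md5"
 done
 n=$(( n+1 ))
done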

One thing to note also is that I am using a 128 GB Corsair Voyager GTX flash drive. It has an embedded SSD controller, which is the reason I went with it. However, unlike normal SSDs, it has no separate power input. I am wondering whether the Pi can supply enough current for it; if not, that may be the root cause.
-Ondart
legendary
Activity: 2128
Merit: 1073
Quote
what else can I do to either pin it down that there is an actual hardware error
Make sure you stress simultaneous reading and writing to the same drive and that the drive is the bottleneck in both directions, not the network transfer.

My general (not bitcoin specific) recipe for triggering errors in disk controllers is as follows:

1) keep the source tree (e.g. ~/.bitcoin) on the local disk that is mounted read-only.
2) it is acceptable to have source over the net only if the net is gigabit
3) recursively compute MD5 checksum of the source tree
Code:
cd ~/.bitcoin; find . -depth -type f -exec md5sum "{}" \;
4) start filling the target disk with recursive copies of the source, made not with "cp -r" or "scp -r" but with piped tar "tar cf - . | tar xvf -" (the extracting tar running in the target directory) or cpio "find . -depth | cpio -o | cpio -i"
5) as soon as the first copy is complete, start verifying it in parallel against the MD5 checksums computed in step (3)
6) keep repeating steps (4) and (5) until the last copy fails with the target disk nearly 100% full
7) remove exactly one target copy, making room for exactly one more copy of the source
8) keep doing the above over a weekend
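
The recipe above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are made up for illustration, any copy failure is treated as "target disk full", the manifest must be given as an absolute path outside the source tree, and md5sum's --quiet option is assumed (GNU coreutils).

```shell
#!/bin/bash
# Sketch of the copy-and-verify torture loop; names are illustrative only.

# Step 3: record an MD5 manifest of every file under the source tree.
make_manifest() {   # usage: make_manifest <src_dir> <manifest_file>
  ( cd "$1" && find . -depth -type f -exec md5sum {} + ) > "$2"
}

# Step 4: copy the tree through a tar pipe into a fresh target directory.
# The exit status is that of the extracting tar (non-zero when the disk fills).
copy_tree() {       # usage: copy_tree <src_dir> <dst_dir>
  mkdir -p "$2" && tar -C "$1" -cf - . | tar -C "$2" -xf -
}

# Step 5: re-read a copy and compare it against the manifest.
# Returns non-zero if any file is missing or its checksum differs.
verify_copy() {     # usage: verify_copy <dst_dir> <absolute_manifest_path>
  ( cd "$1" && md5sum --quiet -c "$2" )
}

# Steps 4-8: keep copying and verifying until interrupted.
run_torture() (     # usage: run_torture <src_dir> <dst_root> <manifest>
  n=1 oldest=1
  while true; do
    if copy_tree "$1" "$2/copy$n"; then
      # Verify in the background so these reads overlap the next copy's writes.
      { verify_copy "$2/copy$n" "$3" || echo "MISMATCH in copy$n" >&2; } &
      n=$(( n + 1 ))
    else
      # Give up if not even one complete copy exists to remove.
      [ "$oldest" -lt "$n" ] || exit 1
      # Let running verifications finish before deleting anything.
      wait
      # Drop the partial copy plus exactly one full copy (the oldest).
      rm -rf "$2/copy$n" "$2/copy$oldest"
      oldest=$(( oldest + 1 ))
    fi
  done
)
```

Typical use would be "make_manifest ~/.bitcoin /tmp/source.md5" followed by "run_torture ~/.bitcoin /mnt/target /tmp/source.md5", where /mnt/target stands for wherever the drive under test is mounted. Any MISMATCH line on stderr means a copy did not read back identical to the source.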

I'm not up to speed on the current SSD market, but in the past only Intel SSD drives survived this kind of torture. That is, only Intel from the "reasonably priced" segment; we also tested some models from the "ridiculously expensive" segment. Our tests also involved doing very similar things through commercial database engines over many drives in parallel (not as RAID but as FILEGROUP).

member
Activity: 78
Merit: 10
I've got a very similar problem, but mine seems to happen even after I have a synced block chain, side-loaded from another system. I am running v0.11.1.0-0720324 of Bitcoin Core on Raspbian Jessie. I've got a new Corsair SSD attached and have run extended SMART self-tests without errors. To exercise the disk for a long period, I purposely transferred the entire block chain from my Ubuntu machine to the Raspberry Pi SSD over SSH, and it ran for over 24 hours without error. I have checked the system logs and there are no hardware errors. I can also plug the SSD directly into my Ubuntu workstation and transfer the block chain to it for hours without error.

Whenever I start bitcoind or bitcoin-qt, it crashes within 10-15 minutes with the following errors. I have throttled the external connections to the bitcoin full node to 15 total to lighten the load, with no change in behaviour. I really don't think I have a hardware issue, but what else can I do to either pin down an actual hardware error or find the problem with my version of Bitcoin Core or with LevelDB?

2015-10-31 05:32:16 ERROR: CScriptCheck(): 2fd54cf208fc7f38a53d49fbfc49261d801bea5eed4f0f7103f385ab5dbd43fe:0 VerifySignature failed: Non-canonical signature: S value is unnecessarily high
2015-10-31 05:32:16 ERROR: AcceptToMemoryPool: ConnectInputs failed 2fd54cf208fc7f38a53d49fbfc49261d801bea5eed4f0f7103f385ab5dbd43fe
2015-10-31 05:32:31 ERROR: CScriptCheck(): 6b611a1ee1958f9fe369764e4b2aede0f93da2dfb608247a31d23bf548465bda:1 VerifySignature failed: Non-canonical signature: S value is unnecessarily high
2015-10-31 05:32:31 ERROR: AcceptToMemoryPool: ConnectInputs failed 6b611a1ee1958f9fe369764e4b2aede0f93da2dfb608247a31d23bf548465bda
2015-10-31 05:33:09 LevelDB read failure: Corruption: block checksum mismatch
2015-10-31 05:33:09 Corruption: block checksum mismatch


When I restart I get "Error opening block database. Do you want to rebuild the block chain now?" After cancelling this I get the same error:

2015-10-31 05:40:45 init message: Verifying Blocks. . .
2015-10-31 05:40:45 Verifying last 288 blocks at level 3
2015-10-31 05:40:53 LevelDB read failure: Corruption: block checksum mismatch
2015-10-31 05:40:53 Corruption: block checksum mismatch
2015-10-31 05:41:29 Aborted block database rebuild. Exiting.
2015-10-31 05:41:29 scheduler thread interrupt
2015-10-31 05:41:29 Shutdown: In progress...
2015-10-31 05:41:29 RPCAcceptHandler: Error: Operation canceled
2015-10-31 05:41:29 StopNode()
2015-10-31 05:41:29 Shutdown: done
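
To see whether the corruption always strikes at the same moment or moves around between runs, the corruption lines can be pulled out of debug.log. A minimal sketch follows; the function names are made up, and the two patterns are simply the messages that appear in the logs quoted in this thread.

```shell
# Print the first line of a log that reports database corruption, if any.
first_corruption() {   # usage: first_corruption <debug.log>
  grep -m 1 -E 'Corruption: |Database corrupted' "$1"
}

# Print how many lines of a log report database corruption.
count_corruption() {   # usage: count_corruption <debug.log>
  grep -c -E 'Corruption: |Database corrupted' "$1"
}
```

Running "first_corruption ~/.bitcoin/debug.log" after each crash and comparing the timestamps and the preceding lines gives a quick picture of whether the failure point is stable.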




staff
Activity: 3458
Merit: 6793
Maybe you have a hardware problem. The database could be getting corrupted from failing hardware. Try syncing to another disk.
full member
Activity: 360
Merit: 100
I've tried re-downloading the blockchain multiple times; it fails at a different point each time.

Really becoming very annoying ... almost too painful to bother with Core anymore. WTF?

Very high-end PC with tons of RAM and an SSD.

I've been running a full node for years, but lately it seems every other week I have to start from scratch.

Thoughts?

Reindexing doesn't ever finish either:

2015-10-29 17:39:32 UpdateTip: new best=000000000000000034bb5058b8092dd23430d1fe718800951a8ba8e568387eb0  height=302865  log2_work=78.832045  tx=39575860  date=2014-05-27 13:06:07 progress=0.248851  cache=57.6MiB(11642tx)
2015-10-29 17:39:32 LoadExternalBlockFile: Processing out of order child 00000000000000001b7dd8b9f447e34719dd17d63fac422e1c29b1dfc856faf4 of 000000000000000034bb5058b8092dd23430d1fe718800951a8ba8e568387eb0
2015-10-29 17:39:33 UpdateTip: new best=00000000000000001b7dd8b9f447e34719dd17d63fac422e1c29b1dfc856faf4  height=302866  log2_work=78.832165  tx=39576346  date=2014-05-27 13:15:25 progress=0.248862  cache=61.6MiB(12511tx)
2015-10-29 17:39:33 Corruption: block checksum mismatch
2015-10-29 17:39:33 *** System error while flushing: Database corrupted
2015-10-29 17:55:22 ERROR: ProcessNewBlock: ActivateBestChain failed
2015-10-29 17:55:22 socket sending timeout: 1467s
2015-10-29 17:55:22 opencon thread interrupt
2015-10-29 17:55:22 addcon thread interrupt
2015-10-29 17:55:22 scheduler thread interrupt
2015-10-29 17:55:22 net thread interrupt
2015-10-29 17:55:23 msghand thread interrupt
2015-10-29 17:55:23 Shutdown: In progress...
2015-10-29 17:55:23 StopNode()
2015-10-29 17:55:23 Corruption: block checksum mismatch
2015-10-29 17:55:23 *** System error while flushing: Database corrupted
2015-10-29 17:55:29 Shutdown: done
