Pages:
Author

Topic: Blockchain corruption during power loss? (Read 2384 times)

legendary
Activity: 1526
Merit: 1134
July 08, 2013, 11:20:58 AM
#30
Yeah, it does do at least fdatasync in some cases. OK, actually, I don't know the exact requirements LevelDB makes of the OS, but I really doubt it's anything un-POSIXish.
legendary
Activity: 1596
Merit: 1100
LevelDB is designed to survive certain kinds of file system corruption by ensuring that all commits are atomic .... on the assumption that file system renames and write() calls are atomic. If that underlying OS assumption is violated by bugs in the OS or hardware, LevelDB can corrupt itself as would any other database.

Write calls have never been atomic in any Unix-ish OS...  They may be reordered by the OS between fsync/fdatasync calls, and may be reordered again at the hardware (disk) level, unless the OS sends a hardware flush command (FLUSH CACHE / SYNCHRONIZE CACHE).

legendary
Activity: 1526
Merit: 1134
The issue is not suspend/resume when it works, which LevelDB can survive just fine, it's when the OS or hardware itself screws up the resume process.

ACPI suspend/resume is unbelievably complicated, at least the way it used to work required the BIOS to provide an actual program written in a special kind of assembly language run by an ACPI interpreter as part of bringing the system down/up. If anything goes wrong in that process, all bets are off - pretty much anything could happen to the data on disk.

LevelDB is designed to survive certain kinds of file system corruption by ensuring that all commits are atomic .... on the assumption that file system renames and write() calls are atomic. If that underlying OS assumption is violated by bugs in the OS or hardware, LevelDB can corrupt itself as would any other database.
legendary
Activity: 1596
Merit: 1100
But the simpler fix is to just not run Bitcoin-Qt on Macs that might sleep a lot.

If leveldb cannot handle suspend/resume with full data integrity, then we may need to revisit it.

legendary
Activity: 1526
Merit: 1134
You shouldn't have to download, reindexing is enough. Whether it's possible to handle more gracefully really depends on the exact details of what goes wrong.

What we could potentially do is make rolling backups of verified-consistent databases, and then just roll back to the last database then replay from that point onwards. So it'd reduce the amount of reindexing time.

But the simpler fix is to just not run Bitcoin-Qt on Macs that might sleep a lot.
legendary
Activity: 1372
Merit: 1008
1davout
Yeah so the Mac Mini went into a sleep state just like a laptop did. I think the Mini vs laptop distinction is a distraction here, the root cause is failed unsuspends which leads me to wonder if it's even solvable by LevelDB or us. If the machine just fails to wake up then it's obviously hosed internally in some bad way.

Well, the root cause is obviously the Mac, right.
The bug on bitcoind's side though, IMO, is that it forces you to go through hours and hours of reindexing redownloading when this kind of stuff happens instead of handling it more gracefully.


legendary
Activity: 1526
Merit: 1134
Yeah so the Mac Mini went into a sleep state just like a laptop did. I think the Mini vs laptop distinction is a distraction here, the root cause is failed unsuspends which leads me to wonder if it's even solvable by LevelDB or us. If the machine just fails to wake up then it's obviously hosed internally in some bad way.
legendary
Activity: 1372
Merit: 1008
1davout
Is the Mac Mini actually entering a sleep state of some kind? You said it happens when the machine comes back from suspend. Now your machine is just "idling". So which is it? If the computer is just running normally then that'd imply spontaneous random destruction of the db, which I've not seen myself.

The steps that I used to reproduce it before giving up on bitcoind on the Mac mini :

 - Have a good evening coding around, listening to some nice music and stuff,
 - go to bed
 - come back to computer with cup of coffee
 - fail to bring computer back up from idle/sleep/suspend or whatever a mac does when you leave it alone for a while
 - hard reboot it
 - rage at bitcoind casually telling me that the blockchain is corrupted and that i need to reindex

Conclusion : in the presence of coffee, leveldb spontaneously self-destructs.
legendary
Activity: 1526
Merit: 1134
Well, your last paragraph explains why I'm not myself working on pruning right now Wink   (also sipa said he'd do it).

Is the Mac Mini actually entering a sleep state of some kind? You said it happens when the machine comes back from suspend. Now your machine is just "idling". So which is it? If the computer is just running normally then that'd imply spontaneous random destruction of the db, which I've not seen myself.
legendary
Activity: 1372
Merit: 1008
1davout
I also encountered a corrupt LevelDB and it also appeared to be a suspend related issue. My guess - power management on modern Macs is buggy and is likely to cause the file system to lose its integrity in some way. The fact that Mac's do sometimes just die and refuse to unsuspend strongly suggests the presence of fatal errors in their implementation. Recent OS X versions are sloppy in other ways - when the laptop lid is opened and the unsuspend process begins, the first thing it does is display a screenshot of the password entry screen! Of course it's not actually usable for many seconds so any keypresses you make get thrown away. This kind of duplicitous nonsense is classic Apple. Now think - if your power management engineering team is the kind that'd make such a decision, do you trust them to get the details 100% right? I wouldn't.

All that said running nodes on systems that come and go all the time is hardly helping the network and most users will get tired of it sucking up battery and other resources. I can't see running full nodes on laptops being popular in the long run. So fixing this doesn't seem to be very important to me, certainly it shouldn't be seen as blocking pruning. If it's robust on Linux servers, that's the most important thing.

It's not a laptop it's a Mac Mini that simply goes idle from time to time, for example at night when unused.
And bitcoind should probably be a little more resilient than "oh really, you let your computer go idle, let's just re-download the whole chain".

If you think bitcoind should only be resilient on Debian stable in a well-connected datacenter you're going to keep seeing the general decline in nodes that is being experienced.

If you reason this way why would you want to implement pruning at all? After all if bitcoind runs fine on a server with an i7, 1To disk, 32 Go RAM and a 1Gbps connection that's the most important thing right ?
legendary
Activity: 1526
Merit: 1134
I also encountered a corrupt LevelDB and it also appeared to be a suspend related issue. My guess - power management on modern Macs is buggy and is likely to cause the file system to lose its integrity in some way. The fact that Mac's do sometimes just die and refuse to unsuspend strongly suggests the presence of fatal errors in their implementation. Recent OS X versions are sloppy in other ways - when the laptop lid is opened and the unsuspend process begins, the first thing it does is display a screenshot of the password entry screen! Of course it's not actually usable for many seconds so any keypresses you make get thrown away. This kind of duplicitous nonsense is classic Apple. Now think - if your power management engineering team is the kind that'd make such a decision, do you trust them to get the details 100% right? I wouldn't.

All that said running nodes on systems that come and go all the time is hardly helping the network and most users will get tired of it sucking up battery and other resources. I can't see running full nodes on laptops being popular in the long run. So fixing this doesn't seem to be very important to me, certainly it shouldn't be seen as blocking pruning. If it's robust on Linux servers, that's the most important thing.

legendary
Activity: 1372
Merit: 1008
1davout
legendary
Activity: 1596
Merit: 1100
Can you please disclose what OS, OS version, and Bitcoin version you're running?

Discussed this with you a few weeks ago.
Happens to me every single time OSX ML fails to come back up from suspend.
Pretty much the only reason why I stopped running a node.

What is an OSX ML?

kjj
legendary
Activity: 1302
Merit: 1026
Contrary to what KJJ claims— it is actually not supposed to do this

Heh.  I never said that it was supposed to happen.  I was just pointing out the workaround used all around the world by people with unreliable power.

Maybe we should make a sticky at the top for bug reporting best practices.
legendary
Activity: 1372
Merit: 1008
1davout
Can you please disclose what OS, OS version, and Bitcoin version you're running?

Discussed this with you a few weeks ago.
Happens to me every single time OSX ML fails to come back up from suspend.
Pretty much the only reason why I stopped running a node.
legendary
Activity: 2053
Merit: 1356
aka tonikt
I once gave you a snapshot of a naturally corrupt testnet3 DB, that the official client wasn't able to continue with... and you didn't even bother do download it, did you?
And now you suddenly care... Smiley
legendary
Activity: 3472
Merit: 1722
Can you please disclose what OS, OS version, and Bitcoin version you're running?

This also happened to me once on Win7 64, but that was 2 years ago with 0.3-something.
staff
Activity: 4242
Merit: 8672
Can you please disclose what OS, OS version, and Bitcoin version you're running?

I've tried to reproduce unclean shutdown corruption and in hundreds of shutdowns in Linux been unable to do so.

Contrary to what KJJ claims— it is actually not supposed to do this, and at least on some systems it does not appear to (or at least does so with only negligible probability).  I suspect that leveldb has some bugs on some systems/enviroments which degrades its durability, but with basically nothing to go on its hard to determine why.

We absolutely _must_ get this fixed— or at least reduced to negligible probability for all users— before we can support pruning.
legendary
Activity: 2053
Merit: 1356
aka tonikt
just make a backup of your chain, once for awhile.
or download the chain from some torrents, if you hadn't.
otherwise it takes ages to recover from such a lose Smiley
newbie
Activity: 30
Merit: 0
Clever programming simply cannot prevent corruption during a power loss. 

Atomic file operations would ensure that the disk is always in a valid state.  However, it doesn't seem to be a high priority for OS designers.

It is solved. The answer is: ZFS
Here is a link: http://en.wikipedia.org/wiki/ZFS
Pages:
Jump to: