Blockchain corruption during power loss?

Mike Hearn

legendary

Activity: 1526

Merit: 1134

Yeah, it does do at least fdatasync in some cases. OK, actually, I don't know the exact requirements LevelDB makes of the OS, but I really doubt it's anything un-POSIXish.

jgarzik

legendary

Activity: 1596

Merit: 1100

Quote from: Mike Hearn on July 08, 2013, 10:09:42 AM

LevelDB is designed to survive certain kinds of file system corruption by ensuring that all commits are atomic .... on the assumption that file system renames and write() calls are atomic. If that underlying OS assumption is violated by bugs in the OS or hardware, LevelDB can corrupt itself as would any other database.

Write calls have never been atomic in any Unix-ish OS... They may be reordered by the OS between fsync/fdatasync calls, and may be reordered again at the hardware (disk) level, unless the OS sends a hardware flush command (FLUSH CACHE / SYNCHRONIZE CACHE).
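To make the ordering point concrete, here is a minimal Python sketch (not code from Bitcoin or LevelDB) of the usual way applications get an all-or-nothing file update on a POSIX-like system: write a temporary file, fsync it, rename it over the target, then fsync the directory. The helper name `durable_replace` is hypothetical.

```python
import os
import tempfile


def durable_replace(path: str, data: bytes) -> None:
    """Replace `path` so that a crash leaves either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same file system)
    # so that the final rename does not turn into a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # drain Python's userspace buffer
            os.fsync(tmp.fileno())  # ask the OS to push the data to the device
        # The rename is the commit point: readers see old or new, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # On POSIX systems, also fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Without the fsync before the rename, the OS is free to persist the rename before the data, which is exactly the reordering described above. Note also that on macOS a plain fsync may not force the drive's own cache to flush; Apple documents the F_FULLFSYNC fcntl for that, and this sketch does not use it.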

Mike Hearn

legendary

Activity: 1526

Merit: 1134

The issue is not suspend/resume when it works, which LevelDB can survive just fine; it's when the OS or hardware itself screws up the resume process.

ACPI suspend/resume is unbelievably complicated. At least the way it used to work, the BIOS had to provide an actual program, written in a special kind of assembly language and run by an ACPI interpreter, as part of bringing the system down/up. If anything goes wrong in that process, all bets are off - pretty much anything could happen to the data on disk.

LevelDB is designed to survive certain kinds of file system corruption by ensuring that all commits are atomic .... on the assumption that file system renames and write() calls are atomic. If that underlying OS assumption is violated by bugs in the OS or hardware, LevelDB can corrupt itself as would any other database.
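As an illustration of how an append-only log can cope with a write that was only partly persisted, here is a minimal Python sketch of checksummed records. It is not LevelDB's actual on-disk format; the record layout, the use of `zlib.crc32`, and the names `encode_record` and `recover` are assumptions made for the example.

```python
import struct
import zlib
from typing import List

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, then payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload so that a torn or corrupted write can be detected later."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(log: bytes) -> List[bytes]:
    """Return every intact record, stopping at the first damaged one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, checksum = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        # A short payload means a truncated write; a bad CRC means corrupt bytes.
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break
        records.append(payload)
        offset = start + length
    return records
```

For example, `recover(encode_record(b"a") + encode_record(b"b")[:-1])` returns only the first record. A scheme like this detects a damaged tail, but it cannot help if the OS or hardware damages data that was already reported as safely written, which is the failure mode being discussed here.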

jgarzik

legendary

Activity: 1596

Merit: 1100

Quote from: Mike Hearn on July 08, 2013, 07:14:11 AM

But the simpler fix is to just not run Bitcoin-Qt on Macs that might sleep a lot.

If leveldb cannot handle suspend/resume with full data integrity, then we may need to revisit it.

Mike Hearn

legendary

Activity: 1526

Merit: 1134

You shouldn't have to re-download; reindexing is enough. Whether it's possible to handle more gracefully really depends on the exact details of what goes wrong.

What we could potentially do is make rolling backups of verified-consistent databases, and then just roll back to the last good backup and replay from that point onwards. That would reduce the amount of reindexing time.

But the simpler fix is to just not run Bitcoin-Qt on Macs that might sleep a lot.
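The rolling-backup idea above could look roughly like the following Python sketch. Nothing here is Bitcoin-Qt code: the function names, the `snapshot-<height>` directory layout, and the caller-supplied `verify` check are all assumptions, and the database is assumed to be closed while it is copied.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def list_snapshots(backup_root: Path) -> List[Path]:
    """Completed snapshots, oldest first (names are zero-padded block heights)."""
    if not backup_root.exists():
        return []
    return sorted(p for p in backup_root.iterdir()
                  if p.name.startswith("snapshot-") and not p.name.endswith(".partial"))


def take_snapshot(db_dir: Path, backup_root: Path, height: int,
                  verify: Callable[[Path], bool], keep: int = 3) -> Optional[Path]:
    """Copy the database only if `verify` accepts it, then prune old snapshots."""
    if not verify(db_dir):
        return None
    backup_root.mkdir(parents=True, exist_ok=True)
    target = backup_root / f"snapshot-{height:08d}"
    staging = backup_root / f"snapshot-{height:08d}.partial"
    for stale in (staging, target):
        if stale.exists():
            shutil.rmtree(stale)
    shutil.copytree(db_dir, staging)
    staging.rename(target)  # only fully copied snapshots get the final name
    for old in list_snapshots(backup_root)[:-keep]:
        shutil.rmtree(old)
    return target


def restore_latest(db_dir: Path, backup_root: Path) -> Optional[int]:
    """Replace the database with the newest snapshot; return its block height."""
    snapshots = list_snapshots(backup_root)
    if not snapshots:
        return None
    latest = snapshots[-1]
    if db_dir.exists():
        shutil.rmtree(db_dir)
    shutil.copytree(latest, db_dir)
    return int(latest.name.split("-")[1])
```

After a restore, the node would only need to replay blocks above the returned height instead of reindexing from scratch. The cost is disk space: each snapshot here is a full copy of the database.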

davout

legendary

Activity: 1372

Merit: 1008

1davout

Quote from: Mike Hearn on July 08, 2013, 06:42:15 AM

Yeah so the Mac Mini went into a sleep state just like a laptop did. I think the Mini vs laptop distinction is a distraction here, the root cause is failed unsuspends which leads me to wonder if it's even solvable by LevelDB or us. If the machine just fails to wake up then it's obviously hosed internally in some bad way.

Well, the root cause is obviously the Mac, right.
The bug on bitcoind's side though, IMO, is that it forces you to go through hours and hours of reindexing or redownloading when this kind of stuff happens instead of handling it more gracefully.

Mike Hearn

legendary

Activity: 1526

Merit: 1134

Yeah so the Mac Mini went into a sleep state just like a laptop did. I think the Mini vs laptop distinction is a distraction here, the root cause is failed unsuspends which leads me to wonder if it's even solvable by LevelDB or us. If the machine just fails to wake up then it's obviously hosed internally in some bad way.

davout

legendary

Activity: 1372

Merit: 1008

1davout

Quote from: Mike Hearn on July 08, 2013, 06:27:03 AM

Is the Mac Mini actually entering a sleep state of some kind? You said it happens when the machine comes back from suspend. Now your machine is just "idling". So which is it? If the computer is just running normally then that'd imply spontaneous random destruction of the db, which I've not seen myself.

The steps that I used to reproduce it before giving up on bitcoind on the Mac Mini:

- Have a good evening coding around, listening to some nice music and stuff
- Go to bed
- Come back to computer with cup of coffee
- Fail to bring computer back up from idle/sleep/suspend or whatever a Mac does when you leave it alone for a while
- Hard reboot it
- Rage at bitcoind casually telling me that the blockchain is corrupted and that I need to reindex

Conclusion: in the presence of coffee, leveldb spontaneously self-destructs.

Mike Hearn

legendary

Activity: 1526

Merit: 1134

Well, your last paragraph explains why I'm not myself working on pruning right now ;)

(also sipa said he'd do it).

Is the Mac Mini actually entering a sleep state of some kind? You said it happens when the machine comes back from suspend. Now your machine is just "idling". So which is it? If the computer is just running normally then that'd imply spontaneous random destruction of the db, which I've not seen myself.

davout

legendary

Activity: 1372

Merit: 1008

1davout

Quote from: Mike Hearn on July 08, 2013, 05:32:57 AM

I also encountered a corrupt LevelDB and it also appeared to be a suspend-related issue. My guess - power management on modern Macs is buggy and is likely to cause the file system to lose its integrity in some way. The fact that Macs do sometimes just die and refuse to unsuspend strongly suggests the presence of fatal errors in their implementation. Recent OS X versions are sloppy in other ways - when the laptop lid is opened and the unsuspend process begins, the first thing it does is display a screenshot of the password entry screen! Of course it's not actually usable for many seconds so any keypresses you make get thrown away. This kind of duplicitous nonsense is classic Apple. Now think - if your power management engineering team is the kind that'd make such a decision, do you trust them to get the details 100% right? I wouldn't.

All that said, running nodes on systems that come and go all the time is hardly helping the network and most users will get tired of it sucking up battery and other resources. I can't see running full nodes on laptops being popular in the long run. So fixing this doesn't seem to be very important to me; certainly it shouldn't be seen as blocking pruning. If it's robust on Linux servers, that's the most important thing.

It's not a laptop, it's a Mac Mini that simply goes idle from time to time, for example at night when unused.
And bitcoind should probably be a little more resilient than "oh really, you let your computer go idle, let's just re-download the whole chain".

If you think bitcoind should only be resilient on Debian stable in a well-connected datacenter, you're going to keep seeing the general decline in nodes that is being experienced.

If you reason this way, why would you want to implement pruning at all? After all, if bitcoind runs fine on a server with an i7, a 1 TB disk, 32 GB RAM and a 1 Gbps connection, that's the most important thing, right?

Mike Hearn

legendary

Activity: 1526

Merit: 1134

I also encountered a corrupt LevelDB and it also appeared to be a suspend-related issue. My guess - power management on modern Macs is buggy and is likely to cause the file system to lose its integrity in some way. The fact that Macs do sometimes just die and refuse to unsuspend strongly suggests the presence of fatal errors in their implementation. Recent OS X versions are sloppy in other ways - when the laptop lid is opened and the unsuspend process begins, the first thing it does is display a screenshot of the password entry screen! Of course it's not actually usable for many seconds so any keypresses you make get thrown away. This kind of duplicitous nonsense is classic Apple. Now think - if your power management engineering team is the kind that'd make such a decision, do you trust them to get the details 100% right? I wouldn't.

All that said, running nodes on systems that come and go all the time is hardly helping the network and most users will get tired of it sucking up battery and other resources. I can't see running full nodes on laptops being popular in the long run. So fixing this doesn't seem to be very important to me; certainly it shouldn't be seen as blocking pruning. If it's robust on Linux servers, that's the most important thing.

davout

legendary

Activity: 1372

Merit: 1008

1davout

Quote from: jgarzik on July 08, 2013, 12:57:48 AM

What is an OSX ML?

This.

jgarzik

legendary

Activity: 1596

Merit: 1100

Quote from: davout on July 07, 2013, 05:24:31 PM

Quote from: gmaxwell on July 07, 2013, 11:13:32 AM

Can you please disclose what OS, OS version, and Bitcoin version you're running?

Discussed this with you a few weeks ago.
Happens to me every single time OSX ML fails to come back up from suspend.
Pretty much the only reason why I stopped running a node.

What is an OSX ML?

kjj

legendary

Activity: 1302

Merit: 1026

Quote from: gmaxwell on July 07, 2013, 11:13:32 AM

Contrary to what KJJ claims— it is actually not supposed to do this

Heh. I never said that it was supposed to happen. I was just pointing out the workaround used all around the world by people with unreliable power.

Maybe we should make a sticky at the top for bug reporting best practices.

davout

legendary

Activity: 1372

Merit: 1008

1davout

Quote from: gmaxwell on July 07, 2013, 11:13:32 AM

Can you please disclose what OS, OS version, and Bitcoin version you're running?

Discussed this with you a few weeks ago.
Happens to me every single time OSX ML fails to come back up from suspend.
Pretty much the only reason why I stopped running a node.

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

I once gave you a snapshot of a naturally corrupt testnet3 DB that the official client wasn't able to continue with... and you didn't even bother to download it, did you?
And now you suddenly care...

malevolent

legendary

Activity: 3472

Merit: 1727

Quote from: gmaxwell on July 07, 2013, 11:13:32 AM

Can you please disclose what OS, OS version, and Bitcoin version you're running?

This also happened to me once on Win7 64, but that was 2 years ago with 0.3-something.

gmaxwell

staff

Activity: 4326

Merit: 8951

Can you please disclose what OS, OS version, and Bitcoin version you're running?

I've tried to reproduce unclean-shutdown corruption and, in hundreds of shutdowns on Linux, have been unable to do so.

Contrary to what KJJ claims, it is actually not supposed to do this, and at least on some systems it does not appear to (or at least does so with only negligible probability). I suspect that leveldb has some bugs on some systems/environments that degrade its durability, but with basically nothing to go on it's hard to determine why.

We absolutely _must_ get this fixed— or at least reduced to negligible probability for all users— before we can support pruning.

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

Just make a backup of your chain once in a while.
Or download the chain from some torrents, if you haven't.
Otherwise it takes ages to recover from such a loss.

Mysticsam_3579

newbie

Activity: 30

Merit: 0

Quote from: TierNolan on July 03, 2013, 06:49:42 AM

Quote from: kjj on July 03, 2013, 05:57:23 AM

Clever programming simply cannot prevent corruption during a power loss.

Atomic file operations would ensure that the disk is always in a valid state. However, it doesn't seem to be a high priority for OS designers.

It is solved. The answer is: ZFS
Here is a link: http://en.wikipedia.org/wiki/ZFS
