
Topic: Recent downtime and data loss - page 3. (Read 6083 times)

full member
Activity: 308
Merit: 100
I'm nothing without GOD
January 23, 2015, 09:41:32 PM
#53
That sucks, but at least it's working again.
legendary
Activity: 1484
Merit: 1001
Personal Text Space Not For Sale
January 23, 2015, 08:31:52 PM
#52
Looks like I lost four posts. My signature campaign requires me to have 165 posts, and I remember reaching that. When the forum came back online, I had 161 posts. Anyway, good to hear the forum is back online.
sr. member
Activity: 462
Merit: 250
January 23, 2015, 08:17:15 PM
#51
The forum went down right after I hit the preview button and just as I realized that my post was missing a [/QUOTE] tag. For a second or two, I got BBcode confused with HTML and thought there was a possibility that I broke the forum.

Anyway, to those who aren't sure if they have posts that are deleted or not (particularly to those who post a lot and might not remember how many posts they made prior to the data loss), go through your browser history and you'll get an idea of which posts you need to re-post.
legendary
Activity: 1988
Merit: 1012
Beyond Imagination
January 23, 2015, 04:38:40 PM
#50
This totally defeats the purpose of running RAID 10; such a failure rate is already higher than that of conventional HDDs. And I don't think RAID 0 is needed for SSDs, since they are already fast enough. Using two RAID 1 arrays to back each other up is a better solution. Anyway, this is very strange; RAID 1 should give enough warning before a total failure.
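As a minimal sketch of this point (disk names are hypothetical, but the layout mirrors the four-SSD RAID 1+0 setup described later in the thread), the failure combinations of a RAID 10 can be enumerated to show which ones are fatal:

```python
from itertools import combinations

# Hypothetical 4-drive RAID 10: two RAID 1 mirrors, striped with RAID 0.
mirrors = [("disk0", "disk1"), ("disk2", "disk3")]
disks = [d for pair in mirrors for d in pair]

def survives(failed):
    """The array survives if every mirror keeps at least one live disk."""
    return all(any(d not in failed for d in pair) for pair in mirrors)

# Any single failure is survivable...
assert all(survives({d}) for d in disks)

# ...but 2 of the 6 possible two-disk failures (both halves of the
# same mirror) destroy the whole striped array.
fatal = [set(c) for c in combinations(disks, 2) if not survives(set(c))]
print(len(fatal))  # → 2
```

The fatal pairs are exactly the two mirrors themselves, which is why identically worn mirrored SSDs failing together is the worst case for this layout.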
legendary
Activity: 1330
Merit: 1000
January 23, 2015, 01:10:19 PM
#49
legendary
Activity: 3906
Merit: 1373
January 23, 2015, 12:11:37 PM
#48

...


Backup software. Plenty about.

Thank you.    Wink
legendary
Activity: 1652
Merit: 1016
January 23, 2015, 12:03:00 PM
#47

...

The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

Smiley

Backup software. Plenty about.
legendary
Activity: 3906
Merit: 1373
January 23, 2015, 12:00:39 PM
#46

...

If something was mission critical, it wouldn't be running on a home computer in the first place.

The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

Smiley
legendary
Activity: 1652
Merit: 1016
January 23, 2015, 11:53:57 AM
#45

...

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

Smiley

If something was mission critical, it wouldn't be running on a home computer in the first place.
But since you asked: a couple of disks, software RAID, Linux, and a good backup schedule. Job done.
legendary
Activity: 1100
Merit: 1032
January 23, 2015, 11:43:49 AM
#44
Backups are now just for when the whole place burns to the ground.

Do not underestimate the Plain Old Bugs (tm), Plain Old Human Errors (tm), and the Drunk or Drugged SysAdmin (r)

Much more common than destruction by fire Tongue
legendary
Activity: 3906
Merit: 1373
January 23, 2015, 11:27:36 AM
#43

...

Then mirror the whole machine automatically with High-Availability storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

Smiley
legendary
Activity: 1736
Merit: 1023
January 23, 2015, 10:01:05 AM
#42
Thanks for posting the detailed information about what happened. From a technical perspective, it is interesting reading about how things are configured.

deepceleron, I'm actually planning to use a RAIDZ2 pool for a database server I'm working on, as it looks like a very nice solution. RAIDZ2 only tolerates two disk failures, but I like what I've read about ZFS so far. The LZ4 compression also looks pretty handy.
legendary
Activity: 1512
Merit: 1036
January 23, 2015, 09:39:30 AM
#41
What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?

What happens if I write my password down on two pieces of paper and then set both papers on fire, will I lose my password? Please don't ask such silly questions in a Theymos thread for the sake of spamming your signature.

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

There is an interesting aspect of RAID arrays when using SSDs in a mirrored configuration that can only tolerate the failure of one drive - you are writing identical data to identical drives, and should expect them to fail in identical ways, killing two drives at once.

Physical hard drives are more subject to mechanical tolerances that vary between samples. The disk coatings are not deposited equally, the bearings are not molecularly identical, and the windings in the heads and motors aren't perfect matches. Given a high-intensity, identical load on two such drives, it would be virtually impossible for them to fail at the same time.

SSDs are different. Some drives have firmware that deliberately bricks the drive, or turns it read-only, after a certain number of writes. While the actual memory cells may fail differently between drives, they wear at a predictable statistical rate, and the reserve of extra sectors (usually 2-5% of drive space) will finally be exhausted at nearly identical times given identical write patterns.
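The wear-at-identical-rates argument can be put in rough numbers. All figures below are made up for illustration (they are not measurements of the forum's drives), but they show why two identically loaded mirror halves are predicted to hit the endurance wall on the same day:

```python
# Illustrative, made-up numbers; not measurements of the forum's drives.
def wearout_days(capacity_gb, pe_cycles, write_amp, host_writes_gb_day):
    """Rough days until the NAND program/erase budget is used up."""
    endurance_gb = capacity_gb * pe_cycles        # total NAND write budget
    nand_writes = host_writes_gb_day * write_amp  # flash writes per day
    return endurance_gb / nand_writes

# Two identical drives in a mirror receive identical writes...
a = wearout_days(capacity_gb=160, pe_cycles=3000, write_amp=1.5,
                 host_writes_gb_day=200)
b = wearout_days(capacity_gb=160, pe_cycles=3000, write_amp=1.5,
                 host_writes_gb_day=200)

# ...so both halves are predicted to wear out on the same day,
# defeating the redundancy the mirror was supposed to provide.
print(round(a), round(a) == round(b))  # → 1600 True
```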

SSDs push the limit of what can be stored on silicon, so they have many layers of error correction to go along with the wear leveling. It is quite hard to get bad data back out of a well-designed Tier 1 drive.

However, failure of SSDs can also be random and capricious, especially the OCZ/Patriot/Crucial bottom-tier drives that just shit themselves for no reason. As it is unlikely the forum completely used up the wear life on its drives, random crap-drive failure or RAID controller failure is more likely.

For reference, look at the tests here: http://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes - they designed a test specifically to kill SSDs with writes and left it running 24/7 for over a year on six drives. You can see that the SMART wear indicators on most drives indicate the amount of drive life left; some lock up after the reserve is used, and some keep going into write-error territory. Only one drive died unexpectedly. SMART analysis of the forum's drives will likely show, via the reallocated sector count and wear leveling count, whether the failure was unexpected.
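Checking those two indicators is straightforward if the controller passes SMART through. A rough sketch, assuming a hypothetical `smartctl -A`-style table (the exact column layout and attribute names vary by drive vendor; the numbers here are invented):

```python
# Hypothetical smartctl -A output fragment; real layouts vary by drive.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  412
177 Wear_Leveling_Count     0x0013   004   004   005    Pre-fail  2980
"""

def parse_attrs(text):
    """Map attribute name -> normalized value, threshold, and raw count."""
    attrs = {}
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        attrs[parts[1]] = {"value": int(parts[3]), "thresh": int(parts[5]),
                           "raw": int(parts[7])}
    return attrs

attrs = parse_attrs(sample)

# A normalized VALUE at or below THRESH means the attribute has tripped:
# this (hypothetical) drive's wear-leveling reserve is effectively gone.
worn_out = (attrs["Wear_Leveling_Count"]["value"]
            <= attrs["Wear_Leveling_Count"]["thresh"])
print(worn_out)  # → True
```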

Outside of the SSDs themselves failing, most hardware RAID controllers are pretty dumb (and if you are spending less than ~$400 you are not even getting hardware RAID). They just write the same data to two drives. There is no error correction or checksumming, so they are useless against corruption, and are actually less tolerant than a single drive would be. If you want to see scary, look up the "RAID 5 write hole": basically there is no way for these RAIDs to tolerate power loss, making an on-controller battery backup super important.
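The write hole is easy to demonstrate in miniature. A toy sketch (block values are arbitrary): RAID 5 parity is the XOR of the data blocks in a stripe, and if power fails between the data write and the matching parity write, the stripe is silently inconsistent:

```python
# Toy illustration of the RAID 5 "write hole": a stripe update interrupted
# by power loss leaves data and parity inconsistent, and a dumb controller
# cannot tell which block is the wrong one.
def parity(blocks):
    """RAID 5 parity is the XOR of all data blocks in the stripe."""
    p = 0
    for b in blocks:
        p ^= b
    return p

stripe = [0b1010, 0b0110, 0b1100]  # data blocks on three disks
p = parity(stripe)                 # parity block on a fourth disk

stripe[0] = 0b1111                 # new data block is written...
# ...and power fails before the matching parity update lands.

consistent = parity(stripe) == p
print(consistent)  # → False: silent corruption waiting for the next rebuild
```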

Also, consumer hardware just plain HAS errors, and hardware RAID is not written to deal with them: http://www.zdnet.com/article/has-raid5-stopped-working/

It is also majorly important that the system runs ECC RAM, and that the RAID controller has ECC RAM too if it has cache slots.

I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS, which is no longer a bad thing: the CPU and OS have hundreds of times more processing power and RAM than RAID cards. A journal is written that can be replayed after power loss, and commits are made to disk in a way that never corrupts data. Everything on the disks is self-healing, with error-correcting checksums. The OS can talk directly to the drives and read their SMART status to understand drive state. You can run RAID on standard SATA controllers, and when your motherboard burns up, mount the drives in any other motherboard on any other controller. You don't need hot spares and lengthy rebuilds; with RAIDZ3 - three extra drives of parity - the array survives any three drive failures.
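The parity trade-off can be sketched in a few lines (disk counts here are hypothetical, not the forum's actual hardware): each level of RAIDZ parity buys one more tolerated failure at the cost of one disk's worth of capacity.

```python
# Sketch of RAIDZ parity trade-offs on a hypothetical 8-disk vdev.
def raidz(n_disks, parity):
    """Failures tolerated and usable capacity for one RAIDZ vdev."""
    assert 0 < parity < n_disks
    return {"tolerated_failures": parity, "usable_disks": n_disks - parity}

for p, name in [(1, "RAIDZ1"), (2, "RAIDZ2"), (3, "RAIDZ3")]:
    print(name, raidz(8, p))

# RAIDZ3 keeps going through any three simultaneous drive failures;
# only a fourth concurrent failure would take out the vdev.
```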

Then mirror the whole machine automatically with High-Availability storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.
legendary
Activity: 2422
Merit: 1451
Leading Crypto Sports Betting & Casino Platform
January 23, 2015, 09:20:06 AM
#40
I can't really see this being a conspiracy, tbh. Unless damning evidence is presented, I'll keep believing that it was legit downtime due to the issues theymos talked about. The fact that this has never happened to the forum before makes me trust him a bit more about this.
legendary
Activity: 1946
Merit: 1035
January 23, 2015, 09:17:01 AM
#39
Thanks for bringing the forum back online and posting all the gory details about it Smiley
legendary
Activity: 1666
Merit: 1185
dogiecoin.com
January 23, 2015, 08:54:07 AM
#38
What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?

Backup drives have no need to be quick, so they're probably high-capacity non-SSD drives in a RAID 1. And because they're much larger than the active SSD array, they can contain multiple historical backups of the same databases. At least, that's how I'd do it on a budget. Probably some periodic backups moved elsewhere as well.
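The "multiple historical backups" idea usually takes the form of a retention policy. A minimal sketch, assuming a hypothetical scheme that keeps seven daily backups plus four weekly (Sunday) backups, with the dates below chosen to match this thread's timeframe:

```python
from datetime import date, timedelta

# Hypothetical retention policy: keep the last 7 dailies and the last
# 4 weekly (Sunday) backups; everything older can be pruned.
def to_keep(backup_dates, today):
    dailies = {d for d in backup_dates if (today - d).days < 7}
    weeklies = {d for d in backup_dates
                if d.isoweekday() == 7 and (today - d).days < 28}
    return dailies | weeklies

today = date(2015, 1, 23)
backups = [today - timedelta(days=i) for i in range(40)]  # 40 daily backups
kept = to_keep(backups, today)

# 7 dailies plus 4 Sundays, one of which is already among the dailies.
print(len(kept))  # → 10
```

On big cheap mirrored HDDs, retaining ten-odd database snapshots like this costs little, which is the point being made above.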
full member
Activity: 196
Merit: 100
January 23, 2015, 07:37:28 AM
#37
What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?
legendary
Activity: 1666
Merit: 1185
dogiecoin.com
January 23, 2015, 07:09:15 AM
#36
Some SSD drive health monitoring might help... SSD drives deteriorate over time...
Glad you got it restored!

It depends on the controller and the age of the firmware whether you can see true SMART stats from the drives, or whether they're masked. My SSD RAID array (gen 1 X25-Ms) gives me goop as SMART readings, which I know are false.
sr. member
Activity: 406
Merit: 250
AltoCenter.com
January 23, 2015, 07:00:39 AM
#35
I was on the road when this happened. When I came to check my posts, I saw 9 of them had vanished. But then I saw this thread. Glad the site is back online.
hero member
Activity: 614
Merit: 500
January 23, 2015, 06:58:57 AM
#34
Is the search feature one of the problems? It's disabled now Undecided

You need to read theymos' post more carefully. Smiley
Search is temporarily disabled because I need to regenerate the search index before it will be usable again.