
Topic: Recent downtime and data loss - page 3. (Read 6083 times)

full member
Activity: 308
Merit: 100
I'm nothing without GOD
January 23, 2015, 09:41:32 PM
#53
That sucks, but at least it's working again.
legendary
Activity: 1484
Merit: 1001
Personal Text Space Not For Sale
January 23, 2015, 08:31:52 PM
#52
Looks like I lost four posts. My signature campaign requires me to have 165 posts, and I remember reaching that. When the forum came back online, I had 161 posts. Anyway, good to hear the forum is back online.
sr. member
Activity: 462
Merit: 250
January 23, 2015, 08:17:15 PM
#51
The forum went down right after I hit the preview button and just as I realized that my post was missing a [/QUOTE] tag. For a second or two, I got BBcode confused with HTML and thought there was a possibility that I broke the forum.

Anyway, to those who aren't sure if they have posts that are deleted or not (particularly to those who post a lot and might not remember how many posts they made prior to the data loss), go through your browser history and you'll get an idea of which posts you need to re-post.
legendary
Activity: 1988
Merit: 1012
Beyond Imagination
January 23, 2015, 04:38:40 PM
#50
This totally defeats the purpose of running RAID 10; such a failure rate is already higher than that of conventional HDDs. And I don't think RAID 0 is needed for SSDs, since they are already fast enough. Using two RAID 1 arrays to back each other up is a better solution. Anyway, this is very strange; RAID 1 should give enough warning before a total failure.
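As a minimal sketch of this point (disk names are hypothetical, but the layout mirrors the four-SSD RAID 1+0 setup described later in the thread), the failure combinations of a RAID 10 can be enumerated to show which ones are fatal:

```python
from itertools import combinations

# Hypothetical 4-drive RAID 10: two RAID 1 mirrors, striped with RAID 0.
mirrors = [("disk0", "disk1"), ("disk2", "disk3")]
disks = [d for pair in mirrors for d in pair]

def survives(failed):
    """The array survives if every mirror keeps at least one live disk."""
    return all(any(d not in failed for d in pair) for pair in mirrors)

# Any single failure is survivable...
assert all(survives({d}) for d in disks)

# ...but 2 of the 6 possible two-disk failures (both halves of the
# same mirror) destroy the whole striped array.
fatal = [set(c) for c in combinations(disks, 2) if not survives(set(c))]
print(len(fatal))  # → 2
```

The fatal pairs are exactly the two mirrors themselves, which is why identically worn mirrored SSDs failing together is the worst case for this layout.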
legendary
Activity: 1330
Merit: 1000
January 23, 2015, 01:10:19 PM
#49
legendary
Activity: 3906
Merit: 1373
January 23, 2015, 12:11:37 PM
#48

...


Backup software. Plenty about.

Thank you.    Wink
legendary
Activity: 1652
Merit: 1016
January 23, 2015, 12:03:00 PM
#47

...

The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

Smiley

Backup software. Plenty about.
legendary
Activity: 3906
Merit: 1373
January 23, 2015, 12:00:39 PM
#46

...

If something was mission critical, it wouldn't be running on a home computer in the first place.

The whole internet is important to me. My home computer is important to me. Without my home computer, I wouldn't be able to access the Internet as easily.

I use my home computer for other things. I empathize with people in their love/hate relationships with their own computers. Yet mine is mission critical to me.

Smiley
legendary
Activity: 1652
Merit: 1016
January 23, 2015, 11:53:57 AM
#45

...

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

Smiley

If something was mission critical, it wouldn't be running on a home computer in the first place.
But since you asked: a couple of disks, software RAID, Linux, and a good backup schedule. Job done.
legendary
Activity: 1100
Merit: 1032
January 23, 2015, 11:43:49 AM
#44
Backups are now just for when the whole place burns to the ground.

Do not underestimate the Plain Old Bugs (tm), Plain Old Human Errors (tm), and the Drunk or Drugged SysAdmin (r)

Much more common than destruction by fire Tongue
legendary
Activity: 3906
Merit: 1373
January 23, 2015, 11:27:36 AM
#43

...

Then mirror the whole machine automatically with High-Availability storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.

With an operation like BitcoinTalk, or any other well-funded operation, this might be doable. But can the little guy (gal) afford this for their home computer? What exists for the little people?

Smiley
legendary
Activity: 1736
Merit: 1023
January 23, 2015, 10:01:05 AM
#42
Thanks for posting the detailed information about what happened. From a technical perspective, it is interesting reading about how things are configured.

deepceleron, I'm actually planning to use a RAIDZ2 pool for a database server I'm working on, as it looks like a very nice solution. RAIDZ2 only tolerates two disk failures, but I like what I've read about ZFS so far. The LZ4 compression also looks pretty handy.
legendary
Activity: 1512
Merit: 1036
January 23, 2015, 09:39:30 AM
#41
What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?

What happens if I write my password down on two pieces of paper and then set both papers on fire, will I lose my password? Please don't ask such silly questions in a Theymos thread for the sake of spamming your signature.

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

There is an interesting aspect of RAID arrays when using SSDs in a mirrored configuration that can only tolerate the failure of one drive - you are writing identical data to identical drives, and should expect them to fail in identical ways, killing two drives at once.

Physical hard drives are more subject to mechanical tolerances that vary between samples. The disk coatings are not deposited equally, the bearings are not molecularly identical, and the windings in the heads and motors aren't perfect matches. Given a high-intensity, identical load on two such drives, it would be virtually impossible for them to fail at the same time.

SSDs are different. Some drives have firmware that deliberately bricks the drive, or turns it read-only, after a certain number of writes. While the actual memory cells may fail differently between drives, they wear at a predictable statistical rate, and the reserve of extra sectors (usually 2-5% of drive space) will finally be exhausted at nearly identical times given identical write patterns.
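The wear-at-identical-rates argument can be put in rough numbers. All figures below are made up for illustration (they are not measurements of the forum's drives), but they show why two identically loaded mirror halves are predicted to hit the endurance wall on the same day:

```python
# Illustrative, made-up numbers; not measurements of the forum's drives.
def wearout_days(capacity_gb, pe_cycles, write_amp, host_writes_gb_day):
    """Rough days until the NAND program/erase budget is used up."""
    endurance_gb = capacity_gb * pe_cycles        # total NAND write budget
    nand_writes = host_writes_gb_day * write_amp  # flash writes per day
    return endurance_gb / nand_writes

# Two identical drives in a mirror receive identical writes...
a = wearout_days(capacity_gb=160, pe_cycles=3000, write_amp=1.5,
                 host_writes_gb_day=200)
b = wearout_days(capacity_gb=160, pe_cycles=3000, write_amp=1.5,
                 host_writes_gb_day=200)

# ...so both halves are predicted to wear out on the same day,
# defeating the redundancy the mirror was supposed to provide.
print(round(a), round(a) == round(b))  # → 1600 True
```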

SSDs push the limit of what can be stored on silicon, so they have many layers of error correction to go along with the wear leveling. It is quite hard to get bad data back out of a well-designed Tier 1 drive.

However, failure of SSDs can also be random and capricious, especially the OCZ/Patriot/Crucial bottom-tier drives that just shit themselves for no reason. As it is unlikely the forum completely used up the wear life on its drives, random crap-drive failure or RAID controller failure is more likely.

For reference, look at the tests here: http://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes - they designed a test specifically to kill SSDs with writes and left it running 24/7 for over a year on six drives. You can see that the SMART wear indicators on most drives indicate the amount of drive life left; some lock up after the reserve is used, and some keep going into write-error territory. Only one drive died unexpectedly. SMART analysis of the forum's drives will likely show, via the reallocated sector count and wear leveling count, whether the failure was unexpected.
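Checking those two indicators is straightforward if the controller passes SMART through. A rough sketch, assuming a hypothetical `smartctl -A`-style table (the exact column layout and attribute names vary by drive vendor; the numbers here are invented):

```python
# Hypothetical smartctl -A output fragment; real layouts vary by drive.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  412
177 Wear_Leveling_Count     0x0013   004   004   005    Pre-fail  2980
"""

def parse_attrs(text):
    """Map attribute name -> normalized value, threshold, and raw count."""
    attrs = {}
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        attrs[parts[1]] = {"value": int(parts[3]), "thresh": int(parts[5]),
                           "raw": int(parts[7])}
    return attrs

attrs = parse_attrs(sample)

# A normalized VALUE at or below THRESH means the attribute has tripped:
# this (hypothetical) drive's wear-leveling reserve is effectively gone.
worn_out = (attrs["Wear_Leveling_Count"]["value"]
            <= attrs["Wear_Leveling_Count"]["thresh"])
print(worn_out)  # → True
```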

Outside of the SSDs themselves failing, most hardware RAID controllers are pretty dumb (and if you are spending less than ~$400 you are not even getting hardware RAID). They just write the same data to two drives. There is no error correction or checksumming, so they are useless against corruption, and are actually less tolerant than a single drive would be. If you want to see scary, look up the "RAID 5 write hole": basically there is no way for these RAIDs to tolerate power loss, making an on-controller battery backup super important.
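The write hole is easy to demonstrate in miniature. A toy sketch (block values are arbitrary): RAID 5 parity is the XOR of the data blocks in a stripe, and if power fails between the data write and the matching parity write, the stripe is silently inconsistent:

```python
# Toy illustration of the RAID 5 "write hole": a stripe update interrupted
# by power loss leaves data and parity inconsistent, and a dumb controller
# cannot tell which block is the wrong one.
def parity(blocks):
    """RAID 5 parity is the XOR of all data blocks in the stripe."""
    p = 0
    for b in blocks:
        p ^= b
    return p

stripe = [0b1010, 0b0110, 0b1100]  # data blocks on three disks
p = parity(stripe)                 # parity block on a fourth disk

stripe[0] = 0b1111                 # new data block is written...
# ...and power fails before the matching parity update lands.

consistent = parity(stripe) == p
print(consistent)  # → False: silent corruption waiting for the next rebuild
```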

Also, consumer hardware just plain HAS errors, and hardware RAID is not written to deal with them: http://www.zdnet.com/article/has-raid5-stopped-working/

It is also majorly important that the system runs ECC RAM, and that the RAID controller has ECC RAM too if it has cache slots.

I am getting much more behind ZFS RAID. It is very impressive. It is software RAID run by the OS, which is no longer a bad thing: the CPU and OS have hundreds of times more processing power and RAM than RAID cards. A journal is written that can be replayed after power loss, and commits are made to disk in a way that never corrupts data. Everything on the disks is self-healing, with error-correcting checksums. The OS can talk directly to the drives and read their SMART status to understand drive state. You can run RAID on standard SATA controllers, and when your motherboard burns up, mount the drives in any other motherboard on any other controller. You don't need hot spares and lengthy rebuilds; with RAIDZ3 - three extra drives of parity - the array survives any three drive failures.
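The parity trade-off can be sketched in a few lines (disk counts here are hypothetical, not the forum's actual hardware): each level of RAIDZ parity buys one more tolerated failure at the cost of one disk's worth of capacity.

```python
# Sketch of RAIDZ parity trade-offs on a hypothetical 8-disk vdev.
def raidz(n_disks, parity):
    """Failures tolerated and usable capacity for one RAIDZ vdev."""
    assert 0 < parity < n_disks
    return {"tolerated_failures": parity, "usable_disks": n_disks - parity}

for p, name in [(1, "RAIDZ1"), (2, "RAIDZ2"), (3, "RAIDZ3")]:
    print(name, raidz(8, p))

# RAIDZ3 keeps going through any three simultaneous drive failures;
# only a fourth concurrent failure would take out the vdev.
```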

Then mirror the whole machine automatically with High-Availability storage (HAST) and CARP in BSD. Backups are now just for when the whole place burns to the ground.
legendary
Activity: 2422
Merit: 1451
Leading Crypto Sports Betting & Casino Platform
January 23, 2015, 09:20:06 AM
#40
I can't really see this being a conspiracy, tbh. Unless damning evidence is presented, I'll keep believing that it was legit downtime due to the issues theymos talked about. The fact that this has never happened to the forum before makes me trust him a bit more about this.
legendary
Activity: 1946
Merit: 1035
January 23, 2015, 09:17:01 AM
#39
Thanks for bringing the forum back online and posting all the gory details about it Smiley
legendary
Activity: 1666
Merit: 1185
dogiecoin.com
January 23, 2015, 08:54:07 AM
#38
What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?

Backup drives have no need to be quick, so they're probably high-capacity non-SSD drives in a RAID 1. And because they're much larger than the active SSD array, they can contain multiple historical backups of the same databases. At least, that's how I'd do it on a budget. Probably some periodic backups moved elsewhere as well.
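The "multiple historical backups" idea usually takes the form of a retention policy. A minimal sketch, assuming a hypothetical scheme that keeps seven daily backups plus four weekly (Sunday) backups, with the dates below chosen to match this thread's timeframe:

```python
from datetime import date, timedelta

# Hypothetical retention policy: keep the last 7 dailies and the last
# 4 weekly (Sunday) backups; everything older can be pruned.
def to_keep(backup_dates, today):
    dailies = {d for d in backup_dates if (today - d).days < 7}
    weeklies = {d for d in backup_dates
                if d.isoweekday() == 7 and (today - d).days < 28}
    return dailies | weeklies

today = date(2015, 1, 23)
backups = [today - timedelta(days=i) for i in range(40)]  # 40 daily backups
kept = to_keep(backups, today)

# 7 dailies plus 4 Sundays, one of which is already among the dailies.
print(len(kept))  # → 10
```

On big cheap mirrored HDDs, retaining ten-odd database snapshots like this costs little, which is the point being made above.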
full member
Activity: 196
Merit: 100
January 23, 2015, 07:37:28 AM
#37
What happens if you keep the backup on a ssd too and that ssd also gets corrupted?
Will this lead to the entire forum getting wiped out?
legendary
Activity: 1666
Merit: 1185
dogiecoin.com
January 23, 2015, 07:09:15 AM
#36
Some SSD drive health monitoring might help... SSD drives deteriorate over time...
Glad you got it restored!

It depends on the controller and the age of the firmware whether you can see true SMART stats from the drives, or whether they're masked. My SSD RAID array (gen 1 X25-Ms) gives me goop as SMART readings, which I know are false.
sr. member
Activity: 406
Merit: 250
AltoCenter.com
January 23, 2015, 07:00:39 AM
#35
I was on the road when this happened. When I came to check my posts, I saw 9 of them had vanished. But then I saw this thread. Glad the site is back online.
hero member
Activity: 614
Merit: 500
January 23, 2015, 06:58:57 AM
#34
Is the search feature one of the problems? It's disabled now Undecided

You need to read theymos' post more carefully. Smiley
Search is temporarily disabled because I need to regenerate the search index before it will be usable again.