
Topic: Recent downtime and data loss (Read 6101 times)

hero member
Activity: 714
Merit: 503
January 23, 2015, 02:29:56 AM
#13
Thank you for your hard work getting the forum back up theymos, much appreciated!

Gotta admit, I was starting to worry with it being down for so long.
full member
Activity: 156
Merit: 100
January 23, 2015, 02:28:22 AM
#12
I was so bored with BCT down all day at work  Grin
administrator
Activity: 3962
Merit: 3184
January 23, 2015, 02:26:32 AM
#11
Was this the longest down time in the past few years?

This was the longest I've experienced: https://bitcointalksearch.org/topic/about-the-recent-attack-306936

PS: It's good to be back!
legendary
Activity: 2492
Merit: 1491
LEALANA Bitcoin Grim Reaper
January 23, 2015, 02:15:07 AM
#10
Was this the longest down time in the past few years?

Being an avid poster, I can't remember a longer one...
full member
Activity: 224
Merit: 101
January 23, 2015, 02:07:18 AM
#9
So awesome that bitcointalk is back!  Smiley *yay*

I hope you aren't running too big a sleep deficit now  Wink
legendary
Activity: 2604
Merit: 1036
January 23, 2015, 02:06:08 AM
#8
Glad to see the forum is back up and running after that downtime.
legendary
Activity: 1778
Merit: 1043
#Free market
January 23, 2015, 02:04:26 AM
#7
Thanks theymos for the information and good luck.
donator
Activity: 2352
Merit: 1060
between a rock and a block!
January 23, 2015, 01:42:57 AM
#6
Some SSD health monitoring might help... SSDs deteriorate over time...
Glad you got it restored!
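
For what it's worth, a minimal sketch of that kind of check, assuming smartmontools is installed, with placeholder device names (drives behind a hardware RAID controller may need a vendor-specific flag such as -d megaraid,N):
Code:
# Rough sketch of a periodic SSD health check. Device names are placeholders,
# and smartctl usually needs root. Drives behind a hardware RAID controller
# may need something like ["-d", "megaraid,0"] added to the command.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # placeholders
WATCHED = ("Reallocated_Sector_Ct", "Wear_Leveling_Count",
           "Media_Wearout_Indicator")

def report(device):
    out = subprocess.run(["smartctl", "-A", device],
                         stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WATCHED):
            print(device, line.strip())

for dev in DEVICES:
    report(dev)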
copper member
Activity: 2996
Merit: 2374
January 23, 2015, 01:42:44 AM
#5
On Reddit there was a discussion about why we are not using something like Amazon AWS for hosting.

Is this because we get free internet from PIA, or are there other drawbacks to using AWS versus our current setup?
staff
Activity: 3304
Merit: 4115
January 23, 2015, 01:22:20 AM
#4
It's good to be back nonetheless; keep us updated with the investigation. Minimal damage was done, and it was back up in a pretty speedy fashion (considering the nature of the downtime). Well done to you and the team.
administrator
Activity: 5222
Merit: 13032
January 23, 2015, 01:21:27 AM
#3
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.
legendary
Activity: 2296
Merit: 1031
January 23, 2015, 01:12:31 AM
#2
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?
administrator
Activity: 5222
Merit: 13032
January 23, 2015, 01:03:44 AM
#1
Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).

I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.
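
In rough terms that kind of recovery is just scanning the salvaged files for a remembered byte string. A purely illustrative sketch, with a made-up directory and keyword:
Code:
# Illustrative only: scan recovered database files for a remembered phrase.
# The directory and keyword are placeholders, not the real paths.
import os

RECOVERED_DIR = "/srv/recovered-db"            # hypothetical location of salvaged files
KEYWORD = b"some phrase you remember posting"  # searched as raw bytes

for root, _dirs, files in os.walk(RECOVERED_DIR):
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:
            data = f.read()  # fine for a sketch; huge files would want chunked reads
        offset = data.find(KEYWORD)
        if offset != -1:
            print("match in %s at byte offset %d" % (path, offset))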

Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.

Search is temporarily disabled because I need to regenerate the search index before it will be usable again.

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.

This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.

Sorry for the inconvenience!

Technical details:

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.
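
To see why two bad disks out of four can take everything down: in RAID 1+0 the data is striped across the mirrored pairs, so each pair holds blocks that exist nowhere else. A toy model of that layout (purely illustrative; a real controller stripes fixed-size chunks, not individual blocks):
Code:
# Toy model of a 4-SSD RAID 1+0 array: two mirrored pairs joined by striping.
PAIRS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]

def copies_of(block):
    """Return the two disks holding a given logical block."""
    return PAIRS[block % len(PAIRS)]

# Blocks alternate between the pairs, so if both disks in one pair go bad,
# roughly half of all blocks have no surviving copy anywhere in the array.
for block in range(6):
    print(block, copies_of(block))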

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.
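
As a rough idea of what replication buys: a replica applies the master's binary log continuously, so at worst you lose whatever it hadn't applied yet, and you can watch how far behind it is. A hedged sketch of that check, assuming the mysql-connector-python package and placeholder credentials (SHOW SLAVE STATUS is the MySQL 5.x-era statement):
Code:
# Sketch of a replication-lag check against a hypothetical replica.
# Host, user, and password are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="replica.example", user="monitor",
                               password="...", database="mysql")
cur = conn.cursor(dictionary=True)
cur.execute("SHOW SLAVE STATUS")
status = cur.fetchone()
if status is None:
    print("replication is not configured on this server")
else:
    print("IO thread running:", status["Slave_IO_Running"])
    print("SQL thread running:", status["Slave_SQL_Running"])
    print("seconds behind master:", status["Seconds_Behind_Master"])
cur.close()
conn.close()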

On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.