
Topic: Recent downtime and data loss (Read 6101 times)

hero member
Activity: 714
Merit: 503
January 23, 2015, 02:29:56 AM
#13
Thank you for your hard work getting the forum back up theymos, much appreciated!

Gotta admit, I was starting to worry with it being down for so long.
full member
Activity: 156
Merit: 100
January 23, 2015, 02:28:22 AM
#12
I was so bored with BCT down all day at work  Grin
administrator
Activity: 3962
Merit: 3184
January 23, 2015, 02:26:32 AM
#11
Was this the longest down time in the past few years?

This was the longest I've experienced: https://bitcointalksearch.org/topic/about-the-recent-attack-306936

PS: It's good to be back!
legendary
Activity: 2492
Merit: 1491
LEALANA Bitcoin Grim Reaper
January 23, 2015, 02:15:07 AM
#10
Was this the longest down time in the past few years?

Being an avid poster, I can't remember a longer one...
full member
Activity: 224
Merit: 101
January 23, 2015, 02:07:18 AM
#9
So awesome that bitcointalk is back!  Smiley *yay*

I hope you aren't running too big a sleep deficit now  Wink
legendary
Activity: 2604
Merit: 1036
January 23, 2015, 02:06:08 AM
#8
Glad to see the forum is back up and running after that downtime.
legendary
Activity: 1778
Merit: 1043
#Free market
January 23, 2015, 02:04:26 AM
#7
Thanks theymos for the information and good luck.
donator
Activity: 2352
Merit: 1060
between a rock and a block!
January 23, 2015, 01:42:57 AM
#6
Some SSD health monitoring might help... SSDs deteriorate over time...
Glad you got it restored!
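
For what it's worth, a minimal sketch of that kind of check, assuming smartmontools is installed, with placeholder device names (drives behind a hardware RAID controller may need a vendor-specific flag such as -d megaraid,N):
Code:
# Rough sketch of a periodic SSD health check. Device names are placeholders,
# and smartctl usually needs root. Drives behind a hardware RAID controller
# may need something like ["-d", "megaraid,0"] added to the command.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # placeholders
WATCHED = ("Reallocated_Sector_Ct", "Wear_Leveling_Count",
           "Media_Wearout_Indicator")

def report(device):
    out = subprocess.run(["smartctl", "-A", device],
                         stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WATCHED):
            print(device, line.strip())

for dev in DEVICES:
    report(dev)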
copper member
Activity: 2996
Merit: 2374
January 23, 2015, 01:42:44 AM
#5
On Reddit there was a discussion about why we are not using something like Amazon AWS for hosting.

Is this because we get free internet from PIA, or are there other drawbacks to using AWS versus our current setup?
staff
Activity: 3304
Merit: 4115
January 23, 2015, 01:22:20 AM
#4
It's good to be back nonetheless; keep us updated with the investigation. Minimal damage was done, and it was back up in a pretty speedy fashion (considering the nature of the downtime). Well done to you and the team.
administrator
Activity: 5222
Merit: 13032
January 23, 2015, 01:21:27 AM
#3
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?

That's what I expected to happen, but the RAID controller didn't notice that anything was wrong. I still need to figure out why.
legendary
Activity: 2296
Merit: 1031
January 23, 2015, 01:12:31 AM
#2
Good job getting it back up.  I'm used to a config where you just hot swap drives that are about to fail.  Planning any changes after this experience?
administrator
Activity: 5222
Merit: 13032
January 23, 2015, 01:03:44 AM
#1
Due to a failure of the RAID array that bitcointalk.org was running on, the database became corrupted. It was necessary to move the OS and restore the database using a daily backup. About 8 hours of data was lost (anything after about Jan 21 21:44 UTC).

I've been busy getting the forum back online, so I haven't investigated this much yet, but it may be possible for me to manually restore lost posts/PMs by searching the recovered database files for keywords that you remember. If you lost a post or PM that you feel is absolutely irreplaceable, PM me and I'll see if I can recover it.
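
In rough terms that kind of recovery is just scanning the salvaged files for a remembered byte string. A purely illustrative sketch, with a made-up directory and keyword:
Code:
# Illustrative only: scan recovered database files for a remembered phrase.
# The directory and keyword are placeholders, not the real paths.
import os

RECOVERED_DIR = "/srv/recovered-db"            # hypothetical location of salvaged files
KEYWORD = b"some phrase you remember posting"  # searched as raw bytes

for root, _dirs, files in os.walk(RECOVERED_DIR):
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:
            data = f.read()  # fine for a sketch; huge files would want chunked reads
        offset = data.find(KEYWORD)
        if offset != -1:
            print("match in %s at byte offset %d" % (path, offset))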

Everyone's drafts were also all lost. These are not backed up because they're automatically deleted after 14 days anyway.

Search is temporarily disabled because I need to regenerate the search index before it will be usable again.

There will be some periodic downtime over the next few days (a few hours in total) as we get everything reconfigured/settled. A few things might be broken. Tell me if you see any bugs.

If you paid to remove a proxyban and this was not recognized due to the downtime, email the txid to the pbbugs email address and I'll whitelist you right away.

This week's ad stats were lost. The current ads will be up for an extra long time to make up for the downtime and the lost stats.

Sorry for the inconvenience!

Technical details:

The bitcointalk.org and bitcoin.it databases were stored on a RAID 1+0 array: two RAID 1 arrays of 2 SSDs each, joined via RAID 0 (so 4 SSDs total, all the same model). We noticed yesterday that there were some minor file system errors on the bitcoin.it VM, but we took it for a fluke because there were no ongoing problems and the RAID controller reported no disk issues. A few hours later, the bitcointalk.org file system also started experiencing errors. When this was noticed, the bitcointalk.org database files were immediately moved elsewhere, but the RAID array deteriorated rapidly, and most of the database files ended up being too badly corrupted to be used. So a separate OS was set up on a different RAID array, and the database was restored using a daily backup.
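
To see why two bad disks out of four can take everything down: in RAID 1+0 the data is striped across the mirrored pairs, so each pair holds blocks that exist nowhere else. A toy model of that layout (purely illustrative; a real controller stripes fixed-size chunks, not individual blocks):
Code:
# Toy model of a 4-SSD RAID 1+0 array: two mirrored pairs joined by striping.
PAIRS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]

def copies_of(block):
    """Return the two disks holding a given logical block."""
    return PAIRS[block % len(PAIRS)]

# Blocks alternate between the pairs, so if both disks in one pair go bad,
# roughly half of all blocks have no surviving copy anywhere in the array.
for block in range(6):
    print(block, copies_of(block))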

My guess is that both of the SSDs in one of the RAID-1 sub-arrays started running out of spare sectors at around the same time. bitcoin.it runs on the same array, and it's been running low on memory for a few weeks, so its use of swap may have been what accelerated the deterioration of these SSDs. The RAID controller still reports no issues with the disks, but I don't see what else could cause this to happen to two distinct VMs. I guess the RAID controller doesn't know how to get the SMART data from these drives. (The drives are fairly old SSDs, so maybe they don't even support SMART.)

I plan on doing more investigation later to make sure that this doesn't happen again. I will probably also set up MySQL replication (or something) to prevent so much data loss in case something similar does happen again.
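
As a rough idea of what replication buys: a replica applies the master's binary log continuously, so at worst you lose whatever it hadn't applied yet, and you can watch how far behind it is. A hedged sketch of that check, assuming the mysql-connector-python package and placeholder credentials (SHOW SLAVE STATUS is the MySQL 5.x-era statement):
Code:
# Sketch of a replication-lag check against a hypothetical replica.
# Host, user, and password are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="replica.example", user="monitor",
                               password="...", database="mysql")
cur = conn.cursor(dictionary=True)
cur.execute("SHOW SLAVE STATUS")
status = cur.fetchone()
if status is None:
    print("replication is not configured on this server")
else:
    print("IO thread running:", status["Slave_IO_Running"])
    print("SQL thread running:", status["Slave_SQL_Running"])
    print("seconds behind master:", status["Seconds_Behind_Master"])
cur.close()
conn.close()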

On the bright side, the backup worked fairly smoothly. This is the first time I've had to use one of the daily backups for real restoration.