Hello all,
A post on a different subject this time: We promised you a post-mortem on the outage last week. So here it is!
Last week there was a DDoS attack that lasted for more than a day (Jan 17 18:00 to Jan 18 23:00 UTC). This post is to explain what happened, the way we handled it, and what the future looks like.
Without any prior experience in handling this kind of attack, we were essentially powerless to do anything about it at first. That did not stop us from trying, but the pool only resurfaced on the internet when the attack eventually ceased.
Now that we've got some experience and have done our research, we can say we know a bit more about such attacks. That knowledge is not entirely comforting. As it turns out, there are things that can be done, but beyond a certain size an attack will just knock out your service completely.
There have been a number of insightful comments, especially from Brian and Eleuthria, that helped us and our users understand the situation and stay realistic about possible solutions while we were busy trying to solve it. Thanks for chiming in, everyone!
Not even Deepbit will stop a DDoS and they are a much bigger pool with a ton more money pouring in (3% fee, 3500 GH/s, ≈$15,000 per month). How about BTC Guild or Slush? They choke within minutes of an attack and will stay down at the discretion of the attacker. It's not realistic to demand expensive protection from pool operators for these attacks. BTC Guild was literally blackmailed to keep a botnet on their server. When Eleuthria finally banned him the pool was taken down within hours and didn't come back for days.
No pool can offer DDoS protection, not even Deepbit. The best they can do is throw up spare servers and hope the DDoS doesn't follow them.
[...] To expect a pool to have DDoS mitigation that can stop the botnet that hit BTC Guild, Deepbit, and Slush in the past, is insane. There is no way a bitcoin pool can afford that level of service.
I don't know if it's the same one hitting ABCPool, or if it's a smaller fraction, but if it's the same one, no host on the planet is going to be able to keep a bitcoin pool online during it. Bitcoin mining itself is VERY DDoS-like. You'd end up catching the majority of legit traffic as false positives. At best you might keep the website portion online to let people know that the pool is down.
Couldn't you just whitelist all the "known" (or at least, say, the "big" known) IP addresses, and block everything else?
That would only work if you're at an ISP that will allow you to add a whitelist at their perimeter. If the DDoSer has enough zombies, they will still take you offline because they can flood the switches in front of your server before a whitelist takes effect.
The largest attacks back in July were over 10 gigabits of traffic. There are very few datacenters that can absorb that when it's all headed towards a single internal IP, and even fewer datacenters that will actually allow that kind of traffic to come in without just blackholing you temporarily.
What we were seeing were traffic levels at least hundreds of times the amount we normally handle, coming from a large number of different addresses. What happens in such a case is that the internet connection becomes completely saturated, causing random traffic to be dropped before it even reaches our servers. With no full request ever making it in, the server load is paradoxically zero.
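To put rough numbers on that, here is a minimal sketch of the saturation effect. It assumes a deliberately simplified model in which a full link drops packets uniformly at random and independently of each other. The capacity, the traffic figures and the packets-per-request count are made up for illustration; they are not measurements from the attack.

```python
def delivered_fraction(capacity, legit, attack):
    """Fraction of packets that get through a link under uniform random drops."""
    offered = legit + attack
    if offered <= capacity:
        return 1.0  # link is not saturated, nothing is dropped
    return capacity / offered


def full_request_probability(capacity, legit, attack, packets_per_request):
    """Chance that every packet of one request survives (independent drops assumed)."""
    return delivered_fraction(capacity, legit, attack) ** packets_per_request


# Hypothetical numbers: a link sized for 2x normal traffic, attack at 300x normal.
normal = 1.0
capacity = 2.0 * normal
attack = 300.0 * normal

p_packet = delivered_fraction(capacity, normal, attack)
p_request = full_request_probability(capacity, normal, attack, packets_per_request=6)
print(f"per-packet survival: {p_packet:.4f}")
print(f"full 6-packet request survival: {p_request:.2e}")
```

In this model only a small fraction of packets survives, and a request that needs several packets to arrive intact almost never completes, which is consistent with a server sitting idle behind a flooded link.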
Whitelisting known good addresses using the server firewall is useless, because the traffic never even arrives at the server: It's pushed out of the way by all the unwanted traffic. So you'd have to find a place further away from the server where the pipes are still thick, and do the filtering there.
Amazon provides a few limited ways to do filtering, and there are several badly documented practical obstacles you run into when using their methods to combat heavy traffic.
So what have we changed to combat possible future attacks? We don't want to give future attackers too much ammunition, but some of the improvements are:
* Load balancers for all traffic.
* Other changes in network setup.
* Reporting and blocking of inefficient workers.
* Longpoller prioritization based on efficiency (this will also help prevent stales during normal operation!)
NB: in this context, efficiency is defined as the ratio of shares received per work unit handed out.
An interesting thing to note is that blocking will only go into effect whenever the load becomes extreme.
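As a rough illustration of how the efficiency ratio could drive both longpoll ordering and blocking, here is a minimal sketch. It is not the pool's actual code: the `Worker` class, the function names and all thresholds (`load_threshold`, `min_efficiency`, `min_work_units`) are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    work_units_sent: int   # work units handed out to this worker
    shares_received: int   # shares the worker submitted back

    @property
    def efficiency(self) -> float:
        """Shares received per work unit handed out (0.0 if no work sent yet)."""
        if self.work_units_sent == 0:
            return 0.0
        return self.shares_received / self.work_units_sent


def longpoll_order(workers):
    """Order workers so the most efficient ones are notified first."""
    return sorted(workers, key=lambda w: w.efficiency, reverse=True)


def workers_to_block(workers, load, load_threshold=0.9,
                     min_efficiency=0.05, min_work_units=100):
    """Return inefficient workers, but only when the load is extreme.

    All thresholds are hypothetical. Workers with little history are
    never blocked, so a freshly connected miner is not penalised.
    """
    if load < load_threshold:
        return []
    return [w for w in workers
            if w.work_units_sent >= min_work_units
            and w.efficiency < min_efficiency]


workers = [
    Worker("bob", 1000, 400),
    Worker("flood", 5000, 3),    # requests lots of work, returns almost nothing
    Worker("alice", 1000, 950),
]
print([w.name for w in longpoll_order(workers)])
print([w.name for w in workers_to_block(workers, load=0.5)])
print([w.name for w in workers_to_block(workers, load=0.95)])
```

Under normal load nobody is blocked and the ratio only affects who hears about new work first; the block list becomes non-empty only once the load crosses the (assumed) threshold.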
With these features deployed we hope we will be better prepared the next time an attack happens, although we won't know for sure until then.
In the meantime, enjoy the renewed stability of ABCPool.
Happy hashing, everyone!
MC