So, an update on last night's DDoS attack, in timeline form.
- at 04:01 UTC An email was received threatening to DoS Eligius unless a 5 BTC ransom was paid. The full message was "you pay 5 bitcoin or x.x.x.x x.x.x.x x.x.x.x go down you have 1 hour", where x.x.x.x is replaced with some mostly non-public Eligius server IPs, specifically two core bitcoind nodes and the webserver's actual IP. They didn't even provide a contact method or a way to pay the ransom (not that I would have paid it anyway). The email was filtered as spam, and I only found it after the fact (it was sent from a bogus email address under the name "Tina Turner").
- at 04:48 UTC There was a physical fiber connectivity issue at Eligius's primary data center, resulting in the loss of nearly 70% of the data center's local bandwidth capacity. Secondary connections were immediately put to use and covered the mostly off-peak load, so most services, including Eligius, were not impacted. (Spoiler: this turned out to be unrelated to the attackers.)
- at 05:16 UTC A DDoS began against Eligius. External traffic saturated the secondary fiber connections. To make matters worse, some servers in the same facility as Eligius (unrelated to us) appear to have been compromised and used to DoS Eligius from inside the data center as well, causing latency but not triggering my normal DDoS alerts, since the external DDoS was mostly UDP amplification and was filtered upstream.
- at 05:36 UTC Eligius miners, mostly unaffected by the DDoS at this point aside from some work update latency, found a block, #387439, with hash 000000000000000008e051b41e7ada11e9931153a0bb02960ebb4a9e0374e404. Eligius nodes attempted to submit this block to our primary nodes, secondary external nodes, and BlueMatt's relay network (see the submission sketch after this timeline). Due to extreme network latency within the data center, the first of these submissions was not accepted until 46 seconds later. By then the bitcoin P2P network had already seen a competing block, 000000000000000003e98e022c09c263e6b28f79cbc973a094444f649bbe4bcf, from AntPool. Eligius miners continued to mine on top of *e404, but about 10 minutes later BitFury found a block that built on top of AntPool's block, and Eligius's block was officially stale.
- at 05:57 UTC The NOC staff, dealing mostly with the fiber loss issue, isolated the compromised machines inside the facility (none of them Eligius machines) that were contributing to the DDoS attack on Eligius.
- at 06:02 UTC The external DDoS against Eligius expanded to include core data center hardware and further saturated the secondary fiber connections beyond usability.
- at 06:03 UTC Eligius's mining servers automatically switched to a tertiary link with extremely limited bandwidth when connectivity was lost. The coinbaser was disabled and block size was limited while on this connection to save bandwidth, again all automatically (a rough sketch of this failover logic also follows the timeline).
- at 06:40 UTC Data center staff had mitigated the DDoS to the point where things were mostly stable on the secondary fiber link, with patches of extreme latency as the pipe was periodically saturated by either the DDoS or normal traffic.
- at 08:03 UTC The majority of the primary fiber connection's capacity was restored via a temporary link bypassing the physically damaged section. Latency to Eligius was back to normal at this point.
- at 11:02 UTC Primary fiber link was fully restored via a temporary link with full capacity.
- at 16:30 UTC Data center NOC staff's failure analysis concluded that faulty wiring in an underground conduit near this section of fiber had caused excess heat, which damaged the primary fiber link.
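Side note for the technically curious: conceptually, the redundant submission path mentioned at 05:36 just means firing the same block at several bitcoind nodes in parallel and letting the fastest path win. Below is a minimal sketch of that idea using bitcoind's standard submitblock RPC. The node URLs and credentials are made up, this is not Eligius's actual code, and BlueMatt's relay network speaks its own protocol so it isn't shown here.

```python
# Minimal sketch of parallel block submission via bitcoind's submitblock RPC.
# Node URLs/credentials below are hypothetical; this is not Eligius's real code.
import concurrent.futures
import requests

NODES = [
    "http://user:pass@primary-node-1:8332",   # primary bitcoind (in the data center)
    "http://user:pass@primary-node-2:8332",   # primary bitcoind (in the data center)
    "http://user:pass@external-node:8332",    # secondary node outside the facility
]

def submit_block(node_url: str, block_hex: str) -> str:
    """Submit a serialized block to one bitcoind via the submitblock RPC."""
    payload = {"jsonrpc": "1.0", "id": "blocksubmit",
               "method": "submitblock", "params": [block_hex]}
    resp = requests.post(node_url, json=payload, timeout=10)
    resp.raise_for_status()
    result = resp.json()
    # submitblock returns null on success, or a rejection reason string.
    return "accepted" if result.get("result") is None else str(result["result"])

def broadcast_block(block_hex: str) -> None:
    """Fire submissions at every node in parallel, so one slow or latent path
    (e.g. a congested data center link) does not delay the others."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        futures = {pool.submit(submit_block, url, block_hex): url for url in NODES}
        for fut in concurrent.futures.as_completed(futures):
            url = futures[fut]
            try:
                print(url, fut.result())
            except Exception as exc:
                print(url, "failed:", exc)
```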
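And here is roughly what the automatic fallback at 06:03 amounts to, again just as a sketch. The gateway address, interface name, and the "low bandwidth mode" hook are hypothetical stand-ins; the real automation on Eligius's servers is more involved and tied into the pool software itself.

```python
# Conceptual sketch of link-failover automation: if the primary/secondary path
# stops responding, route out the tertiary link and tell the pool to conserve
# bandwidth. All names/addresses below are made up for illustration.
import subprocess
import time

PRIMARY_GW = "203.0.113.1"   # hypothetical gateway on the primary/secondary path
TERTIARY_DEV = "ppp0"        # hypothetical low-bandwidth tertiary interface
CHECK_INTERVAL = 5           # seconds between connectivity checks

def link_alive(gateway: str) -> bool:
    """Consider the link up if the gateway answers a single quick ping."""
    return subprocess.call(["ping", "-c", "1", "-W", "2", gateway],
                           stdout=subprocess.DEVNULL) == 0

def enter_low_bandwidth_mode() -> None:
    """Route out the tertiary link and flag the pool to conserve bandwidth
    (disable coinbaser, cap block size) -- stand-ins for the real hooks."""
    subprocess.call(["ip", "route", "replace", "default", "dev", TERTIARY_DEV])
    open("/run/pool-low-bandwidth", "w").close()  # hypothetical flag the pool watches

def main() -> None:
    degraded = False
    while True:
        if not link_alive(PRIMARY_GW) and not degraded:
            enter_low_bandwidth_mode()
            degraded = True
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```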
Initially I had thought it possible that the people threatening the attack may have been responsible for the fiber cut on the data center's primary connections. After I shared the timing of the incident and the ransom note with the NOC, they were also skeptical of the coincidental timing. It turns out the two were completely unrelated, independent issues. However, if just one or the other had happened, there would have been little to no impact to services, Eligius's or others' at the facility. But since the attackers were able to leverage a partly crippled data center when they attacked, their attack was at least partly successful. Fortunately, non-Eligius servers at the data center were only mildly affected by this whole situation, or I might be accelerating my migration to the new data center ahead of my planned timetable.
Total actual downtime for Eligius's mining servers was under 45 seconds, measured from a remote monitoring server, around 06:02 UTC when the changeover to the tertiary link happened. Unfortunately, latency was very bad in bursts between 05:16 and 08:03 UTC, as high as 1 minute at times. For the most part, miner connections were not affected aside from delayed work changes and delayed share submission responses.
The above is all for the mining servers. The web server was unavailable for several minutes at a time while the lower priority DDoS mitigation for that setup took effect.
None of the "wake wizkid057 up" alerts I had set up triggered, since in the eyes of all of my monitoring the mining servers were all online, available, and working on up-to-date work, with DDoS mitigation doing its job correctly. The switch to the tertiary link caused a brief connectivity loss, but not enough to drop all miner connections, and it wasn't quite long enough to trigger an alert either. Overall, the system performed well under the circumstances, IMO, and I'd be surprised if many other hosted services could survive such an incident. Had I been awake and available from the beginning, I don't think there would have been much else I could have done anyway.
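For what it's worth, those "wake me up" checks are conceptually just remote probes with a grace period, something like the sketch below. The hostname, port, and thresholds are made up for illustration; the point is that a sub-minute blip like the tertiary-link switch doesn't page anyone, while a sustained outage does.

```python
# Rough sketch of an external "wake the operator" check: probe the mining
# server from a remote box and only page if it stays down past a threshold.
# Host, port, and thresholds are hypothetical, not Eligius's actual values.
import socket
import time

HOST, PORT = "stratum.example.org", 3333
CHECK_EVERY = 15          # seconds between probes
ALERT_AFTER = 4           # consecutive failures (~1 minute) before paging

def port_responds(host: str, port: int, timeout: float = 5.0) -> bool:
    """A probe counts as success if the TCP port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor() -> None:
    failures = 0
    while True:
        if port_responds(HOST, PORT):
            failures = 0
        else:
            failures += 1
            if failures == ALERT_AFTER:
                print("ALERT: pool unreachable, waking the operator")  # page here
        time.sleep(CHECK_EVERY)
```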
But, in any case, I'm working on hardening the back-end servers against a similar incident and adding even more redundancy to the block submission path out of Eligius's network. Once the migration to our new data center is complete, I'll even have a 4G LTE backup link available for this purpose.
TLDR: There was a DDoS along with some data center network issues last night. All is well now. No Eligius servers were compromised.