Eagerly awaiting full status report
What happened?
On Saturday evening (UTC), the bitcoin daemons on two independent pool servers ate all available memory (RAM and swap) and fell into massive I/O operations, which made the servers extremely slow. The central server switched traffic to the last machine, but that machine was unable to handle all of it and some RPC calls started to time out. The miners getting RPC timeouts then probably started to ask the server even more frequently (miners don't implement any "polite waiting" in their code and send a new RPC call immediately after an error or timeout). Those thousands of requests per second were the last piece needed to bring down the last working machine.
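To illustrate the "polite waiting" that miners currently lack, here is a minimal sketch of exponential backoff with jitter around an RPC call. It is not taken from any real miner: `call_with_backoff`, `backoff_delays` and all default values are hypothetical, and `rpc_call` stands for whatever function the miner uses to fetch work.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.5, rng=random.random):
    """Yield wait times that grow geometrically up to `cap` seconds.

    Each delay is shortened by a random fraction (up to `jitter`) so that
    thousands of miners do not all retry at the same moment.
    """
    delay = base
    while True:
        yield delay * (1.0 - jitter * rng())
        delay = min(cap, delay * factor)


def call_with_backoff(rpc_call, max_attempts=5, sleep=time.sleep, delays=None):
    """Call `rpc_call()` and wait between failed attempts instead of retrying at once."""
    if delays is None:
        delays = backoff_delays()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return rpc_call()
        except OSError as exc:  # covers connection errors and socket timeouts
            last_error = exc
            if attempt + 1 < max_attempts:
                sleep(next(delays))
    raise last_error
```

With defaults like these, a miner that hits a timeout waits roughly 1, 2, 4, ... seconds between retries instead of sending a new request immediately, which keeps an already overloaded server from being flooded.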
Why so many invalid blocks?
The central server then tried to balance traffic between all three overloaded servers, which led to a partially working pool that generated invalid blocks. This was caused by huge I/O latencies on the first two crashed machines: while bitcoind is waiting for I/O operations (usually when processing a new bitcoin block), it still accepts getwork submits, but the submitted work is compared against an outdated blockchain.
Why did bitcoind initially crash?
I have never seen this before, which is also the reason why there wasn't any watchdog or other mechanism to recover the pool from this type of issue. Restarting the bitcoind instances when they start to eat too much memory *should* be enough for the future, but the threshold for doing that automatically is of course hard to test.
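A memory watchdog of this kind could look roughly like the sketch below. It is only an illustration under several assumptions: it runs on Linux and reads `VmRSS` and `VmSwap` from `/proc/<pid>/status`, and `get_pid`, `restart`, the memory limit and the "three checks in a row" rule are all hypothetical placeholders, not the pool's actual setup.

```python
import time


def parse_mem_kib(status_text):
    """Sum VmRSS and VmSwap (in KiB) from /proc/<pid>/status text; None if neither is present."""
    total, found = 0, False
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "VmSwap"):
            total += int(rest.split()[0])
            found = True
    return total if found else None


def read_mem_kib(pid):
    """Return the memory use of `pid` in KiB, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_mem_kib(f.read())
    except OSError:
        return None


def should_restart(mem_kib, limit_kib, strikes, needed=3):
    """Count consecutive over-limit readings; return (restart_now, new_strikes)."""
    strikes = strikes + 1 if mem_kib is not None and mem_kib > limit_kib else 0
    return strikes >= needed, strikes


def watch(get_pid, restart, limit_kib, interval=30.0):
    """Poll forever; call the user-supplied `restart()` after repeated over-limit readings."""
    strikes = 0
    while True:
        fire, strikes = should_restart(read_mem_kib(get_pid()), limit_kib, strikes)
        if fire:
            restart()
            strikes = 0
        time.sleep(interval)
```

Requiring several over-limit readings in a row is one way to soften the threshold problem: a short spike while a new block is processed does not trigger a restart, but sustained growth does.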
Was it an attack?
I have no evidence for the following claims, but they make sense to me: I heard some rumors on the forum/IRC that it is possible to overload bitcoind by sending malicious messages to its P2P port. I can imagine an attacker waiting until I was offline and then starting such an attack. The fix was pretty easy: a bitcoind restart helped, so an attack while I'm online could be recovered from almost instantly. I also know that the attacker behind the previous DoS is on the #bitcoin-dev channel, so detecting that I'm offline is pretty easy (and I also happened to mention that I was going on a few days' holiday). I'm usually online all the time; this was the second weekend in a row since December 2010 (pool start) when I was offline without any chance to reach an Internet connection.
Lessons learned
a) Don't take a holiday.
b) If I do take a holiday, don't tell anybody.
c) Try to set up a watchdog that restarts bitcoind when it starts to use too much memory.
Finally, I'm very sorry for the trouble. Everything is resolved now, the pool hashrate is slowly climbing up again, and I'll try to set up such a watchdog before I go offline next time.