I ran into another bug. This one could be fatal to p2pool if the network hashrate ever falls abruptly and severely (e.g. by more than 10x), as it did when I switched over to my new fork. The problem stems from this code:
timestamp=math.clip(desired_timestamp, (
    (previous_share.timestamp + net.SHARE_PERIOD) - (net.SHARE_PERIOD - 1), # = previous_share.timestamp + 1
    (previous_share.timestamp + net.SHARE_PERIOD) + (net.SHARE_PERIOD - 1), # = previous_share.timestamp + 2*net.SHARE_PERIOD - 1
)) if previous_share is not None else desired_timestamp,
Basically, the share timestamp is allowed to increase by no more than 61 seconds per share. If the hashrate suddenly falls by more than 2x, the average share will take more than 61 seconds of real time and will have its timestamp clipped. Ultimately, this can result in anomalous minimum share difficulty calculations.
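To make the bounds concrete, here is a minimal sketch of the clipping rule on its own. `clip_timestamp` is a hypothetical standalone helper written for this post, not p2pool's own function, and `share_period=30` is an assumed value for illustration. With that assumption the cap works out to 59 seconds, close to the figure quoted above; the real cap depends on `net.SHARE_PERIOD`.

```python
def clip_timestamp(desired_timestamp, previous_timestamp, share_period=30):
    """Clamp a share timestamp the same way the quoted snippet does.

    Hypothetical helper for illustration; share_period=30 is an assumption.
    """
    if previous_timestamp is None:
        # First share in the chain: nothing to clip against.
        return desired_timestamp
    lowest = previous_timestamp + 1                      # must move forward by at least 1 s
    highest = previous_timestamp + 2 * share_period - 1  # and by at most 2*period - 1 s
    return max(lowest, min(desired_timestamp, highest))


previous = 1000000
# A share found 20 s after the previous one keeps its real timestamp.
print(clip_timestamp(previous + 20, previous) - previous)   # 20
# A share found 300 s later is clipped to the cap.
print(clip_timestamp(previous + 300, previous) - previous)  # 59
# A timestamp that goes backwards is pushed to previous + 1.
print(clip_timestamp(previous - 5, previous) - previous)    # 1
```

Anything slower than the cap loses the difference, and that lost time is what piles up into the backlog.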
What happens is that you get a time backlog, and every share gets a timestamp 61 seconds after the previous one regardless of how long it actually took. This means that the next share will have lower difficulty than the previous one no matter what, even if it actually took 1 second. As each share has lower difficulty but still apparently takes 61 seconds, the estimated pool hashrate drops even further, which causes the difficulty to drop exponentially until p2pool can no longer process shares as fast as they're submitted. Then everything goes to hell and you start getting 100 DOA shares for every valid one. In my case, it seems to have resulted in share difficulties around 600 before things crashed, which meant several shares were being found every second. For comparison, the current difficulty on the legacy chain (the one you guys are all using) is around 4.8 million.
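The backlog effect can be reproduced with a few lines. This is only a toy sketch: `simulate_backlog` is a hypothetical function, the share period of 30 seconds is assumed, and it deliberately leaves out p2pool's difficulty retargeting. It just tracks how far the chain's timestamps lag behind real time.

```python
def simulate_backlog(real_intervals, share_period=30):
    """Apply the clipping rule to a sequence of real share intervals (seconds).

    Toy model, not p2pool code. Returns one (real_time, timestamp, lag)
    tuple per share, where lag = real_time - timestamp.
    """
    real_time = 0
    timestamp = 0
    rows = []
    for interval in real_intervals:
        real_time += interval
        lowest = timestamp + 1
        highest = timestamp + 2 * share_period - 1
        # The miner wants to stamp the share with the real time, but is clipped.
        timestamp = max(lowest, min(real_time, highest))
        rows.append((real_time, timestamp, real_time - timestamp))
    return rows


# Three slow shares (300 s each) build up a backlog, then three fast
# shares (1 s each) arrive while the backlog is still outstanding.
for real_time, timestamp, lag in simulate_backlog([300, 300, 300, 1, 1, 1]):
    print(real_time, timestamp, lag)
```

In this run every share's timestamp advances by exactly 59 seconds, including the three that really took 1 second. Anything that estimates hashrate from timestamp differences would see those fast shares as slow ones, which is the mechanism that keeps pushing the difficulty down.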
Since we currently have a single miner with about 3/4 of the network hashrate, if that miner chose to leave p2pool all at once, that might be enough to trigger this bug, although probably not as severely as I did.
I'll have to think a bit about how to fix this. It's probably not a trivial "Let's get rid of the clipping!" thing, since the clipping should be there to protect against malicious miners manipulating the timestamps and consequently the share difficulty. I think using something like Bitcoin's rule of not accepting blocks timestamped more than 2 hours in the future could be good, but applied to shares. That will require that everybody have reasonably accurate clocks, which might be worth requiring anyway.
Edit: Yeah, I think the right thing to do is probably to simply remove the clipping and also to reject any shares that are timestamped more than maybe 300 seconds in the future. Anyone who has an incorrect clock setting will either see a lot of error messages from rejecting others' shares plus a high share orphan rate if their clock is slow, or simply a high orphan rate if their clock is fast. As mining is basically a timestamping operation, there's a strong case to be made that keeping accurate clocks is a miner's duty.
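As a rough sketch of what that check might look like (this is not existing p2pool code; `is_timestamp_acceptable` and `MAX_FUTURE_SECONDS` are hypothetical names, and 300 is just the number floated above):

```python
import time

MAX_FUTURE_SECONDS = 300  # assumed threshold from the proposal, not a p2pool constant


def is_timestamp_acceptable(share_timestamp, now=None, max_future=MAX_FUTURE_SECONDS):
    """Return True unless the share is stamped too far ahead of the local clock.

    Hypothetical check for illustration; `now` defaults to the local system time.
    """
    if now is None:
        now = time.time()
    return share_timestamp <= now + max_future


true_time = 1500000000

# Node with an accurate clock accepts an honestly stamped share.
print(is_timestamp_acceptable(true_time, now=true_time))        # True
# Node whose clock is 400 s slow rejects that same honest share.
print(is_timestamp_acceptable(true_time, now=true_time - 400))  # False
# Miner whose clock is 400 s fast gets rejected by accurate nodes.
print(is_timestamp_acceptable(true_time + 400, now=true_time))  # False
```

The last two cases correspond to the slow-clock and fast-clock situations described above: a slow node rejects other people's shares, and a fast miner has its own shares rejected.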
The death spiral in graphs:
Note how the network traffic increases when the estimated hashrate hits zero, and how peers become unable to maintain a connection.
In order to recover from this bug, I had to restore a backup of the share chain that I made yesterday, so I lost about a day's worth of hashing at around 700 TH/s.