WARNING: There will be one or more ckpool restarts at 11:00 UTC - in about 1hr 15min
This will cause a failover and failback of all miners.
The failovers and failbacks will be very quick - less than a minute - so there shouldn't be much down time.
... going with the old days where ckpool used to crash and burn regularly, and we also got a lot of blocks at the same time ...
Let's see if this can get us some blocks
I'll be doing the restarts of ckpool to try and resolve the problem from the other day.
It seems to require a lot of 'pressure' to cause the problem since I can't reproduce it myself in a test environment.
Thus I'll be trying the expected solution first - update every node - and then restart every node after I update the main pool node.
If that doesn't resolve it then I'll switch things back again as needed to get it all working again.
Well, the good news is, updates are all in place. The bad news is that means we'll accept segwit transactions
Anyway all working after a rather major case of serendipity ...
Ignore below if you're not interested in the gory details
The problem turned out to be that the new version of ckpool can't handle the sockets passed from the old version, so it just sits there spinning its wheels doing nothing.
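For anyone wondering what "passing sockets" from one process to another looks like in general, here is a minimal Python sketch. It is not ckpool's code (ckpool is written in C) and it assumes the usual Unix approach of sending file descriptors over a Unix-domain socket. The names `hand_over` and `take_over` are made up for illustration, and a `socketpair` in a single process stands in for the old and new versions. It needs Python 3.9 or later on a Unix-like system.

```python
import socket


def hand_over(channel, fds):
    """Old side: send open file descriptors across a Unix-domain socket."""
    socket.send_fds(channel, [b"handover"], fds)


def take_over(channel, max_fds=16):
    """New side: receive the message and the list of passed descriptors."""
    msg, fds, _flags, _addr = socket.recv_fds(channel, 1024, max_fds)
    return msg, fds


# A socketpair stands in for the link between the old and new process.
old_side, new_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The "old version" owns a listening socket that miners would connect to.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen()
port = listener.getsockname()[1]

# Old version passes the listener, then closes its own copy.
hand_over(old_side, [listener.fileno()])
msg, fds = take_over(new_side)
listener.close()

# New version wraps the received descriptor and carries on accepting.
inherited = socket.socket(fileno=fds[0])
client = socket.create_connection(("127.0.0.1", port))
conn, _peer = inherited.accept()
client.sendall(b"share")
received = conn.recv(5)

for s in (client, conn, inherited, old_side, new_side):
    s.close()
```

The descriptor arrives intact, but the receiving side still has to know what each one is and what state it is in, which is the kind of thing that can differ between two versions of the same program.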
This time I decided to do a full upgrade of all the nodes, in advance, and have them ready to be restarted into the latest code if needed.
However, I made sure one node that no one mines to was on the latest code before the main restart, to see if that had any effect during the main pool restart.
At 11:01:13 UTC I started the new version on the main pool, including my changes and the hack/fix for the deadlock. It took control from the old version and then it sat there doing exactly the same thing as last time - nothing.
I tried restarting the 'test' node again to see if it could connect ... nope.
So at 11:03:12 UTC, 2 minutes after the update, I 'thought' I started the old version again, and everything switched back to it OK.
Since I had updated all the nodes' code, I then proceeded to restart every node, since it was worth doing that then also.
By 11:03:53 all nodes were running the new code, and had reconnected to the pool properly, except one.
In the process I'd skipped the NYA node and it was sitting there not working properly either ... which I noticed about 30 seconds after I'd done everything, and restarted it also.
Now everything was back to working ok ... and I 'thought' we were back with the old code on the main pool again (but every node updated)
Due to luck, I'd missed swapping in the old executable when I restarted the main pool the 2nd time ... ... ... ... I was rather annoyed about it all ... ... ... ... but it was all working ok
Thus I got the answer to what was going on and how to get around it if it happens again.
Mine on