vtcpool.co.uk downtime (now resolved) - post mortem

OK, so the issues on vtcpool.co.uk are fully resolved now. I figured I owed you all an explanation for the outage.
Yesterday, the pool server had to be upgraded. With the huge growth of the VTC site and network, the pool was getting badly overloaded - and it came out of nowhere: one minute everything was fine, and within a few hours users and hashrate had doubled.
I upgraded to a bigger VPS yesterday, and that lasted, oh, about 12 hours - if that. Users and hashrate continued to grow at an unbelievable rate, so I talked to the MPOS developers about scaling really big pools and elected to move the MySQL server onto a separate, top-of-the-range VPS with 8 cores, keeping stratum and the web frontend (as well as the p2pool node) on the quad-core server that runs the main site.
I prepared a process, tested it, and at 12:05 today took down the site to quiesce the database, moved it to the new server, and repointed the frontend at the new backend. I tunneled the connections from the frontend to the backend over an SSH link, to protect any confidential user data in the traffic between the two boxes from being sniffed off the wire at the datacentre.
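For anyone curious, this kind of tunnel is a one-liner with OpenSSH local port forwarding; a minimal sketch, where the hostname, user, and local port are placeholders rather than the pool's real details:

```shell
# Forward local port 3307 on the frontend to MySQL (3306) on the backend.
# -N: run no remote command, just hold the tunnel open.
ssh -N -L 3307:127.0.0.1:3306 tunnel@db.example.com
```

The frontend's MySQL client is then pointed at 127.0.0.1:3307, so everything travelling between the two boxes rides inside the encrypted SSH session.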
When I brought the server back up, SQL load was unbelievable - load averages of around 80.
I thought at first it was a stratum issue, so stratum went up and down a few times as we tried to increase logging verbosity and find the issue. I could see users connecting, but no more than I would expect, yet the load on the backend stayed insane with nothing in the stratum logs to explain why.
I eventually tracked the issue down. The frontend (the website itself) uses PHP extensively. PHP5-FPM has a concurrency model that spawns numerous child processes - these are what actually serve the user requests - and to improve performance I had configured a large number of these child processes while the SQL was local to the frontend.

With the SQL now remote, that number of child processes was causing issues with opening sockets through the SSH tunnel. That was what was loading the database to such an extent, and I have now resolved it.
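For anyone running a similar setup, the child-process count lives in the PHP-FPM pool config; a rough sketch with illustrative values, not the pool's actual settings:

```ini
; /etc/php5/fpm/pool.d/www.conf (values here are examples only)
pm = dynamic
pm.max_children = 20      ; keep this modest when MySQL is remote
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 8
```

Every child that touches the database opens its own connection, so a limit that was harmless against a local socket can hammer a remote backend.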
By the way, you all got paid twice on one round of the payouts. What I think happened is that, during all the confusion with SQL grinding to a halt, a payout cron-job ran, but SQL was so slow that the round didn't get marked as "paid" in the database before the next time the script ran.
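One common guard against this - sketched here in Python with hypothetical names, not the pool's actual MPOS code - is a non-blocking exclusive lock, so a cron run that starts while the previous one is still stuck waiting on slow SQL simply exits instead of paying the same round again:

```python
import fcntl


def acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file handle if we got the lock, or None if a
    previous payout run is still holding it.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fh
    except BlockingIOError:
        fh.close()
        return None


def run_payouts(lock_path="/tmp/payout.lock"):  # hypothetical lock path
    lock = acquire_lock(lock_path)
    if lock is None:
        # The previous cron run hasn't finished (e.g. SQL is crawling);
        # bail out rather than paying the same round a second time.
        return False
    try:
        # ... mark the round "paid" in the database, then send the coins ...
        return True
    finally:
        lock.close()  # closing the handle releases the flock
```

Marking the round as paid *before* sending, inside the locked section, closes the other half of the race: a crash then errs on the side of not paying rather than paying twice.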
As a result, that payout was made twice. I have taken a hit of about $1150 at today's price - I had to send 315 VTC to the pool wallet from my own private funds to ensure users' payouts wouldn't fail on the next run:

http://explorer.vertcoin.org/tx/c2fc3f045e90e59da14ce3a6fe13abdcb7e7f9802dd3e2d8ee947e1fa0f74c92

Although I'd like to make it all up to you, and am happy to let it ride, this is a lot of Vertcoin to lose, and I have been working hard to maintain a high level of service - including these server upgrades, which are costing me around $250 a month on their own, not to mention that keeping up with network growth has been a full-time job for me these past few days.
I chose to run a pool, and as a result I have to deal with stuff like this. But if any of you are feeling particularly generous, and notice two payments of very similar size made to you around 15:30 (GMT/UTC) today, which was when the double payout happened (they may also appear at a different time, whenever they actually arrive with you - I'm not sure), I'd be hugely grateful if some of you might tip a bit back:

ViPBVm4sbXT38h9J23GkAWfNKiuPwVUYQX

Once again, sorry for the inconvenience and downtime.