... and we're back !!!!!!
Hello Everybody,
first of all, allow me to apologize for the extended downtime; we were hit by what I can only describe as a perfect storm of circumstances, of both a technical and a personal nature. I believe that we have fixed the issues we faced, and in the process we discovered 2 nasty bugs which were hiding in our codebase and which were only triggered by a specific set of circumstances.
Since I believe that we owe our users an explanation, rather than just tell you that things are fixed, allow me to explain what happened and why things took so long:
1) Yesterday we experienced some temporary Cloudflare issues which caused service interruptions for some people
2) Once these were supposedly resolved, the system continued to degrade periodically; the reason for this was not clear at the time it happened.
3) We started digging around and, because of the earlier Cloudflare issue, went down 2 false trails of investigation
4) #3 caused us to waste a lot of time digging through various system components which in the end had nothing to do with the issue
5) Around 6AM, after being awake for about 24 hours, I was starting to fall asleep in my chair and decided that I had to get a few hours of sleep (all-nighters are definitely easier when you're college-aged)
6) At 8AM my wife woke me up with the news that our daughter, who the evening before was only 50% sick, had now developed a fever of 39.5 degrees Celsius (103 degrees Fahrenheit for you non-metric folks). So off we were to the doctor.
7) I returned home around 11AM, looked at the code and within half an hour found the culprit.
8) Did some testing and decided that things now worked OK.
So, it was Cloudflare, lack of sleep, my daughter's illness and a stupid coding error I had made, all of which combined to prevent a speedy resolution of the issues we faced. The coding error in question was triggered by the initial Cloudflare outage and caused bad hashes to propagate throughout the system; my code attempting to fix this got stuck in an infinite loop, which explains the 'death' of the web server once this occurred.
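As an illustration only (this is not the site's actual code, and every name in it, such as repair_hash, fetch_hash and MAX_REPAIR_ATTEMPTS, is hypothetical), a repair routine can cap its retries so that a hash which never validates cannot keep the server spinning forever. The sketch assumes a valid hash is a 64-character lowercase hex string:

```python
MAX_REPAIR_ATTEMPTS = 3  # hypothetical limit; pick whatever suits the system


def is_valid_hash(value: str) -> bool:
    # Assumption: a valid hash is 64 lowercase hexadecimal characters.
    return len(value) == 64 and all(c in "0123456789abcdef" for c in value)


def repair_hash(record: dict, fetch_hash, max_attempts: int = MAX_REPAIR_ATTEMPTS) -> bool:
    """Try to replace a bad hash, giving up after max_attempts tries.

    fetch_hash is a caller-supplied function (hypothetical) that returns
    a fresh hash string for the given record id.
    """
    for _ in range(max_attempts):
        candidate = fetch_hash(record["id"])
        if is_valid_hash(candidate):
            record["hash"] = candidate
            return True
    # Give up instead of looping forever; the caller can log the failure
    # or force the user to log in again.
    return False
```

The design choice is simply to prefer a bounded for loop over "while the hash is bad, try again": if the upstream source keeps returning garbage, the function returns False and the web server stays alive.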
So, I apologize and accept the bulk of the blame for the extended outage. The good news is (as stated above) that this caused us to spot 2 nasty errors in our codebase which are now also fixed.
Anyhow, thank you for being patient during this difficult and messy time. I promise to try to get some more regular sleep in the future which should help with avoiding stupid coding errors.
PS: We still have an open ticket with Cloudflare about 1 specific issue which they will hopefully address soon. It concerns a minor problem which should not have a site-wide impact on service.
PS2: I still have to restart some services later today but this should only be an outage of a few seconds.
PS3: If you access our system and you're automatically logged in (using the credentials from your last visit), please log out and then log back in. This should force restoration of your server hash and some other parameters to a sane/valid value.