Thanks for the tip, mxnsch. I'll definitely look into tweaking some of the values in the future. I did figure out what the problem was, though. Turns out the outlet I had the miner plugged into died, but not completely. Once I turned the miner off and unplugged it, I tested the outlet with a multimeter. The output was, and still is, 15VAC. What's a bit more troubling is that all the other outlets in half of my house are now over-volting, including ones on the same circuit as the dead outlet. All are reading between 155 and 165VAC, which I know is dangerous for sensitive electronics. They also show a higher amperage than my multimeter, which has a cap of 20A, can read. I'm 99% certain that the miner did NOT cause this problem, but rather was the first thing to be 'punished' by it.
But with the good news comes bad news... I plugged the miner back into a stable circuit, turned it on, and cs3 (Board #4) is dead. Board 4 is the one that hit 101*C, and I would imagine that hitting the boiling point is the kiss of death for any piece of equipment. I'm planning on opening up the rig tomorrow to see if there's any visible damage, as well as reconnecting all of the cables. I will also reflash the SD card just in case. But, as it stands right now, the board is dead as far as I can tell. It always ran hot to begin with, averaging close to 6*C hotter than the other 5 boards, even with super cold winter air. My theory is that there's probably little to no thermal paste on there...
Last little question, but does anyone know if replacement boards are sold anywhere? Or if anyone, most likely someone on this forum, offers repairs? I don't know the specifics of the multiple ways heat can damage a board, so I'm not sure if it's repairable. Poor old rig has been mine for less than 3 months and this happens lol
Anyways, thanks for your help guys. I'll check back in a few days if I find out anything interesting.
Well, at least you got some answers. As for bad boards, normally they will blow a chip and then it is done; they can't communicate anymore. I have fixed a few, but most of the time it takes more than one chip to fix. As far as thermal paste goes, there were problems with all the ones I worked on. I tore them all down, and for some unknown reason there was thermal paste on top of a thermal pad on the bottom, which is worse. I put paste on the bottom of the board under the chips and on top of them, which helped a lot with temps, and just got rid of the pads altogether.
I think you can get replacement boards from zoomhash. They were like $85 when I contacted them, which means it will take a bit of time to pay it off. I wouldn't bother with a reflash. Are the LEDs different on the dead board?
mjgraham, initially the board had no LEDs at all, but starting today the board 'woke up' and now shows signs of life. The 5 working boards all have solid green/red lights while hashing, and the dead board just continues to flash its LEDs. It's almost as if the Pi isn't recognizing it, or it could be down to what you mentioned, where once the board blows a chip, it just stops communicating.
I had a theory that perhaps the board might not be receiving sufficient power. When the problem first started, I received this message for all 6 boards, which were all "on" but being powered by the dead power outlet:
[2016-03-27 18:48:49] ACK(cs0) timeout:cmd_POWER_ON_BCAST-0.0553s
[2016-03-27 18:48:49] SPI(cs0) no device
[2016-03-27 18:48:49] ACK(cs0) timeout:cmd_RESET_BCAST - 0.27 ms
[2016-03-27 18:48:49] Failure(cs1)(2): missing ACK for cmd 0x02
[2016-03-27 18:48:49] ACK(cs1) timeout:cmd_POWER_ON_BCAST-0.0575s
[2016-03-27 18:48:49] SPI(cs1) no device
[2016-03-27 18:48:49] ACK(cs1) timeout:cmd_RESET_BCAST - 0.27 ms
[2016-03-27 18:48:49] Failure(cs2)(2): missing ACK for cmd 0x02
[2016-03-27 18:48:49] ACK(cs2) timeout:cmd_POWER_ON_BCAST-0.0487s
[2016-03-27 18:48:49] SPI(cs2) no device
[2016-03-27 18:48:49] ACK(cs2) timeout:cmd_RESET_BCAST - 0.27 ms
[2016-03-27 18:48:49] Failure(cs3)(2): missing ACK for cmd 0x02
[2016-03-27 18:48:49] ACK(cs3) timeout:cmd_POWER_ON_BCAST-0.0499s
[2016-03-27 18:48:49] SPI(cs3) no device
[2016-03-27 18:48:49] ACK(cs3) timeout:cmd_RESET_BCAST - 0.27 ms
[2016-03-27 18:48:49] Failure(cs4)(2): missing ACK for cmd 0x02
[2016-03-27 18:48:49] ACK(cs4) timeout:cmd_POWER_ON_BCAST-0.0489s
[2016-03-27 18:48:49] SPI(cs4) no device
[2016-03-27 18:48:49] ACK(cs4) timeout:cmd_RESET_BCAST - 0.27 ms
[2016-03-27 18:48:49] Failure(cs5)(2): missing ACK for cmd 0x02
[2016-03-27 18:48:49] ACK(cs5) timeout:cmd_POWER_ON_BCAST-0.0509s
[2016-03-27 18:48:49] SPI(cs5) no device
[2016-03-27 18:48:49] ACK(cs5) timeout:cmd_RESET_BCAST - 0.27 ms
After fixing the power problem, 5 boards work, but board #4 (cs3) still gives that exact message. I'm also starting to think that I should reflash the SD card because, ever since the problem started a week ago, the graphs on the main page refuse to repopulate. All 8 graphs show Saturday (March 26) at 6:00 PM to Sunday at 6:00 PM, right when the temperatures hit 100*C. The same goes for the 'Historical Statistics', 'Logs - Messages', and 'Logs - Syslog' pages. As I see it, there are 3 things that could be the problem: a corrupted SD card, a bad Pi or ribbon cable connection to cs3, or a dead board altogether. However, if the board is dead, that doesn't explain the non-responsiveness of the logging system. Thanks again mjgraham.
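For anyone comparing logs like the one quoted in this post, here is a minimal Python sketch (not part of the miner software) that groups the lines by chip-select and lists the ones reporting "SPI ... no device". The regular expression and the function name summarize are my own assumptions, based only on the line layout in that excerpt; other firmware versions may format their lines differently.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt: "[timestamp] KIND(csN) message"
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<kind>\w+)\(cs(?P<cs>\d+)\)(?P<rest>.*)$"
)

def summarize(lines):
    """Group messages by chip-select index and list boards reporting 'no device'."""
    by_board = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not match the assumed layout
        by_board[int(m.group("cs"))].append(
            (m.group("kind"), m.group("rest").strip())
        )
    missing = sorted(
        cs for cs, msgs in by_board.items()
        if any(kind == "SPI" and "no device" in rest for kind, rest in msgs)
    )
    return dict(by_board), missing

# Four lines copied from the log for cs3
sample = """\
[2016-03-27 18:48:49] Failure(cs3)(2): missing ACK for cmd 0x02
[2016-03-27 18:48:49] ACK(cs3) timeout:cmd_POWER_ON_BCAST-0.0499s
[2016-03-27 18:48:49] SPI(cs3) no device
[2016-03-27 18:48:49] ACK(cs3) timeout:cmd_RESET_BCAST - 0.27 ms
""".splitlines()

boards, missing = summarize(sample)
# Board numbers are cs + 1 here, since cs3 is Board #4 on this rig
print("chip-selects with no device:", missing)
print("board numbers:", [cs + 1 for cs in missing])
```

Running the full log from the dead outlet through this would list all six chip-selects, while a log taken on the good circuit should list only cs3 if the board really is the only thing not answering.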