OVERLOAD!
Background:
In the early days of my mining project, I had managed to get 2 rigs stable.
They were built in succession, so RIG1 had been running non-stop for some 7~8 weeks longer than RIG2.
These rigs used a few different PSUs, all Corsair 1000W units: RM1000x, HX1000, and HX1000i.
At about the 3-month point, RIG1 hit a bump: a "random" reboot. (It was running a USB watchdog, so I was unsure if that was the trigger, and disabled it for testing purposes.) Equally, the event logs only mentioned Kernel-Power, which is pretty vague.
Fortunately, things went downhill fast, and I stripped the rig down for inspection.
Cutting to the chase here: the problem was traced to the SATA/peripheral 6-pin plug on the PSU itself.
In this case, the connector had melted at the 12V pin, the pin had welded itself into the cable's socket, all the insulation on the plug/socket had crumbled, and the melting extended some 2cm up the wire itself.
Now, I had failed to appreciate a number of things, and these are worth sharing.
The plugs are rated 10A, but in reality one should not plan to run anywhere close to that "limit" for 24/7 operation.
Usually about half that is a safe maximum.
Why I didn't pay attention to a glaringly obvious piece of stupidity on Corsair's part is my own fault.
The Corsair cables (and the cables of all the PSU vendors, for that matter) all have 4 Molex connectors in parallel. Now, each Molex is rated at 10A, so how can one expect to pull even half of that, x4 (20A), through a single 10A plug/socket on the PSU?
Or, if you want kindergarten logic, pull 4x 10A (Molex) = 40A through a single 10A outlet?
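To put numbers on that, here's a quick back-of-envelope check as a Python sketch. The 10A rating and the "halve it for 24/7 use" rule come from above; everything else is just arithmetic, not measured values:
[code]
# Back-of-envelope check of the plug/cable mismatch described above.
PLUG_RATING_A = 10.0      # rating of the single plug/socket at the PSU end
SAFE_FACTOR = 0.5         # derate to roughly half for continuous operation
HEADERS_PER_CABLE = 4     # Molex/SATA headers wired in parallel on one cable

safe_limit = PLUG_RATING_A * SAFE_FACTOR             # 5 A continuous
at_half_rating = HEADERS_PER_CABLE * safe_limit      # 20 A back through one plug
at_full_rating = HEADERS_PER_CABLE * PLUG_RATING_A   # 40 A, the "kindergarten" case

print(f"Safe continuous limit of the PSU-side plug: {safe_limit:.0f} A")
print(f"All 4 headers at half-rating:               {at_half_rating:.0f} A")
print(f"All 4 headers at full rating:               {at_full_rating:.0f} A")
# Either way, the single plug at the PSU end is asked to carry 2-4x
# what it can safely handle around the clock.
[/code]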
I had stupidly assumed the PCIe risers were mostly inert, as power would, on the whole, be delivered directly to the GPU.
WRONG!
In all cases I had used all 4 of those cable headers, whether Molex or SATA power, and in all cases there were clear signs of damage after 3 months (under 2 months on RIG2!).
I've since rewired all my rigs to a maximum of 2 devices (PCIe risers, SSDs, relays, fan banks) per cable.
The Corsair PSUs in all cases failed to trip their safety cutouts; at least, the HX1000i didn't log that it had, and the RM1000x and HX1000 have no data link, so maybe they did, but I have no way to tell.
But equally, this kind of failure is unlikely to cause a trip-out, because the PSU is actually delivering LESS power: current flow is inhibited by the degrading plug/socket!
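If you want to see why nothing trips, model the bad contact as a resistor in series with the load. This is a toy sketch; the rail voltage, load, and contact resistances are all made-up illustrations, not measurements of any real connector:
[code]
# Toy series circuit showing why over-current protection stays quiet:
# a resistive contact REDUCES the current drawn, while cooking itself
# with I^2 * R heat.
RAIL_V = 12.0
LOAD_OHM = 1.2  # a healthy ~10 A load on the 12 V rail

for contact_mohm in (0, 50, 200):  # clean contact -> badly oxidised
    r_contact = contact_mohm / 1000.0
    current = RAIL_V / (LOAD_OHM + r_contact)  # simple series circuit
    heat = current ** 2 * r_contact            # dissipated in the plug
    print(f"{contact_mohm:4} mOhm: {current:5.2f} A drawn, "
          f"{heat:4.1f} W burned in the plug/socket")

# The PSU sees the current FALL from 10 A to ~8.6 A as the contact
# degrades, so no cutout ever fires - meanwhile ~15 W is being dumped
# into a connector designed to dissipate next to nothing.
[/code]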
I shipped 3 PSUs back to Corsair.
Signs to look for.
Has a previously stable rig become unstable, even though you didn't change anything (drivers, hardware, etc.)?
Do you have more than 2 devices on a single power cable?
Checks/Inspections
Power off all PSUs, remove AC plugs, then remove the cable-sockets from the PSU outlet plugs.
Inspect the pins in the PSU outlet plugs.
Warning signs include: Pins no longer shiny/silver/gold, but dull, oxidised, black or burned.
Gently flex the cable near the FAR end (farthest from the PSU); this should give you a feel for GOOD cable flex, i.e. normal for THAT cable. Now repeat that flex test at the other end of the cable, right where it comes out of the PSU plug.
If you feel less flex, or the cable is stiff like a solid rod with no flex at all, that is a clear sign of overload.
This is a vicious cycle: as you overload, the wires heat up, and they expand, abrade against each other, and oxidise. They do this most right next to any connection, because the break in the insulation there allows the ingress of air, and the oxygen accelerates the oxidation. The cables lose flexibility because of this, and also heat up more, accelerating their demise; in extreme cases this will melt away the wire insulation, and even the plug/socket.
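The runaway shape of that cycle is easy to see in a toy model where heat raises resistance and resistance raises heat. The constants below are invented purely to show the shape of the curve, not to model a real cable:
[code]
# Toy feedback loop: I^2 * R heating accelerates oxidation, oxidation
# raises R, which raises the heating. GROWTH_PER_WATT is an invented
# constant - only the runaway shape matters, not the exact numbers.
current_a = 8.0           # steady draw through the degrading joint
r_ohm = 0.010             # starting contact/wire resistance (10 mOhm)
GROWTH_PER_WATT = 0.002   # assumed ohms of added resistance per watt-week

for week in range(1, 13):
    heat_w = current_a ** 2 * r_ohm    # heat generated in the joint
    r_ohm += GROWTH_PER_WATT * heat_w  # hotter joint oxidises faster
    print(f"week {week:2}: {r_ohm * 1000:5.1f} mOhm, {heat_w:4.2f} W")

# Resistance and heat chase each other upward, gently at first and then
# faster and faster - which fits a rig that runs fine for a couple of
# months and then fails "out of nowhere".
[/code]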
If there was ANY human element during manufacture (skin oils on the wires during handling or crimping, poor stripping, strand damage, poor crimps, bruising), these will drastically increase the likelihood of failure, especially if you pull more current through them.
ALL these signs were present on all my RIGs!!!
All have been rewired with a maximum of 2 PCIe risers per power cable, and 6 weeks on, all were reinspected and showed no signs of degradation. A further inspection was done 2 months later, and again there were no signs of overload.
I have a feeling there will be MANY miners out there who didn't give much thought to plug/socket ratings and trusted that the vendors would be using safe practice. WRONG!
Think: National Lampoon's Christmas Vacation (the Christmas lights scenes), because EVGA, Corsair, and all the other vendors are shipping time-bomb cables, with NO WARNINGS on them, or in the PSU manuals.
Using 3 of the 4 headers might be OK; I opted for a maximum of 2, because this is standard/safe practice in situations like this.
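For anyone who wants to sanity-check their own wiring, here's the kind of per-cable budget I'd apply, as a hedged sketch. The example device draws are rough assumptions (a riser can feed up to ~5.5A of 12V slot power under the PCIe spec; SSDs and fans far less), so measure your own if in doubt:
[code]
# Per-cable budgeting sketch. SAFE_LIMIT_A is half the 10 A plug
# rating, per the rule of thumb above; the example device draws are
# rough assumptions, not measurements.
SAFE_LIMIT_A = 5.0  # derated limit for the single PSU-side plug, 24/7

def cable_ok(draws_a):
    """True if the summed 12 V draw stays within the derated limit."""
    total = sum(draws_a)
    print(f"total {total:.1f} A vs limit {SAFE_LIMIT_A:.1f} A")
    return total <= SAFE_LIMIT_A

print(cable_ok([2.0, 2.0]))            # 2 risers at a modest load -> True
print(cable_ok([2.0, 2.0, 2.0, 2.0]))  # all 4 headers loaded     -> False
[/code]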
REPEAT:
Signs to look for.
Has a previously stable rig become unstable, even though you didn't change anything (drivers, hardware, etc.)?
Do you have more than 2 devices on a single power cable?
Finally: Don't jump to conclusions about Claymore, the OS, or drivers until you're satisfied the hardware is OK!
Good luck everyone.
Sorry to rehash the whole post, but damn, I wish you had posted this two months ago - I spent weeks and plenty of cash changing so many components, only to find them burnt at the SATA connections on the PSU port. We always look at risers etc., but the real cost here is the PSU, as they cost the same as a GPU, and based on this principle a 6-card rig isn't safe anymore. Cheers iSux - thanks for sharing