I've put a few commits in my git:
https://github.com/kanoi/cgminer/that add a simple device history that is accessible via the new API command 'notify'
Compiling my git reports itself as 2.3.1k
You can see it with
echo -n notify | nc 127.0.0.1 4028 ; echo
The base code change adds a few extra fields and counters to the device structure (that are all reported by the API)
Including: per device: last well time, last not well time, last not well reason, and counters for each of the reasons meaning how many times they have happened (e.g. Device Over Heat count, Device Thermal Cutoff count among others)
I ran for 30 minutes at stock gpu clocks and several gpu threads restarted... again on the GPU Managment screen, it only showed that gpu 5 had been re-initialized according to the times tamps. Seems to be random cards.... first time i looked it was 2,3 and 5. I restarted and this time its 1,3,4, and 5... after 30 minutes at stock gpu clock.
Here's the output of your command:
STATUS=S,When=1332519326,Code=60,Msg=Notify,Description=cgminer 2.3.1k|NOTIFY=0,Name=GPU,ID=0,Last Well=1332519326,Last Not Well=0,Reason Not Well=None,Thread Fail Init=0,Thread Zero Hash=0,Thread Fail Queue=0,Dev Sick Idle 60s=0,Dev Dead Idle 600s=0,Dev Nostart=0,Dev Over Heat=0,Dev Thermal Cutoff=0|NOTIFY=1,Name=GPU,ID=1,Last Well=1332519326,Last Not Well=1332518925,Reason Not Well=Device idle for 60s,Thread Fail Init=0,Thread Zero Hash=0,Thread Fail Queue=0,Dev Sick Idle 60s=1,Dev Dead Idle 600s=0,Dev Nostart=0,Dev Over Heat=0,Dev Thermal Cutoff=0|NOTIFY=2,Name=GPU,ID=2,Last Well=1332519326,Last Not Well=0,Reason Not Well=None,Thread Fail Init=0,Thread Zero Hash=0,Thread Fail Queue=0,Dev Sick Idle 60s=0,Dev Dead Idle 600s=0,Dev Nostart=0,Dev Over Heat=0,Dev Thermal Cutoff=0|NOTIFY=3,Name=GPU,ID=3,Last Well=1332519325,Last Not Well=1332518862,Reason Not Well=Device idle for 60s,Thread Fail Init=0,Thread Zero Hash=0,Thread Fail Queue=0,Dev Sick Idle 60s=1,Dev Dead Idle 600s=0,Dev Nostart=0,Dev Over Heat=0,Dev Thermal Cutoff=0|NOTIFY=4,Name=GPU,ID=4,Last Well=1332519326,Last Not Well=1332517934,Reason Not Well=Device idle for 60s,Thread Fail Init=0,Thread Zero Hash=0,Thread Fail Queue=0,Dev Sick Idle 60s=2,Dev Dead Idle 600s=0,Dev Nostart=0,Dev Over Heat=0,Dev Thermal Cutoff=0|NOTIFY=5,Name=GPU,ID=5,Last Well=1332519326,Last Not Well=1332518716,Reason Not Well=Device idle for 60s,Thread Fail Init=0,Thread Zero Hash=0,Thread Fail Queue=0,Dev Sick Idle 60s=1,Dev Dead Idle 600s=0,Dev Nostart=0,Dev Over Heat=0,Dev Thermal Cutoff=0|
I'll try running on a different pool other than gpumax again see what happens.