1. Read the log and find which GPU hung or threw an error first. Say it is GPU 0.
2. Disable that card. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. If it is not, try the other cards one by one until you find it.
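If the log is long, step 1 can be scripted. This is only a rough sketch: the log line format it matches ("GPU 3 hangs ...", "GPU #1: error ...") is an assumption, not what any particular miner is guaranteed to write, and the name first_faulty_gpu is made up for illustration. Adjust the pattern to your own log.

```python
import re

# Assumed (hypothetical) log format: a line that names a GPU and then
# contains "hang" or "error", e.g. "GPU 3 hangs in OpenCL call".
# Change this pattern to match what your miner actually logs.
GPU_FAULT = re.compile(r"GPU\s*#?(\d+).*?(hang|error)", re.IGNORECASE)


def first_faulty_gpu(log_lines):
    """Return the index of the first GPU seen on a hang/error line, or None."""
    for line in log_lines:
        match = GPU_FAULT.search(line)
        if match:
            return int(match.group(1))
    return None
```

Pass it the lines of your log file, for example first_faulty_gpu(open("log.txt")), where log.txt is a placeholder for your real log. The card it returns is the one to disable first in step 2.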
Or you can just disable the watchdog in your bat file by setting -wd 0. In this case the sick card will stop but your other cards keep working. Then touch the cards and feel the temperature: the one that has stopped will be cooler than the rest, so you will know which one is bad.
If you cannot find a card to blame, you may need to reinstall Windows and update it to the latest version.
Hope this helps.
Okay. I would like to do this, but I am not sure where you disable just one of the GPUs.
According to my log, device 3 has been the first one to stop the last two times, so I am guessing that it is the problem. I'm running all cards again now to see if the same device fails. I'm also trying to figure out how to disable the device and determine which card it actually is.
Use NVIDIA Inspector to find the GPU number in question.
What miner app are you using?
Assuming ccminer, use the -d flag to choose which devices to use:
-d 0,1,3,4,5
With ccminer you can also trim the intensity per card:
-i 17,17,13,17,17
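If you end up swapping cards in and out a lot while testing, the two lists above can be generated rather than typed by hand. This is a minimal sketch: the function name and its parameters are made up for illustration, and it assumes the -i values are matched positionally to the devices listed with -d, as in the example above.

```python
def ccminer_device_args(device_count, skip, intensity=17, overrides=None):
    """Build the -d and -i arguments, leaving out the device indices in `skip`.

    `overrides` maps a device index to its own intensity; every other
    device gets the default `intensity`. Assumes the -i list lines up
    one-to-one with the devices given to -d.
    """
    overrides = overrides or {}
    devices = [d for d in range(device_count) if d not in skip]
    intensities = [overrides.get(d, intensity) for d in devices]
    return [
        "-d", ",".join(str(d) for d in devices),
        "-i", ",".join(str(i) for i in intensities),
    ]


# Six cards, device 2 disabled, device 3 turned down to intensity 13.
print(" ".join(ccminer_device_args(6, {2}, overrides={3: 13})))
```

With those inputs it prints the same -d and -i values as the example above.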