I am in the process of finding the optimal voltage for my PowerColor HD 7850 (AX7850 2GBD5-DH) with 85.3% ASIC quality - and learning a lot of things on the way :-)
I have two cards (7850 & 7970), both undervolted, so I can never be 100% sure which one caused a crash (or do you have a criterion?). None of the tools is perfect, either: GPU-Z, for example, is free and amazing, and it beautifully logs all hardware values into a file once per second, but that logfile only ever covers one card at a time, which is really annoying.
Please help me with my reasoning now. If you have knowledge that can help us to understand this better, please share.
The cgwatcher logfile says:
... after 5 hours of happy mining ...
[05/01/2014 05:30:53] CGMiner (4676): Network difficulty is now 320.
[05/01/2014 05:31:23] CGMiner (4676): Network difficulty is now 253.
[05/01/2014 05:39:50] CGMiner (4676): Network difficulty is now 3,195.
[05/01/2014 05:43:51] CGMiner (4676): Network difficulty is now 253.
[05/01/2014 05:49:33] CGMiner (4676): Network difficulty is now 47.
[05/01/2014 05:49:54] CGMiner (4676): Network difficulty is now 49.
[05/01/2014 05:50:14] CGMiner (4676): Network difficulty is now 51.
[05/01/2014 05:51:04] CGMiner (4676): Network difficulty is now 54.
[05/01/2014 05:51:14] CGMiner (4676): Network difficulty is now 3,195.
[05/01/2014 05:54:25] CGMiner (4676): Network difficulty is now 253.
[05/01/2014 05:55:25] CGMiner (4676): GPU0 (AMD Radeon HD 7900 Series) status is INACTIVE
[05/01/2014 05:55:25] CGMiner (4676): GPU0 (AMD Radeon HD 7900 Series) status is UNKNOWN
[05/01/2014 05:55:28] CGMiner (4676): Process closed.
So here, the 7850 is not even mentioned.
But at 5:55:25 something happens to the 7970 card, for sure - and within 3 seconds cgminer is dead.
Does a Catalyst driver crash look like that?
I don't suspect the 7970 card to be the cause, because I underclocked it first, and it has been running stable for more than a day now.
But still, you never know. They sit close together; could one overheating trigger the other one?
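To pin down the timing in the cgwatcher excerpt above, here is a minimal Python sketch that parses those lines and measures the gap between the first GPU status message and "Process closed". It assumes the timestamps are day/month/year (so 05/01/2014 is 5 January 2014), and the function names are made up for illustration; they are not part of CGWatcher.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
# [DD/MM/YYYY HH:MM:SS] CGMiner (pid): message
LINE_RE = re.compile(
    r"^\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\] CGMiner \(\d+\): (.*)$"
)

def parse_cgwatcher(lines):
    """Return (timestamp, message) pairs for every line that matches."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events

def seconds_from_status_to_close(events):
    """Seconds between the first 'status is' message and 'Process closed'."""
    status = next((t for t, msg in events if "status is" in msg), None)
    closed = next((t for t, msg in events if msg.startswith("Process closed")), None)
    if status is None or closed is None:
        return None
    return (closed - status).total_seconds()

# The last four lines of the excerpt above.
log_tail = [
    "[05/01/2014 05:54:25] CGMiner (4676): Network difficulty is now 253.",
    "[05/01/2014 05:55:25] CGMiner (4676): GPU0 (AMD Radeon HD 7900 Series) status is INACTIVE",
    "[05/01/2014 05:55:25] CGMiner (4676): GPU0 (AMD Radeon HD 7900 Series) status is UNKNOWN",
    "[05/01/2014 05:55:28] CGMiner (4676): Process closed.",
]
print(seconds_from_status_to_close(parse_cgwatcher(log_tail)))  # 3.0
```

Run over the whole cgwatcher.log, the same two functions would show whether earlier crashes (if any) had the same short gap.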
I have more insight into the other card, the PowerColor 7850. Fortunately, GPU-Z logged that one.
The columns are:
* Date
* GPU Core Clock [MHz]
* GPU Memory Clock [MHz]
* GPU Temperature [°C]
* Fan Speed (%) [%]
* Fan Speed (RPM) [RPM]
* GPU Load [%]
* Memory Usage (Dedicated) [MB]
* Memory Usage (Dynamic) [MB]
* VDDC [V]
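For anyone who wants to analyse such a log with a script, here is a small Python sketch that turns one row into named fields, in the column order listed above. It assumes the rows are whitespace-separated exactly as pasted in this post (the raw GPU-Z file may use a different delimiter), and the field names are just shorthand chosen for this sketch, not GPU-Z's own.

```python
from datetime import datetime
from typing import NamedTuple

class Sample(NamedTuple):
    """One GPU-Z log row; field names are shorthand for the columns above."""
    time: datetime
    core_mhz: float
    mem_mhz: float
    temp_c: float
    fan_pct: float
    fan_rpm: float
    load_pct: float
    mem_dedicated_mb: float
    mem_dynamic_mb: float
    vddc_v: float

def parse_row(row):
    """Parse a row like '2014-01-05 05:33:00 968 1289 62 84 3839 94 1123 41 0.939'."""
    parts = row.split()
    # Date and time are two tokens, followed by the nine numeric columns.
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}: {row!r}")
    stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
    return Sample(stamp, *(float(p) for p in parts[2:]))

sample = parse_row("2014-01-05 05:33:00 968 1289 62 84 3839 94 1123 41 0.939")
print(sample.temp_c, sample.fan_rpm, sample.vddc_v)
```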
The 20 minutes before were without any change:
2014-01-05 05:33:00 968 1289 62 84 3839 94 1123 41 0.939
2014-01-05 05:34:00 968 1289 62 84 3845 93 1123 41 0.939
2014-01-05 05:40:00 968 1289 61 84 3858 97 1123 41 0.939
2014-01-05 05:53:00 968 1289 61 84 3844 95 1123 41 0.939
Then something happened to the memory usage.
Throughout the logfiles I do see occasional jumps in the memory usage,
and they don't (?) seem to coincide with pool or difficulty changes
(which are logged in the cgwatcher.log above).
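One way to check that impression is to list every jump in dedicated memory usage and see whether a difficulty change was logged shortly before or after it. The sketch below does that in Python; the 10 MB threshold and the 30-second window are arbitrary assumptions, and the sample values are copied from the two logs in this post.

```python
from datetime import datetime

def memory_jumps(samples, min_delta_mb=10):
    """samples: time-ordered (datetime, dedicated_mb) pairs.
    Returns (time, delta_mb) for every step of at least min_delta_mb."""
    jumps = []
    for (_, before), (when, after) in zip(samples, samples[1:]):
        if abs(after - before) >= min_delta_mb:
            jumps.append((when, after - before))
    return jumps

def near_any(when, events, window_s=30):
    """True if any event time lies within window_s seconds of 'when'."""
    return any(abs((when - e).total_seconds()) <= window_s for e in events)

def t(clock):
    """Helper: build a datetime on 2014-01-05 from 'HH:MM:SS'."""
    return datetime.strptime("2014-01-05 " + clock, "%Y-%m-%d %H:%M:%S")

# Dedicated memory usage [MB] of the 7850 around 05:53.
samples = [(t("05:53:16"), 1123), (t("05:53:21"), 1123),
           (t("05:53:23"), 1079), (t("05:53:26"), 1057), (t("05:53:27"), 1097)]
# Difficulty changes from cgwatcher.log closest to that time.
difficulty_changes = [t("05:51:04"), t("05:51:14"), t("05:54:25")]

for when, delta in memory_jumps(samples):
    print(when.time(), delta, near_any(when, difficulty_changes))
```

For these rows, none of the three jumps falls within 30 seconds of a logged difficulty change; a wider window or the full logs may of course tell a different story.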
Anyway, this is the first real change of values, about 2 minutes before the crash:
2014-01-05 05:53:14 968 1289 62 84 3834 94 1123 41 0.939
2014-01-05 05:53:15 968 1289 61 84 3834 95 1123 41 0.939
2014-01-05 05:53:16 968 1289 62 84 3838 94 1123 41 0.939
2014-01-05 05:53:21 968 1289 62 84 3838 63 1123 41 0.939
2014-01-05 05:53:21 968 1289 59 84 3864 58 1123 41 0.939
2014-01-05 05:53:23 968 1289 60 82 3833 51 1079 48 0.939
2014-01-05 05:53:26 968 1289 60 83 3838 51 1057 38 0.939
2014-01-05 05:53:27 968 1289 60 82 3821 79 1097 40 0.939
2014-01-05 05:53:27 968 1289 60 82 3829 79 1104 40 0.939
2014-01-05 05:53:28 968 1289 61 83 3833 84 1111 41 0.939
2014-01-05 05:53:29 968 1289 60 83 3847 86 1107 41 0.939
2014-01-05 05:53:30 968 1289 61 83 3846 84 1107 41 0.939
Apart from that, nothing special happens.
2014-01-05 05:54:00 968 1289 60 80 3808 95 1107 41 0.939
Then, first (!) and strangely, the fan speed starts to drop,
and after that the GPU temperature falls a bit:
2014-01-05 05:54:01 968 1289 60 80 3809 93 1107 41 0.939
2014-01-05 05:54:02 968 1289 60 80 3801 94 1107 41 0.939
2014-01-05 05:54:03 968 1289 60 80 3801 95 1107 41 0.939
2014-01-05 05:54:04 968 1289 60 79 3783 94 1107 41 0.939
2014-01-05 05:54:05 968 1289 60 79 3780 94 1107 41 0.939
2014-01-05 05:54:06 968 1289 60 79 3794 95 1107 41 0.939
2014-01-05 05:54:07 968 1289 60 79 3782 94 1107 41 0.939
2014-01-05 05:54:08 968 1289 60 79 3781 93 1107 41 0.939
2014-01-05 05:54:09 968 1289 60 79 3786 93 1107 41 0.939
2014-01-05 05:54:10 968 1289 60 78 3748 93 1107 41 0.939
2014-01-05 05:54:11 968 1289 60 78 3750 94 1107 41 0.939
2014-01-05 05:54:12 968 1289 60 77 3740 93 1107 41 0.939
2014-01-05 05:54:13 968 1289 59 77 3745 92 1107 41 0.939
2014-01-05 05:54:14 968 1289 59 77 3741 92 1107 41 0.939
2014-01-05 05:54:15 968 1289 59 77 3739 93 1107 41 0.939
2014-01-05 05:54:16 968 1289 59 76 3724 93 1107 41 0.939
2014-01-05 05:54:17 968 1289 59 76 3711 93 1107 41 0.939
2014-01-05 05:54:18 968 1289 59 76 3718 93 1107 41 0.939
2014-01-05 05:54:19 968 1289 59 76 3727 94 1107 41 0.939
2014-01-05 05:54:20 968 1289 59 75 3700 93 1107 41 0.939
2014-01-05 05:54:21 968 1289 59 75 3690 94 1107 41 0.939
2014-01-05 05:54:22 968 1289 59 74 3674 93 1107 41 0.939
2014-01-05 05:54:23 968 1289 59 74 3670 93 1107 41 0.939
2014-01-05 05:54:24 968 1289 59 74 3678 94 1107 41 0.939
2014-01-05 05:54:25 968 1289 59 74 3677 93 1107 41 0.939
2014-01-05 05:54:26 968 1289 59 73 3657 93 1107 41 0.939
2014-01-05 05:54:27 968 1289 59 73 3659 93 1107 41 0.939
2014-01-05 05:54:28 968 1289 59 72 3647 93 1107 41 0.939
2014-01-05 05:54:29 968 1289 59 72 3633 93 1107 41 0.939
2014-01-05 05:54:30 968 1289 59 72 3623 92 1107 41 0.939
2014-01-05 05:54:31 968 1289 58 72 3623 92 1107 41 0.939
2014-01-05 05:54:32 968 1289 58 71 3609 93 1107 41 0.939
2014-01-05 05:54:33 968 1289 59 71 3606 93 1107 41 0.939
2014-01-05 05:54:34 968 1289 58 70 3596 93 1107 41 0.939
2014-01-05 05:54:35 968 1289 59 70 3585 94 1107 41 0.939
2014-01-05 05:54:36 968 1289 58 69 3552 93 1107 41 0.939
2014-01-05 05:54:37 968 1289 58 69 3536 93 1107 41 0.939
2014-01-05 05:54:38 968 1289 58 68 3524 93 1107 41 0.939
2014-01-05 05:54:39 968 1289 58 68 3514 93 1107 41 0.939
2014-01-05 05:54:40 968 1289 58 67 3488 93 1107 41 0.939
2014-01-05 05:54:41 968 1289 58 67 3460 93 1107 41 0.939
2014-01-05 05:54:42 968 1289 58 66 3446 93 1107 41 0.939
2014-01-05 05:54:43 968 1289 58 66 3433 94 1107 41 0.939
2014-01-05 05:54:44 968 1289 58 65 3414 94 1107 41 0.939
2014-01-05 05:54:45 968 1289 58 65 3401 94 1107 41 0.939
2014-01-05 05:54:46 968 1289 58 64 3378 94 1107 41 0.939
2014-01-05 05:54:47 968 1289 58 64 3350 94 1107 41 0.939
2014-01-05 05:54:48 968 1289 58 63 3336 94 1107 41 0.939
2014-01-05 05:54:49 968 1289 58 63 3320 94 1107 41 0.939
2014-01-05 05:54:50 968 1289 58 62 3301 94 1107 41 0.939
2014-01-05 05:54:51 968 1289 58 62 3286 94 1107 41 0.939
2014-01-05 05:54:52 968 1289 58 61 3259 94 1107 41 0.939
2014-01-05 05:54:53 968 1289 58 61 3227 94 1107 41 0.939
2014-01-05 05:54:54 968 1289 58 61 3223 94 1107 41 0.939
2014-01-05 05:54:55 968 1289 58 61 3225 94 1107 41 0.939
2014-01-05 05:54:57 968 1289 58 60 3208 94 1107 41 0.939
2014-01-05 05:54:58 968 1289 58 60 3196 95 1107 41 0.939
2014-01-05 05:54:59 968 1289 58 59 3182 93 1107 41 0.939
2014-01-05 05:55:00 968 1289 58 59 3165 93 1107 41 0.939
2014-01-05 05:55:01 968 1289 58 59 3161 93 1107 41 0.939
2014-01-05 05:55:02 968 1289 58 59 3163 93 1107 41 0.939
2014-01-05 05:55:03 968 1289 58 58 3135 92 1107 41 0.939
2014-01-05 05:55:04 968 1289 58 58 3103 92 1107 41 0.939
2014-01-05 05:55:05 968 1289 58 57 3082 94 1107 41 0.939
2014-01-05 05:55:06 968 1289 58 57 3065 94 1107 41 0.939
2014-01-05 05:55:07 968 1289 58 57 3058 93 1107 41 0.939
2014-01-05 05:55:08 968 1289 58 57 3061 94 1107 41 0.939
2014-01-05 05:55:09 968 1289 58 56 3047 94 1107 41 0.939
2014-01-05 05:55:10 968 1289 58 56 3030 94 1107 41 0.939
2014-01-05 05:55:11 968 1289 58 55 3002 93 1107 41 0.939
2014-01-05 05:55:12 968 1289 58 55 2969 93 1107 41 0.939
2014-01-05 05:55:13 968 1289 58 54 2946 92 1107 41 0.939
2014-01-05 05:55:14 968 1289 58 54 2925 92 1107 41 0.939
2014-01-05 05:55:15 968 1289 58 54 2922 92 1107 41 0.939
Then the GPU load collapses within 5 seconds,
almost all memory is suddenly freed,
the GPU load rises again to 64% for five seconds ...
... then it drops to zero, and the clock speeds go idle:
2014-01-05 05:55:16 968 1289 58 54 2921 92 1107 41 0.939
2014-01-05 05:55:17 968 1289 58 53 2907 92 1107 41 0.939
2014-01-05 05:55:18 968 1289 58 53 2885 92 1107 41 0.939
2014-01-05 05:55:19 968 1289 58 53 2877 94 1107 41 0.939
2014-01-05 05:55:20 968 1289 55 53 2887 80 1107 41 0.939
2014-01-05 05:55:21 968 1289 54 53 2891 60 1107 41 0.939
2014-01-05 05:55:22 968 1289 54 53 2890 47 1107 41 0.939
2014-01-05 05:55:23 968 1289 54 53 2889 29 1107 41 0.939
2014-01-05 05:55:24 968 1289 53 53 2889 9 1107 41 0.939
2014-01-05 05:55:25 968 1289 53 53 2887 64 79 23 0.939
2014-01-05 05:55:26 968 1289 53 53 2887 64 80 23 0.939
2014-01-05 05:55:27 968 1289 53 53 2889 64 80 23 0.939
2014-01-05 05:55:28 968 1289 52 53 2890 64 76 23 0.939
2014-01-05 05:55:29 968 1289 52 53 2892 64 76 23 0.939
2014-01-05 05:55:30 300 150 52 53 2890 0 76 23 0.825
2014-01-05 05:55:32 300 150 50 53 2891 0 76 23 0.825
2014-01-05 05:55:33 300 150 49 53 2891 0 76 23 0.825
2014-01-05 05:55:36 300 150 48 53 2891 0 76 23 0.825
There it stays, cooling down ...
2014-01-05 05:58:17 300 150 31 53 2847 0 76 23 0.825
2014-01-05 06:00:27 300 150 30 53 2828 0 76 23 0.825
2014-01-05 06:00:50 300 150 29 53 2827 0 76 23 0.825
2014-01-05 06:30:00 300 150 28 53 2805 0 86 23 0.825
Until I find it half an hour later, because the output of my little port 4028 watcher tool looks suspicious.
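To see the order of events at a glance, the milestones from the GPU-Z rows can be merged with the cgwatcher messages into one timeline. This is only a sketch: the thresholds are guesses at what counts as a "milestone", the rows are a hand-picked subset of the ones above starting at 05:54:00, and it assumes both tools stamp their logs from the same system clock.

```python
from datetime import datetime

def t(clock):
    """Helper: build a datetime on 2014-01-05 from 'HH:MM:SS'."""
    return datetime.strptime("2014-01-05 " + clock, "%Y-%m-%d %H:%M:%S")

# (time, core MHz, fan %, GPU load %, dedicated MB) from the 7850 log
rows = [
    (t("05:54:00"), 968, 80, 95, 1107),
    (t("05:55:19"), 968, 53, 94, 1107),
    (t("05:55:20"), 968, 53, 80, 1107),
    (t("05:55:24"), 968, 53, 9, 1107),
    (t("05:55:25"), 968, 53, 64, 79),
    (t("05:55:30"), 300, 53, 0, 76),
]

def first_time(rows, predicate):
    """Timestamp of the first row for which predicate(row) is true, else None."""
    return next((row[0] for row in rows if predicate(row)), None)

timeline = [
    (first_time(rows, lambda r: r[2] <= 80), "7850: fan at or below 80%"),
    (first_time(rows, lambda r: r[3] < 90), "7850: GPU load below 90%"),
    (first_time(rows, lambda r: r[4] < 500), "7850: dedicated memory below 500 MB"),
    (first_time(rows, lambda r: r[1] < 968), "7850: core clock below 968 MHz"),
    (t("05:55:25"), "cgwatcher: GPU0 (7900 series) status is INACTIVE"),
    (t("05:55:28"), "cgwatcher: process closed"),
]

# Skip milestones that never occurred, then print in chronological order.
for when, label in sorted(e for e in timeline if e[0] is not None):
    print(when.time(), label)
```

With these rows, the 7850's load starts to fall at 05:55:20, five seconds before cgwatcher reports GPU0 as INACTIVE at 05:55:25. That alone does not prove which card failed first, since neither log shows how quickly each tool notices a change.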
Hmmm ... that was all 7850 data, from the card which I am currently trying to undervolt.
Unfortunately, I do not have parallel 7970 data :-(
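One possible way to get both cards into a single log is to poll cgminer itself rather than GPU-Z. The Python sketch below assumes cgminer was started with its API enabled (`--api-listen`) on the default port 4028, and that the `devs` command returns a JSON reply with a `DEVS` list holding one entry per GPU; that is my reading of the cgminer API documentation, so please verify it against your cgminer version. The function names and the output layout are made up for this example.

```python
import json
import socket
from datetime import datetime

def query_cgminer(command, host="127.0.0.1", port=4028, timeout=5.0):
    """Send one JSON command to the cgminer API and return the decoded reply.
    Assumes cgminer closes the connection after answering."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(json.dumps({"command": command}).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # The reply may end with a NUL byte, which json.loads would reject.
    raw = b"".join(chunks).rstrip(b"\x00")
    return json.loads(raw.decode("ascii", errors="replace"))

def devs_to_lines(reply, now):
    """One timestamped 'key=value; ...' line per device in the reply.
    No particular field names are assumed; whatever cgminer reports is logged."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    lines = []
    for dev in reply.get("DEVS", []):
        fields = "; ".join(f"{key}={value}" for key, value in dev.items())
        lines.append(f"{stamp}; {fields}")
    return lines
```

Calling `devs_to_lines(query_cgminer("devs"), datetime.now())` once per second and appending the result to a file would give one timestamped line per card, so the 7850 and the 7970 could be compared second by second.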
Perhaps the 7970 malfunctioned first?
For now, I have raised the voltage of the 7850 a little bit.
Let's see if that is enough for stability.
Which card do you think was the cause of the above crash?