I managed to reproduce the performance state bug, and here's how I did it.
Since it only started happening recently, I thought about what I had changed on the system.
The only change was creating an xorg.conf entry for all the cards so that I could change the fan speeds with the cool_cpu2.sh script.
That means an X server now starts at boot, which it didn't before.
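For anyone who wants to set up the same thing, a minimal version of what I mean (my exact file may differ slightly; one way to generate it is nvidia-xconfig) is:

```sh
sudo nvidia-xconfig --enable-all-gpus --cool-bits=4
```

which puts a Device section like this in xorg.conf for every card (BusID values will differ per system; Coolbits 4 enables fan control through nvidia-settings):

```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"
    Option     "Coolbits" "4"
EndSection
```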
If you leave the X server running, there are no problems.
If you stop it with "sudo service lightdm stop", the hashrate bug shows up the next time ccminer is stopped with Ctrl-C.
If you start lightdm again, THE ISSUE IS GONE, without the need to reboot :-)
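If anyone wants to watch what the cards are doing while reproducing this, the performance state and clocks can be polled with nvidia-smi, for example:

```sh
# poll P-state and clocks every 5 seconds while stopping lightdm / Ctrl-C'ing ccminer
nvidia-smi --query-gpu=index,pstate,clocks.sm,clocks.mem --format=csv -l 5
```

If the X server theory is right, the affected card should get stuck at a low performance level (higher P-number, lower SM clock) after the Ctrl-C instead of returning to full clocks when ccminer restarts.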
Very interesting. If you didn't start an X server at boot, I presume you were in run level 3 and possibly headless.
Your result suggests the presence of the X server has some effect on the problem. It doesn't explain all the
failure modes (i.e. it also happens on Windows, and it has happened to me with an X server running), but the
ability to reproduce it is a big step.
All of my Linux systems use run level 5 and therefore start X at boot. My experience with the degradation started
before I fudged xorg.conf to get coolbits on my second card, and it occurred on my primary card, i.e. the one with
the X server running. I don't recall whether the problem predates adding coolbits to my primary card. All this to say
that the degradation can also occur with an X server running.
I don't know how Nvidia manages card performance levels, whether the card's firmware is responsible for reacting
to load or whether the driver is supposed to tell it to change levels. Either way, it's not happening in some
cases.
While in the degraded state my display still works normally and the card can still hash, so I assume the card is
still sane and would probably hash at full speed if it switched to the higher performance level.
So the question is how performance levels are managed and why that process is failing. My guess is
that performance levels are driven by non-CUDA card functions and that CUDA applications can't, or don't,
affect performance levels directly. In the absence of another trigger to raise the performance level, the CUDA app
is left to run on a degraded GPU. Starting an X server may be such a trigger.
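One cheap experiment along those lines, for anyone with X running: force PowerMizer to prefer maximum performance and see whether the degraded state clears without restarting X. I haven't verified that this helps, so treat it as a guess:

```sh
# 1 = "Prefer Maximum Performance"; requires a running X server
nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
```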
Is there a way that ccminer could set the pstate of the GPUs at launch? That would confirm whether the GPU is
responsive to pstate changes and, if it works, would be a good workaround for the problem. It should also work on
both Windows and Linux.
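As far as I know there's no public API to set the P-state directly, but NVML can at least read it, and on cards that allow it the application clocks can be pinned, which would have much the same effect. Below is a rough, untested standalone sketch of what ccminer could try at startup; nvmlDeviceSetApplicationsClocks needs root/admin rights and isn't supported on every GeForce board, so it may simply fail on consumer cards.

```c
/* Hypothetical standalone sketch, not actual ccminer code.
 * Build: gcc pstate_test.c -o pstate_test -lnvidia-ml */
#include <stdio.h>
#include <nvml.h>

/* Read the current P-state and, where the driver permits it, pin the
 * application clocks to the highest supported values. */
static int force_max_perf(unsigned int gpu_index)
{
    nvmlDevice_t dev;
    nvmlPstates_t pstate;
    unsigned int mem[16], gfx[128];
    unsigned int nmem = 16, ngfx = 128;
    unsigned int best_mem = 0, best_gfx = 0, i;

    if (nvmlDeviceGetHandleByIndex(gpu_index, &dev) != NVML_SUCCESS)
        return -1;

    if (nvmlDeviceGetPerformanceState(dev, &pstate) == NVML_SUCCESS)
        printf("GPU %u is in P%d\n", gpu_index, (int)pstate);

    /* Find the highest supported memory clock, then the highest
     * graphics clock available at that memory clock. */
    if (nvmlDeviceGetSupportedMemoryClocks(dev, &nmem, mem) != NVML_SUCCESS)
        return -1;
    for (i = 0; i < nmem; i++)
        if (mem[i] > best_mem) best_mem = mem[i];

    if (nvmlDeviceGetSupportedGraphicsClocks(dev, best_mem, &ngfx, gfx) != NVML_SUCCESS)
        return -1;
    for (i = 0; i < ngfx; i++)
        if (gfx[i] > best_gfx) best_gfx = gfx[i];

    /* Needs root/admin; many GeForce boards report NVML_ERROR_NOT_SUPPORTED. */
    if (nvmlDeviceSetApplicationsClocks(dev, best_mem, best_gfx) != NVML_SUCCESS)
        return -1;

    printf("GPU %u application clocks pinned to %u MHz mem / %u MHz gfx\n",
           gpu_index, best_mem, best_gfx);
    return 0;
}

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    force_max_perf(0);          /* GPU 0 only, for illustration */
    nvmlShutdown();
    return 0;
}
```

Even if the set call fails with NVML_ERROR_NO_PERMISSION or NVML_ERROR_NOT_SUPPORTED, that at least tells us what applications are allowed to do about performance levels.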
It's a lot of speculation about an architecture I know little about, but maybe there is something useful here.