Okay, I have just replaced the ailing PSU in my main development PC, so I can put more stress on the GPUs again without the machine shutting down unexpectedly.
So that regression really is bad: 254 kHash/s down to 204 kHash/s with the same kernel launch parameters between the 2013-12-18 version and current github.
That's a 20% drop in performance. I might play around a bit to see what I can find.
I did not find the same problem with the T kernel, even though it underwent very similar changes!
EDIT1: the majority of the discrepancy stems from my redefinition of what "warp" means in Dave's Kepler kernel (to be more in line with the CUDA definition of a warp). Hence the equivalent launch config for the current github release has to use four times the number of blocks to be comparable, so I have to go from -l K7x32 to -l K28x32. With that, the drop is only from 254 kHash/s to 220 kHash/s (about 13%). Still bad, but not quite as bad.
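For anyone who wants to redo the conversion and the percentages, here is a rough sketch in Python. It assumes the -l string has the shape seen above (kernel letter, block count, "x", second number) and that only the block count needs the factor of four. The helper names scale_launch_config and relative_drop are made up for illustration and are not part of the miner.

```python
import re


def scale_launch_config(config: str, factor: int = 4) -> str:
    """Multiply the block count in a '<letter><blocks>x<n>' launch string.

    Hypothetical helper: the format is inferred from 'K7x32' / 'K28x32'.
    """
    match = re.fullmatch(r"([A-Za-z])(\d+)x(\d+)", config)
    if match is None:
        raise ValueError(f"unrecognized launch config: {config!r}")
    kernel, blocks, second = match.group(1), int(match.group(2)), match.group(3)
    # Only the block count is scaled; the second number is left unchanged.
    return f"{kernel}{blocks * factor}x{second}"


def relative_drop(before: float, after: float) -> float:
    """Return the performance drop from 'before' to 'after' in percent."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    print(scale_launch_config("K7x32"))       # old config -> comparable new config
    print(f"{relative_drop(254, 204):.1f}%")  # same launch string on both versions
    print(f"{relative_drop(254, 220):.1f}%")  # after adjusting the block count
```

Run as a script, this gives K28x32 for the converted config and puts the two drops at roughly 19.7% and 13.4%.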
EDIT2: I found my "simplifications" in read_keys_direct and write_keys_direct to be the culprit. Turns out they have a huge performance impact, despite requiring far fewer instructions.
Christian