Is the miner using any CPU to enhance the speed? Btw, I've often wondered whether some of the hashes might perform better on the CPU than on the GPU. What if some functions ran on the CPU and some on the GPU for max performance? Sure, there'd be a waiting penalty from the CPU-GPU crossover as data moves between GPU RAM and system RAM, but it may be worth a try to see if something can actually be improved a lot (like +20/30/50%)... Then again, the performance hit from waiting as data moves CPU-GPU-CPU might be too large to gain anything.
Well, that was my next planned step. I'd try running SIMD on the CPU to see if it goes faster, since afaik it is the least parallelizable of all the hashes. But that'll have to wait a bit. The current improvements come from using local memory for the hashes, which can be faster than private memory for some uses. The speedup was actually a side effect of a refactoring to make the code more readable and easier to test.
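For what it's worth, here is a rough back-of-the-envelope way to check whether moving one hash stage to the CPU could pay off once the GPU RAM to system RAM round trip is counted. It is only a sketch: the function names, stage timings, bandwidth, and latency are all made-up assumptions, not measurements from the miner, and it assumes the stages run strictly one after another with no overlap between transfers and compute.

```python
# Rough serial cost model for a hybrid CPU/GPU hash chain.
# All names and numbers here are hypothetical, for illustration only.

def transfer_time(n_bytes, bandwidth_bytes_per_s, latency_s):
    """Time for one host<->device copy: fixed latency plus size / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def gpu_only_time(gpu_stage_s):
    """Time per batch when every stage stays on the GPU."""
    return sum(gpu_stage_s.values())


def hybrid_time(gpu_stage_s, offloaded, cpu_stage_s,
                n_bytes, bandwidth_bytes_per_s, latency_s):
    """Time per batch when one stage is moved to the CPU.

    The intermediate state is copied GPU -> CPU before the offloaded stage
    and CPU -> GPU after it, so the transfer cost is paid twice.
    """
    remaining_gpu = sum(t for name, t in gpu_stage_s.items() if name != offloaded)
    round_trip = 2 * transfer_time(n_bytes, bandwidth_bytes_per_s, latency_s)
    return remaining_gpu + round_trip + cpu_stage_s


def speedup(gpu_stage_s, offloaded, cpu_stage_s,
            n_bytes, bandwidth_bytes_per_s, latency_s):
    """Ratio > 1 means the hybrid split is faster than GPU-only."""
    return gpu_only_time(gpu_stage_s) / hybrid_time(
        gpu_stage_s, offloaded, cpu_stage_s,
        n_bytes, bandwidth_bytes_per_s, latency_s)


if __name__ == "__main__":
    # Hypothetical per-batch GPU timings in seconds for three chained stages.
    stages = {"stage_a": 0.004, "simd": 0.006, "stage_b": 0.002}
    batch = 1 << 20          # hashes per batch (assumed)
    state_bytes = 64         # intermediate state per hash (assumed)
    ratio = speedup(stages, "simd", cpu_stage_s=0.003,
                    n_bytes=batch * state_bytes,
                    bandwidth_bytes_per_s=8e9, latency_s=1e-5)
    print(f"hybrid speedup over GPU-only: {ratio:.2f}x")
```

In this model the split only wins if the CPU runs the stage faster than the GPU by more than the cost of the two copies, so large batches of intermediate state can easily eat the whole gain. Overlapping transfers with compute would change the picture, which is why it would need real measurements.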