Hmmm, I wonder what Professor Andersen is up to - I cannot improve my RIC CPU miner any further. Perhaps I can steal some of your ideas.
Stuff you may have already done - I moved from a trivially parallel model of independent workers to one with a shared sieve and many workers doing primality testing. Sieve generation is parallelized across roughly a quarter of the cores and overlapped with the testing of previously found candidates.
This saves a lot of memory, is good for performance, and lets you use a larger sieve without hurting throughput - but it is really annoying to get working optimally on high-core-count NUMA AMD machines, which is why I haven't released it yet. Oh, and there's the occasional deadlock. :p
This was the next bit of low-hanging fruit - and one of the last, though not quite the last yet - before having to start really thinking.
I have not yet tried that approach on the CPU, but it is the approach I took on the GPU, with one exception: there is no overlapping of the sieving and modular exponentiation computations.
If I can find a few minutes of downtime before the end of this month, I will overlap the sieving and modular exponentiation phases on the GPU and report back the results. I have also yet to take advantage of the Dynamic Parallelism and Hyper-Q features (available on both the GTX 780 Ti and GTX 750 Ti), which should help with the implementation.