The brave among you might want to try out this code repo under Linux (it's a straight fork from pooler's cpuminer with my CUDA additions).
https://github.com/cbuchner1/cpuminer
There are 4 kernels, accessible with the monikers
L - Legacy (for compute 1.x devices)
F - Fermi (for Fermi class devices, compute 2.x)
K or S - Kepler kernel (using spinlocks to guard shared memory, compute 3.0)
T - Tesla (compute 3.5)
To autotune for a specific kernel, just pass the letter representing that kernel to the -l option; otherwise it picks the kernel that matches your architecture.
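Roughly, the automatic choice amounts to the mapping in the list above. The snippet below is only a sketch of that mapping, not the code from the repo: pick_kernel is a made-up name, and what happens for compute levels outside the four listed ones (anything between 3.0 and 3.5, or newer than 3.5) is an assumption.

```c
/* Hypothetical helper, NOT taken from the repo: maps a CUDA compute
 * capability (major.minor) to the kernel letter from the list above. */
static char pick_kernel(int major, int minor)
{
    if (major <= 1)
        return 'L';   /* Legacy, compute 1.x */
    if (major == 2)
        return 'F';   /* Fermi, compute 2.x */
    if (major == 3 && minor < 5)
        return 'K';   /* Kepler, compute 3.0 (assumed for 3.1-3.4 too) */
    return 'T';       /* compute 3.5 (assumed for anything newer too) */
}
```

With the CUDA runtime, major and minor would typically come from the major and minor fields of cudaDeviceProp as filled in by cudaGetDeviceProperties. Since an older kernel can sometimes be faster, it is still worth trying the other letters by hand with -l.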
Some of the kernels (Fermi and Kepler) have been sped up a bit using optimizations I've received from a nice guy named Alex from Greece. A 5-15% speedup can be obtained. Note that sometimes a kernel for an older architecture may run at the same or better speed than the kernel for your hardware architecture.
Currently getting 52 kHash/s on GT 640M (compute 3.5) and GT 750M (compute 3.0)
Open issues:
- the Fermi kernel currently doesn't run for me on 64-bit Linux on a GT 750M with compute 3.0
- still not enough error checking is done in the CUDA code
- that Stratum parse error and the subsequent protocol freeze