EDIT: I was trying to use the tesla kernel. It doesn't appear to like Jane. I can squeeze out 2.3 KH/s on my gtx 780 using K9x2. Going to keep tinkering.
2.4 kHash with K4x4 on GTX 660Ti, I also use -C 1
Titan, Fermi and Legacy kernels don't do scrypt-jane yet.
"GPU #0: Given launch config 'K4x4' exceeds limits for 1D cache." No warnings about -C 2, but the results don't validate.
6x3 and 9x2 with -C 0 give me the best results at ~2.3 and ~2.35 kh/s. Unfortunately my machine becomes annoyingly slow with any scrypt-jane config as it stands. With K9x2 I have 40MB of free gpu memory, with K1x1 over 2GB are free, yet my computer still approaches unusable, and as bathrobehero said, -i 1 doesn't seem to help either.
I'm thinking of picking up a 4GB GT 630 for $30 to play around with. It seems to match your criteria for being efficient. 96 shaders, 128 bit bus, lots of ram. I'm curious what kind of hash rate it can pull, but mostly I want to play around with scrypt-jane without crippling my dev box.
On another note (regarding regular scrypt), I've just been using the 12-18-2013 commit from github until yesterday... but with the changes from the 20th (autotune up to 32 warps) I get another 20 kh/s with T15x32. I played around with the Kepler kernel and noticed -C 1 adds nearly 50 kh/s. Is the texture cache a possibility for Tesla kernels in the future?
cbuchner1, you are a beast!
I concur. I will have to find some LTC to send your way to show my appreciation.
4GB GT 630 for $30 : wow! good price. I have yet to enable the Fermi kernel for scrypt-jane though.
Titan kernels don't need to explicitly enable a texture for cached reading, as they automatically pull their data through this cache (look up what the __ldg intrinsic does in the latest CUDA programming guide).
I might try to figure out a way to chop the scrypt-jane kernels into a series of smaller kernel launches, which may make it less taxing on the display and also allow the use of interactive mode again.
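To illustrate the idea of chopping one long launch into several short ones, here is a minimal Python sketch. It is not cudaminer code: `mix`, `run_single_launch` and `run_chunked` are made-up names, and SHA-256 is only a toy stand-in for the real mixing step. The point is just that the state carried between short launches gives the same result as one long launch.

```python
import hashlib

def mix(state: bytes) -> bytes:
    """Toy stand-in for one iteration of the real mixing function."""
    return hashlib.sha256(state).digest()

def run_single_launch(state: bytes, iterations: int) -> bytes:
    """One long 'kernel launch' that performs all iterations in one go."""
    for _ in range(iterations):
        state = mix(state)
    return state

def run_chunked(state: bytes, iterations: int, chunk: int) -> bytes:
    """Same work split into short 'launches' of at most `chunk` iterations.

    Between launches the GPU would be free to service the display.
    """
    done = 0
    while done < iterations:
        step = min(chunk, iterations - done)
        state = run_single_launch(state, step)  # state persists between launches
        done += step
    return state

seed = b"example"
assert run_chunked(seed, 1000, 64) == run_single_launch(seed, 1000)
```

The chunk size would be the knob to trade throughput against display responsiveness; the launch overhead per chunk is ignored here.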
Titan kernels are now scrypt-jane enabled! I get 3.2 kHash/s on GTX 780Ti using -l T7x3 now. And power use is cut in half compared to LTC mining. What a pity the 780Ti doesn't have 6 Gigs of RAM, or I could use -l T14x3, doubling the speed. Someone should try this launch config with the 6 GB Geforce Titan models though. Could yield some 6 kHash/s.
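A rough back-of-the-envelope check of the memory numbers, as a Python sketch. It assumes a launch config BxW means B blocks of W warps with 32 threads per warp, and it uses the 4 MB per-thread scratchpad figure mentioned in the next paragraph. The helper name is made up for illustration, and any other GPU allocations are ignored.

```python
THREADS_PER_WARP = 32   # CUDA warp size
SCRATCHPAD_MB = 4       # per-thread scratchpad, figure taken from this thread

def scratchpad_mb(blocks, warps_per_block):
    """Total scratchpad memory in MB for a BxW launch config (hypothetical helper)."""
    threads = blocks * warps_per_block * THREADS_PER_WARP
    return threads * SCRATCHPAD_MB

for name, (b, w) in {"T7x3": (7, 3), "T14x3": (14, 3)}.items():
    print(name, scratchpad_mb(b, w), "MB")
```

Under these assumptions T7x3 needs 2688 MB, which fits in the 3 GB of a 780Ti, while T14x3 needs 5376 MB and would only fit on a 6 GB card.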
I also have a crazy idea that would basically remove the memory limitations for scrypt-jane mining. It requires joining the A and B kernels into a single kernel again and re-using the scratchpad memory on the GPU. So instead of giving each thread a unique 4 MB scratchpad, we may be able to reuse the same scratchpad memory for all non-concurrently executed thread blocks. I think this is similar to what the "intensity" parameter on the ATI cards controls when running cgminer. Unfortunately this idea might be incompatible with the texture cache, as this cache does not guarantee read/write coherency within a single kernel invocation. But hey, it could get my 780Ti's to 6 kHash/s...maybe.
EDIT: okay, I made a mistake in my thoughts here. With so few thread blocks running on the GPU, ALL of them would be executing concurrently, and hence the memory reuse concept falls flat.
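The reasoning in the EDIT can be sketched in Python as a count of how many sequential "waves" of thread blocks a launch needs. This is a simplification with a made-up helper: it assumes the 15 multiprocessors of a 780Ti and at least one resident block per multiprocessor. Scratchpads could only be shared between blocks in different waves.

```python
import math

def waves(blocks, sm_count, blocks_per_sm=1):
    """Number of sequential 'waves' needed to run all blocks (hypothetical helper)."""
    return math.ceil(blocks / (sm_count * blocks_per_sm))

SMX_780TI = 15  # multiprocessors on a GTX 780Ti

print(waves(7, SMX_780TI))    # T7x3: every block is resident at once
print(waves(14, SMX_780TI))   # T14x3: still a single wave
print(waves(60, SMX_780TI))   # only a much larger grid would run in several waves
```

With 7 or 14 blocks there is only one wave, so there are no non-concurrent blocks whose scratchpad could be reused.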
Christian