I got 150 kHash/s with a GTX 670.
I've quickly tested the 1D and 2D texture cache with a GTX 660Ti. The achieved values during hashing remain in the range of 153-155 kHash/s. Or in other words: almost identical to operation without the cache, which gives 154 kHash/s. All generated shares are valid.
So while the cache feature is not immediately useful, it may become useful as soon as we start shrinking the scrypt scratchpad.
Here is what a LOOKUP_GAP implementation does: it only saves every N'th value out of the 1024 writes to the scratchpad, thereby reducing the bandwidth needed for writes by a factor of N. One "value" here is a vector of 32 uints (128 bytes in total). 1024 * 128 bytes is your typical 128 kB scrypt scratchpad. Here's the actual C code of the plain scrypt core (without LOOKUP_GAP, so every value is written), for the programmers among you.
/* V is the scratchpad (1024 * 32 uint32_t); input and output hold 32 uint32_t each */
uint32_t X[32]; int i, j, k;
for (k = 0; k < 32; k++) X[k] = input[k];
// write phase to scratchpad
for (i = 0; i < 1024; i++) {
    memcpy(&V[i * 32], X, 128);
    xor_salsa8(&X[0], &X[16]);
    xor_salsa8(&X[16], &X[0]);
}
// read phase from scratchpad
for (i = 0; i < 1024; i++) {
    j = 32 * (X[16] & 1023);
    for (k = 0; k < 32; k++) X[k] ^= V[j + k];
    xor_salsa8(&X[0], &X[16]);
    xor_salsa8(&X[16], &X[0]);
}
for (k = 0; k < 32; k++) output[k] = X[k];
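For comparison, here is a minimal sketch of what a LOOKUP_GAP variant of that loop could look like. It is not taken from any particular miner: scrypt_core_gap, mix and the gap parameter are names made up for illustration, and xor_salsa8 is written out from the standard Salsa20/8 definition only so that the block compiles on its own. With gap set to 1 it behaves like the plain version.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCRYPT_N 1024  /* number of values in the full scratchpad */
#define ROTL(a, b) (((a) << (b)) | ((a) >> (32 - (b))))
#define QR(a, b, c, d) do { \
    b ^= ROTL(a + d, 7);  c ^= ROTL(b + a, 9); \
    d ^= ROTL(c + b, 13); a ^= ROTL(d + c, 18); } while (0)

/* B ^= Bx, then B += Salsa20/8(B), as used in scrypt's block mix. */
static void xor_salsa8(uint32_t B[16], const uint32_t Bx[16])
{
    uint32_t x[16];
    int i;
    for (i = 0; i < 16; i++) x[i] = (B[i] ^= Bx[i]);
    for (i = 0; i < 8; i += 2) {
        /* column round */
        QR(x[0], x[4], x[8], x[12]);   QR(x[5], x[9], x[13], x[1]);
        QR(x[10], x[14], x[2], x[6]);  QR(x[15], x[3], x[7], x[11]);
        /* row round */
        QR(x[0], x[1], x[2], x[3]);    QR(x[5], x[6], x[7], x[4]);
        QR(x[10], x[11], x[8], x[9]);  QR(x[15], x[12], x[13], x[14]);
    }
    for (i = 0; i < 16; i++) B[i] += x[i];
}

/* One step of the sequence: turns value i into value i + 1. */
static void mix(uint32_t X[32])
{
    xor_salsa8(&X[0], &X[16]);
    xor_salsa8(&X[16], &X[0]);
}

/* Hypothetical LOOKUP_GAP variant. V must hold (1024 / gap) * 32 uint32_t. */
void scrypt_core_gap(uint32_t *output, const uint32_t *input,
                     uint32_t *V, int gap)
{
    uint32_t X[32], T[32];
    int i, j, k, r;

    assert(gap > 0 && SCRYPT_N % gap == 0);
    memcpy(X, input, 128);

    /* write phase: store only values 0, gap, 2*gap, ... */
    for (i = 0; i < SCRYPT_N; i++) {
        if (i % gap == 0) memcpy(&V[(i / gap) * 32], X, 128);
        mix(X);
    }
    /* read phase: fetch the nearest stored value at or below j,
       then redo the j % gap steps that were skipped during writing */
    for (i = 0; i < SCRYPT_N; i++) {
        j = X[16] & (SCRYPT_N - 1);
        memcpy(T, &V[(j / gap) * 32], 128);
        for (r = 0; r < j % gap; r++) mix(T);
        for (k = 0; k < 32; k++) X[k] ^= T[k];
        mix(X);
    }
    memcpy(output, X, 128);
}
```

Running it with gap 1 and gap 4 on the same input should give identical output; only the scratchpad size (1024 / gap values) and the amount of recomputation differ.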
Unfortunately we still have to do 1024 reads from the scratchpad and perform an increased amount of computation to re-synthesize the values that we omitted during writing. For a LOOKUP_GAP of 4 we would (on average) have to rerun the pair of xor_salsa8 calls about 1.5 extra times per lookup (0 to 3 times, depending on where the requested index falls). But here's where the cache may come in handy. Say you have reduced the total number of scratchpad values from 1024 to 256 using a LOOKUP_GAP of 4. That may increase your cache hit ratio because you are now going to read each element 4 times on average, therefore boosting your effective memory bandwidth for the reads. Unfortunately the lookups are done in a completely random order, which causes a lot of cache lines to get replaced quickly.
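That trade-off can be put into numbers. A small sketch, assuming the lookup indices are uniformly random and counting one "step" as the pair of xor_salsa8 calls; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define SCRYPT_N    1024  /* lookups per hash, and values in the full scratchpad */
#define ENTRY_BYTES 128   /* one value = 32 uints */

/* Scratchpad bytes per hash when only every gap'th value is stored. */
static unsigned scratchpad_bytes(unsigned gap)
{
    assert(gap > 0 && SCRYPT_N % gap == 0);
    return (SCRYPT_N / gap) * ENTRY_BYTES;
}

/* Average recomputation steps per lookup: the offsets 0 .. gap-1 are
   assumed equally likely, so the mean is (gap - 1) / 2. */
static double avg_extra_steps(unsigned gap)
{
    assert(gap > 0);
    return (gap - 1) / 2.0;
}

/* Average number of reads that land on each stored value. */
static double reads_per_stored_entry(unsigned gap)
{
    assert(gap > 0 && SCRYPT_N % gap == 0);
    return (double)SCRYPT_N / (SCRYPT_N / gap);
}

/* Print the three figures for gap = 1, 2, 4, 8. */
void print_gap_costs(void)
{
    unsigned gap;
    for (gap = 1; gap <= 8; gap *= 2)
        printf("gap %u: %u bytes, %.1f extra steps/lookup, %.0f reads/value\n",
               gap, scratchpad_bytes(gap), avg_extra_steps(gap),
               reads_per_stored_entry(gap));
}
```

Each doubling of the gap halves the scratchpad and doubles the reads per stored value, while the recomputation cost grows roughly linearly with the gap.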
The problem is that the numbers quoted above are per hash, and your typical GPU does a few dozen to a few hundred hashes in parallel on its various multiprocessors. And pre-Kepler, each multiprocessor only has 6-8 kB of texture cache. On Kepler one SMX has 48 kB of texture cache. How are the numbers going to work out? I don't really know yet. The cache seems awfully small for what it needs to cover.
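To get a feel for how small that is, here is a back-of-the-envelope sketch. The number of hashes running on one multiprocessor is an arbitrary assumption passed in by the caller, the function names are made up, and the fraction of the working set that fits in the cache is only a crude proxy for the hit ratio one might hope for with random lookups, not a measurement.

```c
#include <assert.h>

/* Scratchpad bytes per hash for a given LOOKUP_GAP (1024 values of 128 bytes). */
static unsigned scratchpad_bytes(unsigned gap)
{
    assert(gap > 0 && 1024 % gap == 0);
    return (1024 / gap) * 128;
}

/* How many complete per-hash scratchpads fit into the cache. */
static unsigned scratchpads_in_cache(unsigned cache_bytes, unsigned gap)
{
    return cache_bytes / scratchpad_bytes(gap);
}

/* Fraction (0.0 .. 1.0) of the combined scratchpads of `hashes`
   concurrent hashes that the cache can hold at once. */
static double cache_coverage(unsigned cache_bytes, unsigned gap, unsigned hashes)
{
    double working_set, frac;
    assert(hashes > 0);
    working_set = (double)hashes * scratchpad_bytes(gap);
    frac = cache_bytes / working_set;
    return frac > 1.0 ? 1.0 : frac;
}
```

For example, with a LOOKUP_GAP of 4 and an assumed 32 hashes per multiprocessor, cache_coverage(48 * 1024, 4, 32) comes out just under 5%, and with 8 kB of cache it is below 1%. A single hash at gap 4 already needs 32 kB, so a 48 kB cache holds exactly one complete scratchpad.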
Christian