Very interesting. I get about a 2% gain on my 7950, and there I need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the difference between (mod % 2) and (mod % 4) is pretty small, about 1-2 kH/s.
My SMix call is a bit different: I simply put the sub-calls inline, so it doesn't bother with ScratchpadStore and ScratchpadMix.
Perhaps this fits nicer into the core and needs less swapping.
I have tried, unsuccessfully, to streamline SMix further, but any other way I do it, it's either all HW errors or vastly slower. Any guidance here would be appreciated.
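void SMix(ulong16 *X, __global ulong16 *V, bool flag)
{
    int i = 0;
    int idx;

    /* Fill: store X[0] and X[1] in V as a pair, then mix.
       (i ^ 256) stays nonzero until i == 256, so this runs 128 times. */
    while (i ^ 256)
    {
        V[i++] = X[0];
        V[i++] = X[1];
        neoscrypt_blkmix(X, flag);
    }

    /* Mix: i counts down from 256 in steps of 2, so this also runs 128 times. */
    do {
        /* uint word 48 of X selects one of the 128 pairs stored in V */
        idx = (((uint *)X)[48] & 0x7F) << 1;
        X[0] ^= V[idx];
        X[1] ^= V[idx+1];
        neoscrypt_blkmix(X, flag);
    } while (i -= 2);
}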
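For anyone who wants to try other loop shapes without chasing HW errors on the card, here is a rough host-side sketch in plain C that models only the control flow and indexing of the SMix above. Everything in it is an assumption for illustration: block_t stands in for ulong16 (128 bytes), stub_blkmix is a made-up deterministic mixer and not the real neoscrypt_blkmix, the flag argument is dropped, and smix_plain is just one possible rewrite with ordinary for loops. It only tells you whether a restructured loop touches V and X in the same order; it says nothing about speed on the GPU.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for ulong16: 16 x 8 bytes = 32 uint words (assumption for the model) */
typedef struct { uint32_t w[32]; } block_t;

static long mix_calls = 0;

/* Hypothetical mixer, NOT neoscrypt_blkmix: any deterministic function
   is enough to compare two loop structures against each other. */
static void stub_blkmix(block_t X[2])
{
    for (unsigned j = 0; j < 32; j++) {
        uint32_t a = X[0].w[j] * 2654435761u + X[1].w[(j + 1) % 32] + j;
        X[0].w[j] = a;
        X[1].w[j] ^= (a << 7) | (a >> 25);
    }
    mix_calls++;
}

static void xor_block(block_t *dst, const block_t *src)
{
    for (unsigned j = 0; j < 32; j++)
        dst->w[j] ^= src->w[j];
}

/* Same control flow as the posted kernel */
static void smix_model(block_t X[2], block_t *V)
{
    int i = 0;
    while (i ^ 256) {
        V[i++] = X[0];
        V[i++] = X[1];
        stub_blkmix(X);
    }
    do {
        /* ((uint *)X)[48] is word 16 of the second block: 48 = 32 + 16 */
        int idx = (int)((X[1].w[16] & 0x7F) << 1);
        xor_block(&X[0], &V[idx]);
        xor_block(&X[1], &V[idx + 1]);
        stub_blkmix(X);
    } while (i -= 2);
}

/* Candidate rewrite with plain counted loops */
static void smix_plain(block_t X[2], block_t *V)
{
    for (int i = 0; i < 128; i++) {
        V[2 * i]     = X[0];
        V[2 * i + 1] = X[1];
        stub_blkmix(X);
    }
    for (int i = 0; i < 128; i++) {
        int idx = (int)((X[1].w[16] & 0x7F) << 1);
        xor_block(&X[0], &V[idx]);
        xor_block(&X[1], &V[idx + 1]);
        stub_blkmix(X);
    }
}

/* Returns 1 when both variants make 256 mixer calls and leave X and V identical */
static int smix_variants_agree(void)
{
    static block_t Va[256], Vb[256];
    block_t Xa[2], Xb[2];

    for (int b = 0; b < 2; b++)
        for (int j = 0; j < 32; j++)
            Xa[b].w[j] = Xb[b].w[j] = (uint32_t)(b * 32 + j) * 0x9E3779B9u;

    mix_calls = 0; smix_model(Xa, Va); long calls_a = mix_calls;
    mix_calls = 0; smix_plain(Xb, Vb); long calls_b = mix_calls;

    return calls_a == 256 && calls_b == 256 &&
           memcmp(Xa, Xb, sizeof Xa) == 0 &&
           memcmp(Va, Vb, sizeof Va) == 0;
}
```

To use it, call smix_variants_agree() from a small main() and swap smix_plain for whatever variant you want to try. If it returns 0, the rewrite changed the access order or the iteration count, which on the card would probably show up as HW errors.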