Is this really the bottleneck? I'd have figured the performance penalty would come from all the object allocation on the heap, and I've never dug into the inner workings of scrypt this far. Although I have worked with assembly quite a bit, I have never done so in the context of C#, and I had assumed that the JIT compiler does a pretty good job at the instruction level, apart from the overhead of moving data in and out of the objects it so cautiously allocates.
How did you arrive at the above? If I try to look for the same piece of code in my app, I get this:
Looks to me like it's going about it in a much longer way than what you've quoted. Are we looking at the same implementation? (The code you've quoted isn't even something I could find verbatim in the implementation I used.) All of these instructions appear to accomplish nothing more than checking that the array indices are in bounds, yet the code you're quoting doesn't reference an array at all. In fact, the checks are quite redundant despite obvious possible optimizations (e.g. checking that x[0] is in bounds immediately after having verified that x[4] is). And it doesn't unroll the call to R(), something I wouldn't have expected anyway.
EDIT: Not surprisingly, I get substantially better performance by compiling this as a "release" build and not running it under the debugger: the scrypt operation takes closer to 1 second instead of 3, without changing a single line of code, even though the release build still shows all the wasteful bounds checks when I look at the disassembly.
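For reference, here is a minimal sketch of the two indexing patterns at play (not taken from the scrypt code; the method names are made up): the JIT can typically drop the per-element bounds check for the canonical array.Length loop, but generally cannot do so for offset-based indexing like the scrypt inner loop uses.

static uint SumCanonical(uint[] a)
{
    uint sum = 0;
    // The JIT recognizes this pattern: i is provably within [0, a.Length),
    // so the per-element bounds check can be eliminated.
    for (int i = 0; i < a.Length; i++)
        sum += a[i];
    return sum;
}

static uint SumWithOffset(uint[] a, int offset)
{
    uint sum = 0;
    // Here the JIT generally cannot prove that offset + i stays inside the
    // array, so each access keeps its bounds check.
    for (int i = 0; i < 16; i++)
        sum += a[offset + i];
    return sum;
}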
You are right about the bounds checks. They can be avoided with unsafe pointers, though. Also, the C# JIT doesn't inline very well.
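As a rough illustration of the pointer approach (a sketch only, and not what I actually did below, where loading the words into locals achieves much the same thing): pinning the arrays with fixed lets you index through raw pointers, which carry no bounds checks at all, at the cost of validating the offsets yourself.

// Sketch only: requires compiling with /unsafe, and assumes the caller has
// already checked that inputOffset/outputOffset leave room for 16 words.
static unsafe void CopyWords(uint[] input, int inputOffset, uint[] output, int outputOffset)
{
    fixed (uint* pIn = &input[inputOffset])
    fixed (uint* pOut = &output[outputOffset])
    {
        for (int i = 0; i < 16; i++)
            pOut[i] = pIn[i]; // pointer indexing: no bounds checks emitted
    }
}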
So I hand-optimized the inner loop:
int i;
// Load the 16 input words into locals up front; inside the loop the JIT
// works on these directly, with no array accesses and no bounds checks.
uint x0 = input[0 + inputOffset];
uint x1 = input[1 + inputOffset];
uint x2 = input[2 + inputOffset];
uint x3 = input[3 + inputOffset];
uint x4 = input[4 + inputOffset];
uint x5 = input[5 + inputOffset];
uint x6 = input[6 + inputOffset];
uint x7 = input[7 + inputOffset];
uint x8 = input[8 + inputOffset];
uint x9 = input[9 + inputOffset];
uint x10 = input[10 + inputOffset];
uint x11 = input[11 + inputOffset];
uint x12 = input[12 + inputOffset];
uint x13 = input[13 + inputOffset];
uint x14 = input[14 + inputOffset];
uint x15 = input[15 + inputOffset];
// Each pass of the loop is a Salsa20 double round: four column rounds
// followed by four row rounds (two rounds per iteration, hence i -= 2).
for (i = rounds; i > 0; i -= 2)
{
    x4 ^= R(x0 + x12, 7); x8 ^= R(x4 + x0, 9);
    x12 ^= R(x8 + x4, 13); x0 ^= R(x12 + x8, 18);
    x9 ^= R(x5 + x1, 7); x13 ^= R(x9 + x5, 9);
    x1 ^= R(x13 + x9, 13); x5 ^= R(x1 + x13, 18);
    x14 ^= R(x10 + x6, 7); x2 ^= R(x14 + x10, 9);
    x6 ^= R(x2 + x14, 13); x10 ^= R(x6 + x2, 18);
    x3 ^= R(x15 + x11, 7); x7 ^= R(x3 + x15, 9);
    x11 ^= R(x7 + x3, 13); x15 ^= R(x11 + x7, 18);
    x1 ^= R(x0 + x3, 7); x2 ^= R(x1 + x0, 9);
    x3 ^= R(x2 + x1, 13); x0 ^= R(x3 + x2, 18);
    x6 ^= R(x5 + x4, 7); x7 ^= R(x6 + x5, 9);
    x4 ^= R(x7 + x6, 13); x5 ^= R(x4 + x7, 18);
    x11 ^= R(x10 + x9, 7); x8 ^= R(x11 + x10, 9);
    x9 ^= R(x8 + x11, 13); x10 ^= R(x9 + x8, 18);
    x12 ^= R(x15 + x14, 7); x13 ^= R(x12 + x15, 9);
    x14 ^= R(x13 + x12, 13); x15 ^= R(x14 + x13, 18);
}
// Add the original input words back in (the Salsa20 feed-forward step)
// and write the result to the output block.
output[0 + outputOffset] = x0 + input[0 + inputOffset];
output[1 + outputOffset] = x1 + input[1 + inputOffset];
output[2 + outputOffset] = x2 + input[2 + inputOffset];
output[3 + outputOffset] = x3 + input[3 + inputOffset];
output[4 + outputOffset] = x4 + input[4 + inputOffset];
output[5 + outputOffset] = x5 + input[5 + inputOffset];
output[6 + outputOffset] = x6 + input[6 + inputOffset];
output[7 + outputOffset] = x7 + input[7 + inputOffset];
output[8 + outputOffset] = x8 + input[8 + inputOffset];
output[9 + outputOffset] = x9 + input[9 + inputOffset];
output[10 + outputOffset] = x10 + input[10 + inputOffset];
output[11 + outputOffset] = x11 + input[11 + inputOffset];
output[12 + outputOffset] = x12 + input[12 + inputOffset];
output[13 + outputOffset] = x13 + input[13 + inputOffset];
output[14 + outputOffset] = x14 + input[14 + inputOffset];
output[15 + outputOffset] = x15 + input[15 + inputOffset];
EDIT: Oops, I got it wrong the first time; corrected now. Inlining doesn't buy you much. Still, this is about 2x faster than the original implementation.
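For completeness, R() above is assumed to be the usual 32-bit rotate-left from the Salsa20 reference code; a plausible C# version follows. On .NET 4.5+ you can also add the AggressiveInlining attribute (from System.Runtime.CompilerServices) as a hint to the JIT, though as noted it doesn't buy much here.

// Assumed helper, matching the Salsa20 reference rotation; the attribute is
// only a hint to the JIT, not a guarantee of inlining.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint R(uint a, int count)
{
    return (a << count) | (a >> (32 - count));
}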