I'm surprised that 64-bit rotate works as well as it does; perhaps it's newer drivers. The chi step in Keccak is slow, though. And while the following doesn't matter for performance, what in the flying fuck is this?
uint tmp = N >> 1;
/* Determine the Nfactor */
while ((tmp & 1) == 0) {
    tmp >>= 1;
    Nfactor++;
}
That shit just bugs me. It's far simpler to do this:
I'm guessing it's because looping and incrementing are easier to grok than bitwise comparison? Heck, looking at what you've written, I had to work through it by hand to even see that it comes up with the same answer, but then again, C is not my native language. Today I learned the CLZ function... wouldn't a simpler formula:
or even
give the same result as flipping the number and bitwise-anding it?
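For anyone who wants to check the equivalence by machine rather than by hand, here is a minimal sketch in plain C. It is not the kernel's actual code, and it does not claim to be the exact formulas posted above. The names nfactor_loop, nfactor_clz, nfactor_popcount, clz32 and popcount32 are made up for illustration. It assumes Nfactor starts at 0 and N >= 2 (otherwise the loop never terminates). In OpenCL C, the built-ins clz() and popcount() (the latter from OpenCL 1.2) would stand in for the portable helpers.

```c
#include <stdint.h>

/* Portable count-leading-zeros for 32 bits (stand-in for OpenCL's clz()). */
static uint32_t clz32(uint32_t x) {
    uint32_t n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Portable population count (stand-in for OpenCL's popcount()). */
static uint32_t popcount32(uint32_t x) {
    uint32_t n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Loop form from the post: counts trailing zero bits of N >> 1.
   Assumption: Nfactor starts at 0 and N >= 2. */
static uint32_t nfactor_loop(uint32_t N) {
    uint32_t tmp = N >> 1;
    uint32_t Nfactor = 0;
    while ((tmp & 1) == 0) {
        tmp >>= 1;
        Nfactor++;
    }
    return Nfactor;
}

/* CLZ form: matches the loop only when N is a power of two. */
static uint32_t nfactor_clz(uint32_t N) {
    return 30u - clz32(N);
}

/* "Flip and AND" form: ~t & (t - 1) keeps exactly the bits below the
   lowest set bit of t, so its popcount is the trailing-zero count.
   Matches the loop for any N >= 2. */
static uint32_t nfactor_popcount(uint32_t N) {
    uint32_t t = N >> 1;
    return popcount32(~t & (t - 1u));
}
```

For N = 1024 all three return 9. They only diverge when N is not a power of two: for N = 12 the loop and the popcount form give 1, while the CLZ form gives 2. If N is always a power of two, as the loop seems to expect, the difference never shows up.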