After trying to fold the K values into the W calculations, I ran into some trouble... I am not seeing how to do it. Let's say we first calculate W20 = (constant1 + s0); after this we can calculate W22 = (constant2 + s0). Now, if we fold the K addition into the W calculation, W20 = (constant1 + K20 + s0) = (constant3 + s0). When we then try to calculate W22 = (constant2 + s0), the s0 term for W22 is based on W20 *without* the K addition, so we somehow need that original value. You could store both W20 = (constant1 + s0) and W20K = (W20 + K20), but that uses the same number of ops as doing the K addition in the round, plus more storage. Let me know if I misunderstood you or am missing something.
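To make the dependency concrete, here is a minimal sketch (hypothetical names; I'm using the FIPS sigma naming here, where the W[t-2] term goes through σ1 — whether you call it s0 or s1 doesn't change the point, and c1/c2 stand in for the parts of the schedule that are fixed for a given block):

#include <stdint.h>

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
static uint32_t sig0(uint32_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }
static uint32_t sig1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

/* c1/c2 are the precomputed constant parts, nonce_term is the variable input,
   K20 is the round constant we tried to fold in. */
uint32_t schedule_sketch(uint32_t c1, uint32_t c2, uint32_t nonce_term, uint32_t K20)
{
    uint32_t W20  = c1 + sig0(nonce_term); /* schedule value: W22 depends on it */
    uint32_t W20K = W20 + K20;             /* round value with K folded in      */
    uint32_t W22  = c2 + sig1(W20);        /* must use W20 WITHOUT the K add    */
    return W22 ^ W20K; /* dummy use: same number of adds, one extra register */
}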
I don't quite understand the ch() optimization: according to my calculations, e4 is variable while f4 and g4 are fixed, so ch(e4, f4, g4) doesn't fit the pattern, which requires e to be one of the two fixed inputs.
Oops, I mixed up the maj and ch operations... ch() takes 1 instruction, and maj(x, y, z) = ch(x ^ y, z, y).
In mod-2 arithmetic (where AND = * and XOR = +), maj(x, y, z) = xy + xz + yz. You can see that the order of the operands does not matter in the maj operation, so any 2 of its operands can be the constant ones to save an instruction.
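For what it's worth, the identity is easy to check: both functions operate bit-by-bit, so testing the 8 all-zeros/all-ones input combinations is an exhaustive per-bit test. A minimal sketch, using the standard SHA-256 definitions of ch and maj:

#include <stdint.h>
#include <assert.h>

static uint32_t ch(uint32_t e, uint32_t f, uint32_t g)  { return (e & f) ^ (~e & g); }
static uint32_t maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (x & z) ^ (y & z); }

int main(void)
{
    /* 8 combinations of all-zeros/all-ones inputs cover every per-bit case */
    for (int m = 0; m < 8; m++) {
        uint32_t x = (m & 1) ? 0xffffffffu : 0u;
        uint32_t y = (m & 2) ? 0xffffffffu : 0u;
        uint32_t z = (m & 4) ? 0xffffffffu : 0u;
        assert(maj(x, y, z) == ch(x ^ y, z, y));
    }
    return 0;
}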
About the values I got for the W calcs: I was off. I was mixing things up; your values are correct and are what I use in my kernel (I was confusing which W values have the rots and xors).
One more thing: I have been messing with the AMD KernelAnalyzer, and it's helpful to count the actual lines in the disassembly listing (each of which is an individual scalar op) to see whether the compiler is issuing 5 ops per VLIW instruction; sometimes it only averages 4.7, or 4.9, depending. I didn't find the VLIW 'efficiency' reported anywhere, so I had to count it myself. I believe the reason VECTORS improves performance so much is that, with two effectively independent calculations, the compiler has more scheduling flexibility and averages more ops per instruction (closer to 5.0).
Thanks for the tip. In total, my kernel averages 4.946 ops per instruction (EDIT: after some minor code tweaks, it is now 4.973), so not really much room for improvement there.
EDIT AGAIN: The total number of operations per nonce (in my actual kernel) is 3359, including 5 operations (10 ops for 2 nonces: basically, getting the thread #) needed to get the nonce for the current thread. So it's 3354 for actually checking each nonce, which seems pretty close to the actual limit...
Adding in the 8 final state values: 1 add vs. the naive 8 (saves 7)
Theoretically you don't need that last add either:
if(h64 + state[7] == 0) // Yay! Money!
optimized to -> if(h64 == -state[7]) // Yay! Money!
Where state[7] is constant. So you save 8 ops instead of just 7. I don't know if this applies to GPUs specifically (perhaps it's cheaper to add than to compare to 0!?), but it's an optimization in general.
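To make the wraparound argument concrete, here's a minimal sketch (assuming 32-bit unsigned arithmetic; the state[7] value used is the standard SHA-256 IV word, but any constant demonstrates the point):

#include <stdint.h>
#include <assert.h>

int main(void)
{
    const uint32_t state7 = 0x5be0cd19u;              /* known at compile time  */
    const uint32_t neg_s7 = (uint32_t)(0u - state7);  /* folded to one literal  */
    uint32_t h64 = neg_s7;                            /* a "winning" hash word  */

    assert(h64 + state7 == 0u); /* naive test: costs one add per nonce     */
    assert(h64 == neg_s7);      /* optimized: compare against the constant */
    return 0;
}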
Yup, there's still a compare instruction, but that can be mitigated by combining -(K + H) into a single constant.
the end of my kernel looks like:
v = W[64 + 53] + W[64 + 44] + Vals[3] + Vals[7] + P2(64 + 60) + P1(64 + 60) + Ch((Vals[0] + Vals[4]) + (K[59] + W(59 + 64)) + s1(64 + 59) + ch(59 + 64), Vals[1], Vals[2]);
g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(59 + 64)) + s1(64 + 59) + ch(59 + 64));
if (v == g)...
where P1 is s0 for the W calc and P2 is s1.
Vals[0]-Vals[7] are a-h
the compiler optimizes the constants together and effectively removes the K addition
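For illustration, here's a sketch of that folding (assuming the standard SHA-256 constants K[60] = 0x90befffa and H[7] = 0x5be0cd19; since both are known at compile time, the whole expression collapses to one 32-bit literal and no K addition survives in the emitted code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t K60 = 0x90befffau; /* SHA-256 round constant, index 60 */
    const uint32_t H7  = 0x5be0cd19u; /* SHA-256 initial hash value h     */
    /* -(K[60] + H[7]) folds to a single constant: */
    const uint32_t folded = (uint32_t)(0u - (K60 + H7));
    printf("0x%08x\n", folded); /* the compiler emits just this value */
    return 0;
}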