Wow, excellent post, your analysis has inspired me.
A couple of things to add to your analysis: first, ch() requires 2 instructions.
So, 1 round = 19 ops.
As for the redundancy of the inputs:
For the first SHA, the following are completely redundant:
W16
W17
Round0
Round1
Round2
(2 W calcs @ 14 and 3 rounds @ 19)
Total savings = 85 instructions
Also, the following rounds have partial redundancy (numbers are required ops)
Round3 2 ops
W18 3.5 ops (2 nonces share all but 1 bit so calculations can be shared)
W19 1 op
W20 6 ops
W21 6 ops
W22 6 ops
W23 7 ops
W24 7 ops
W25 11 ops
W26 11 ops
W27 11 ops
W28 11 ops
W29 11 ops
W30 11 ops
W31 11 ops
W32 11 ops
(1 round @ 19 and 15 W calcs @ 14 naive = 229 instructions)
Total savings = 93.5 instructions
For the second SHA, the following have partial redundancy:
Round0 1 op
W16 2 ops
W17 2 ops
W18 8 ops
W19 8 ops
W20 8 ops
W21 8 ops
W22 5 ops
W23 13 ops
W24 11 ops
W25 11 ops
W26 11 ops
W27 11 ops
W28 11 ops
W29 11 ops
W30 11 ops
W31 13 ops
(1 round @ 19 and 16 W calcs at 14 naive = 243 instructions)
Total savings = 106 instructions
If you only want to check whether the least significant DWORD is 0, the last 3 rounds and their W calculations can be skipped at the end. The 4th-last e calc can also be skipped.
Total Savings = 100 instructions
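Here is a rough sketch of why the last 3 rounds don't matter for the final H value: each round just shifts e down through f, g and h, so the h you end up with after round 64 is the e that came out of round 61. The K and W values below are random placeholders (an assumption for illustration, since the shift property doesn't depend on them), and the function names are my own, not anything from a kernel.

```python
import random

MASK = 0xFFFFFFFF

def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, k, w):
    """One SHA-256 compression round on the 8 working registers."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & MASK & g)
    t1 = (h + s1 + ch + k + w) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    # New a and new e are computed; everything else just shifts down.
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def run_rounds(state, ks, ws, count):
    """Apply the first `count` rounds to `state`."""
    for i in range(count):
        state = sha256_round(state, ks[i], ws[i])
    return state

# Placeholder inputs (assumption): random words stand in for K, W and the start state.
rng = random.Random(1)
ks = [rng.getrandbits(32) for _ in range(64)]
ws = [rng.getrandbits(32) for _ in range(64)]
start = tuple(rng.getrandbits(32) for _ in range(8))

after_61 = run_rounds(start, ks, ws, 61)
after_64 = run_rounds(start, ks, ws, 64)

e_after_61 = after_61[4]   # register e once round 61 is done
h_after_64 = after_64[7]   # register h after all 64 rounds
print(e_after_61 == h_after_64)
```

So once round 61 has produced its e, rounds 62 to 64 (and the W values they would consume) add nothing new to the one register being checked.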
As you said, "adding the 8 values to the running totals at the end of the first hash could be merged into the K values for the second hash" and only the H value needs to be known for mining, so we can skip 7 of the additions after the second hash.
Total Savings = 15 instructions
Finally, the Maj() calculation can be reduced by 1 op for each of rounds 4 and 5 of the first SHA and for rounds 1 and 2 of the second SHA, since maj(x,y,z) = ch(x^y, z, y) and x and y are constant for those rounds, so x^y can be precomputed.
Total Savings = 4 instructions
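For anyone who wants to double-check the Ch/Maj relationship, here is a small sketch that verifies maj(x, y, z) = ch(x ^ y, z, y) for every bit combination, and then shows the saving when x and y are constant. The constants X and Y and the helper maj_with_const_xy are made up for illustration only.

```python
from itertools import product

MASK = 0xFFFFFFFF

def ch(x, y, z):
    """SHA-256 Ch: each bit of x selects the matching bit of y (1) or z (0)."""
    return ((x & y) ^ (~x & z)) & MASK

def maj(x, y, z):
    """SHA-256 Maj: bitwise majority of x, y, z."""
    return (x & y) ^ (x & z) ^ (y & z)

# Both functions are bitwise, so all-zero and all-one words cover
# every possible combination of input bits.
identity_holds = all(
    maj(x, y, z) == ch(x ^ y, z, y)
    for x, y, z in product((0, MASK), repeat=3)
)

# Hypothetical constant inputs: when x and y are fixed, x ^ y is computed
# once up front, leaving a single select per round.
X, Y = 0x6A09E667, 0xBB67AE85
XY = X ^ Y  # precomputed outside the hot loop

def maj_with_const_xy(z):
    return ch(XY, z, Y)

sample_z = 0x12345678
print(identity_holds, maj_with_const_xy(sample_z) == maj(X, Y, sample_z))
```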
Grand Total Savings due to redundancy = 404 instructions
So, doing the calcs again: 14x48 + 19x64 + 8 = 1896 per SHA
1896 x 2 = 3792, and 3792 - 404 = 3388 instructions...
The latest version of phatk uses 1358 groups (5 instructions each), 4 of which are only executed when a nonce is found. That leaves 1354 x 5 = 6770 instructions for 2 nonces, or 3385 instructions per nonce.
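If you want to redo the tally yourself, the arithmetic above can be sketched like this. The variable names are mine; the numbers are the ones quoted in this post.

```python
W_OPS = 14        # naive ops per W calculation
ROUND_OPS = 19    # ops per round
FINAL_ADDS = 8    # additions to the running totals at the end of a hash

# Naive cost of one SHA: 48 W calcs, 64 rounds, 8 final additions.
naive_per_sha = W_OPS * 48 + ROUND_OPS * 64 + FINAL_ADDS

# Two SHAs per nonce, minus the redundancy savings claimed above.
SAVINGS = 404
estimate_per_nonce = 2 * naive_per_sha - SAVINGS

# phatk: 1358 VLIW groups of 5 instructions, 4 groups only run when a
# nonce is found, and each pass handles 2 nonces.
phatk_per_nonce = (1358 - 4) * 5 // 2

print(naive_per_sha, estimate_per_nonce, phatk_per_nonce)
```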
I know I must have left something out, since I haven't even implemented all of this and I know my kernel is not 100% efficient. If anyone sees any more possible redundancy, let me know and I will try to add it to the kernel.
The part of your post that has made me think the most is the K values being added to the W values and H values (this could be especially helpful when the W calculation already has a constant).
If this helps my kernel, expect a donation.
@Soros
AMD uses an architecture that executes instructions in groups of 5 (VLIW5). It sounds better for them to say a card has 720 stream processors which each execute a single instruction than to say it has 144 stream processors which execute groups of 5 instructions, even though the latter is technically more accurate.