
Topic: Theoretical limit on hashing speed (Read 2096 times)

member
Activity: 70
Merit: 18
August 03, 2011, 08:33:26 PM
#12
Quote
After trying to include the K values into the W calculations, I ran into some trouble...  I am not getting how to do it.  Let's say we are first calculating W20, which = (constant1 + s0); after this we can calculate W22 = (constant2 + s0).  Now, if we include the K calculation into the W calculation, W20 = (constant1 + K20 + s0) = (constant3 + s0).  We now try to calculate W22 = (constant2 + s0), but since s0 for W22 is based on W20 without the K addition, we somehow need to get that value.  I would say that you can store W20 = (constant1 + s0) and W20K = (W20 + K20), but this uses the same # of ops as doing it in the round, plus uses more storage.  Let me know if I misunderstood you or am missing something.
Ah crap, you are right.  You can't save the addition W+K, because you need W alone to compute the subsequent W values.  Only for the fixed W can you precompute both W and W+K, so I have to revise my estimate upward by 16 operations for the first hash (W[18] through W[33]) and by 24 for the second hash (W[0] through W[7] and W[16] through W[32]).  But the K values can still be merged with the h values for the first four rounds of the second hash, so W+K can be skipped for W[0] through W[3] after all.

As for ma() and ch(), I did a truth table to triple-check, and I think I finally understand it.  It looks like this can be applied in round 5 of the first hash and round 2 of the second hash, so that's 2 operations saved.

Removing the final add, the new grand total is:
First hash W values: 538 vs. naive 672  (saves 134)
First hash 64 rounds: 1138 vs. naive 1216  (saves 78)
Add in 8 values: 8  (saves 0)
Second hash W values: 575 vs. naive 672  (saves 97)
Second hash 64 rounds: 1109 vs. naive 1216  (saves 107)
Add in 8 values: 0 vs. naive 8   (saves 8)

Grand total: 3368 vs. naive 3792 ops per nonce.

newbie
Activity: 52
Merit: 0
August 03, 2011, 01:24:28 PM
#11
After trying to include the K values into the W calculations, I ran into some trouble...  I am not getting how to do it.  Let's say we are first calculating W20, which = (constant1 + s0); after this we can calculate W22 = (constant2 + s0).  Now, if we include the K calculation into the W calculation, W20 = (constant1 + K20 + s0) = (constant3 + s0).  We now try to calculate W22 = (constant2 + s0), but since s0 for W22 is based on W20 without the K addition, we somehow need to get that value.  I would say that you can store W20 = (constant1 + s0) and W20K = (W20 + K20), but this uses the same # of ops as doing it in the round, plus uses more storage.  Let me know if I misunderstood you or am missing something.

Quote
I don't quite understand the ch() optimization because according to my calculations, e4 is variable while f4 and g4 are fixed, so ch(e4, f4, g4) doesn't fit into the pattern, which requires e to be one of the two fixed inputs.
oops, I mixed up the maj and ch operations... ch() takes 1 instruction, and maj(x,y,z) = ch(x^y, z, y).
In mod-2 arithmetic (where and = * and xor = +),
maj(x, y, z) = xy + xz + yz, so you can see that the order does not matter in the maj operation, and therefore any 2 operands can be constant to save an instruction.
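To make the identity concrete, here's a minimal C check (a standalone sketch, not from any kernel; ch and maj are written out in their standard bitwise forms):

Code:
#include <stdint.h>
#include <assert.h>

/* ch selects y where x has 1-bits and z where x has 0-bits */
static uint32_t ch(uint32_t x, uint32_t y, uint32_t z)  { return (x & y) ^ (~x & z); }
/* maj takes the per-bit majority vote of its three inputs */
static uint32_t maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (x & z) ^ (y & z); }

int main(void) {
    /* both functions are bitwise, so the 8 single-bit cases prove the
       identity for all 32-bit words */
    for (uint32_t x = 0; x <= 1; x++)
        for (uint32_t y = 0; y <= 1; y++)
            for (uint32_t z = 0; z <= 1; z++)
                assert(maj(x, y, z) == ch(x ^ y, z, y));
    return 0;
}

With the identity in this direction, whenever two of maj's operands are constant, both x^y and y on the right-hand side are constants, so the xor folds away at compile time.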

About the values I got for the W calcs: I was off.  I was mixing up which W values have the rots and xors; your values are correct and are what I use in my kernel.

Quote
One more thing: I have been messing with the AMD KernelAnalyzer, and it's helpful to count the actual lines in the disassembly listing, which represent individual scalar ops, to determine whether the compiler is issuing 5 ops per instruction; sometimes it only averages 4.7, or 4.9, depending.  I didn't find the VLIW 'efficiency' reported anywhere, so I had to count it myself.  I believe the reason "VECTORS" improves performance so much is that with two effectively independent calculations the compiler has more scheduling flexibility and averages more ops per instruction (closer to 5.0).

Thanks for the tip.  In total, my kernel uses 4.946 ops per instruction (EDIT: after some minor code tweaks, it is now 4.973), so there's not really much room for improvement there.
EDIT AGAIN: The total number of operations per nonce (in my actual kernel) is 3359, including 5 operations (10 ops for 2 nonces: basically, getting the thread #) needed to get the nonce for the current thread... so, 3354 for actually checking each nonce.  That seems pretty close to the actual limit...

Quote
Add in 8 values: 1 vs. naive 8   (saves 7)
Theoretically you don't need that last add either:

Code:
if(h64 + state[7] == 0) // Yay! Money!
optimized to -> if(h64 == -state[7]) // Yay! Money!

Where state[7] is constant. So you save 8 ops, instead of just 7. I don't know if this is applicable to GPUs specifically (perhaps it's cheaper to add than compare to 0!?), but it's an optimization in general.


Yup :), there's still a compare instruction, but that can be mitigated by combining -(K[60] + H[7]) into a single constant.

The end of my kernel looks like:
Code:
v = W[64 + 53] + W[64 + 44] + Vals[3] + Vals[7] + P2(64 + 60) + P1(64 + 60)
    + Ch((Vals[0] + Vals[4]) + (K[59] + W(59 + 64)) + s1(64 + 59) + ch(59 + 64), Vals[1], Vals[2]);
g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(59 + 64)) + s1(64 + 59) + ch(59 + 64));
if (v == g) ...
where P1 is s0 for the W calc and P2 is s1,
Vals[0] through Vals[7] are a through h,
and the compiler folds the constants together, effectively removing the K addition.
member
Activity: 70
Merit: 18
August 03, 2011, 05:39:29 AM
#10
Quote
Theoretically you don't need that last add either:

Code:
if(h64 + state[7] == 0) // Yay! Money!
optimized to -> if(h64 == -state[7]) // Yay! Money!

Where state[7] is constant. So you save 8 ops, instead of just 7. I don't know if this is applicable to GPUs specifically (perhaps it's cheaper to add than compare to 0!?), but it's an optimization in general.

Good catch, that's true.  On some architectures, comparing for equality is implemented as a subtract and test for zero, but I don't think the GPU is one of them.  Testing equality surely won't be slower.  If a compare-equal-to-literal instruction exists (and I think it does, as "PRED_SETE"), then it saves one add; if not, it's the same either way.
hero member
Activity: 560
Merit: 517
August 03, 2011, 04:52:49 AM
#9
Quote
Add in 8 values: 1 vs. naive 8   (saves 7)
Theoretically you don't need that last add either:

Code:
if(h64 + state[7] == 0) // Yay! Money!
optimized to -> if(h64 == -state[7]) // Yay! Money!

Where state[7] is constant. So you save 8 ops, instead of just 7. I don't know if this is applicable to GPUs specifically (perhaps it's cheaper to add than compare to 0!?), but it's an optimization in general.
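In C, the transformed test looks like this (a minimal sketch assuming 32-bit wrap-around arithmetic; STATE7 and share_found are illustrative names, with SHA-256's initial H7 filled in for concreteness):

Code:
#include <stdint.h>

#define STATE7     0x5be0cd19u              /* SHA-256 initial hash value H7 */
#define NEG_STATE7 ((uint32_t)0u - STATE7)  /* folds to one compile-time constant */

static int share_found(uint32_t h64) {
    /* equivalent to (h64 + STATE7) == 0 modulo 2^32, with no runtime add */
    return h64 == NEG_STATE7;
}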
member
Activity: 70
Merit: 18
August 03, 2011, 04:42:52 AM
#8
Okay, I did a more detailed breakdown and here's what I got.

First, I think I made a mistake in merging the 8 adds at the end of the first hash into the K values for the second hash: the W[0] through W[7] values need to be exact as input for the rest of the W calculation, so those adds from the end of the first hash must actually be performed and cannot absorb anything else.  But clearly 7 of them can be removed from the end of the second hash.


Before I continue, let me introduce a little bit of notation that I hope is not too confusing.  Let a1 denote the value of a after round 1, a2 the value of a after round 2, etc., and likewise for b, c, d, e, f, g, and h.  The initial values will be a0, b0, c0, ...  For the most part we'll be concerned with a1, e1, a2, e2, etc.

The first three rounds of the first hash can be skipped entirely, since W[0], W[1], and W[2] are fixed.  a1, e1, a2, e2, a3, and e3 are also fixed.

The fourth round of the first hash requires only two adds, since S0(a3), ma(), S1(e3), and ch() are all fixed, and so is h3 = e0.

The 5th, 6th, and 7th rounds of the first hash can each save one add, since h4, h5, and h6 are e1, e2, and e3, which are fixed.  They can be pre-added to w4, w5, and w6.  I don't quite understand the ch() optimization because according to my calculations, e4 is variable while f4 and g4 are fixed, so ch(e4, f4, g4) doesn't fit into the pattern, which requires e to be one of the two fixed inputs.

I'm not seeing anything else in the first hash, so we have a total of 19x3 + 17 + 1 + 1 + 1 = 77 operations saved, not including W calculations.
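As a sketch of how a kernel would exploit this (hypothetical names; real kernels organize it differently), the state after the fixed rounds is computed once per work unit rather than once per nonce:

Code:
#include <stdint.h>

typedef struct { uint32_t a, b, c, d, e, f, g, h; } sha256_state;

/* W[0..2] are fixed per work unit, so the first three rounds produce the
   same a3..h3 for every nonce: compute them once on the host (or at
   compile time) and hand the result to the kernel */
extern sha256_state state_after_round3;     /* per work unit, not per nonce */

void first_hash(uint32_t nonce) {
    sha256_state s = state_after_round3;    /* rounds 1-3 skipped entirely */
    /* round 4 sees the nonce as W[3]; S0(a3), ma(), S1(e3), ch(), and
       h3 = e0 are all still constants, so it costs only two adds */
    /* ... rounds 4 through 64 proceed as usual ... */
    (void)nonce; (void)s;
}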


For the second hash, the first round requires 2 adds, since everything is fixed except W[0].  Again, I'm not sure about optimizing ch() in the second round, since e1 is variable while f1 and g1 are fixed.  So that's 17 operations saved in the first round alone.

Toward the end of the second hash there is a lot of savings.  I had mistakenly thought that a64 had to be zero, when in fact it's h64 which has to be zero.  The last computation that has to be done is e61, which as you said means rounds 62, 63, and 64 do not have to be computed at all, and W[61], W[62], and W[63] also do not have to be computed.

Furthermore, a58, a59, a60, and a61 do not have to be computed either, since they are only used to compute subsequent a values.  They do not affect e values until a57 eventually turns into d60 which is input to e61.  Therefore 8 operations (S0(), ma() and 2 adds) can be saved for rounds 58, 59, 60, and 61.

So for the second hash, I am showing 17 operations for the first round, and 19x3=57 for the last three and 8x4=32 for rounds 58-61.  That's a total of 106, plus the 3x14=42 for the W calculation that I hadn't included before.
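Here is a sketch of that tail in C (S1f, chf, K2, W2, and tail_e61 are illustrative names, assumed filled in elsewhere).  Since h shifts down one register per round, h64 = e61, so the loop stops there and the a-chain updates are dropped:

Code:
#include <stdint.h>

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
static uint32_t S1f(uint32_t e) { return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25); }
static uint32_t chf(uint32_t e, uint32_t f, uint32_t g) { return (e & f) ^ (~e & g); }

/* returns e61 (== h64): rounds 58..61 update only the e-chain, and
   rounds 62..64 are never computed at all */
uint32_t tail_e61(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                  uint32_t e, uint32_t f, uint32_t g, uint32_t h,
                  const uint32_t K2[64], const uint32_t W2[64]) {
    for (int t = 57; t <= 60; t++) {        /* the rounds called 58-61 above */
        uint32_t T1 = h + S1f(e) + chf(e, f, g) + K2[t] + W2[t];
        h = g;  g = f;  f = e;  e = d + T1;
        d = c;  c = b;  b = a;              /* stale a/b/c are harmless: they
                                               would only feed rounds 62+ */
        /* skipped: a = T1 + S0(a) + maj(a,b,c) -> saves S0 (5), ma (1), 2 adds */
    }
    return e;                               /* compare against the folded -(H7) constant */
}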

So the grand total by my calculation is:
First hash W values: 522 vs. naive 672  (saves 150)
First hash 64 rounds: 1139 vs. naive 1216  (saves 77)
Add in 8 values: 8  (saves 0)
Second hash W values: 555 vs. naive 672  (saves 117)
Second hash 64 rounds: 1110 vs. naive 1216  (saves 106)
Add in 8 values: 1 vs. naive 8   (saves 7)

Grand total: 3335 vs. naive 3792 ops per nonce.

We've got some discrepancies, though I'm inclined to put a lot of faith into your actual implementation vs my scribbles on paper.  Eventually I think they ought to converge to the same number.

One more thing: I have been messing with the AMD KernelAnalyzer, and it's helpful to count the actual lines in the disassembly listing, which represent individual scalar ops, to determine whether the compiler is issuing 5 ops per instruction; sometimes it only averages 4.7, or 4.9, depending.  I didn't find the VLIW 'efficiency' reported anywhere, so I had to count it myself.  I believe the reason "VECTORS" improves performance so much is that with two effectively independent calculations the compiler has more scheduling flexibility and averages more ops per instruction (closer to 5.0).
member
Activity: 70
Merit: 18
August 03, 2011, 02:25:01 AM
#7
Oops, my mistake thinking Ch() was one instruction.

I'm showing different numbers for the W optimization, but again, my 'optimization' is on paper, so it could have errors.  Here is how mine breaks down:

S2() denotes rot(W[x-2],15)^rot(W[x-2],13)^((W[x-2])>>10U)
S3() denotes rot(W[x-15],25)^rot(W[x-15],14)^((W[x-15])>>3U)

First hash:
W16: 0
W17: 0
W18: 2  because in general S3(A xor B) = S3(A) xor S3(B), so if you compute more than 2 nonces in a kernel, you can factor the nonce into an invariant part A and a variable part B (say 0-63), pre-compute S3(B) as 64 compile-time constants, and compute S3(A) once for the whole group.  Then you only need one xor and one add for W18.  This is basically a generalization of the LSB optimization (see the sketch after the totals below).
W19: 1
W20: 6
W21: 6
W22: 6
W23: 6  to calculate S2(W[21]), which is 5 ops, plus adding it into W/K.  W[16] is fixed and can be pre-added into K.
W24: 6  (same as W23)
W25: 7  to calculate S2(W[23]), which is 5 ops, plus adding it and adding W[18].  S3(W[10]) is fixed, so can you elaborate on your 11 operations?
W26 - W32: same as W25
W33: 13 because W[17] is fixed and can be pre-added into K.
W34 to W63: 14

Total operations: 522 vs. naive 672 for savings of 150
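Here's a minimal sketch of the W18 trick flagged above (illustrative names; note that S3 as defined above, with left-rotates, is the same function as the standard sigma0 written with right-rotates, since rot(x,25) = rotr(x,7) and rot(x,14) = rotr(x,18)):

Code:
#include <stdint.h>

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

/* S3 is built only from rotates, shifts, and xors, so it is linear over
   xor: S3(a ^ b) == S3(a) ^ S3(b) */
static uint32_t S3(uint32_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }

/* Factor the nonce as base ^ offset with the low 6 bits of base zero, so
   xor and add coincide.  S3(base) is computed once per group of 64
   nonces; S3(offset) is one of 64 compile-time constants. */
uint32_t s3_of_nonce(uint32_t s3_base, uint32_t offset,
                     const uint32_t s3_offset[64]) {
    return s3_base ^ s3_offset[offset];     /* 1 xor instead of ~5 ops */
}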

For the second hash my calculations are quite different, so perhaps I'm misunderstanding how the computation is supposed to happen.  My thought was that the input is 16 uints, like this: XXXXXXXXFFFFFFFF; in other words, 8 values that are unknown and 8 values that are fixed.  W[0] through W[7] are unpredictable while W[8] through W[15] are constant.
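For reference, that layout in C (a sketch; first_hash_out stands for the eight sums coming out of the first hash):

Code:
#include <stdint.h>

/* message block for the second hash: 8 variable words (the first hash's
   output) followed by the fixed SHA-256 padding for a 256-bit message */
void build_second_block(uint32_t W[16], const uint32_t first_hash_out[8]) {
    for (int i = 0; i < 8; i++) W[i] = first_hash_out[i];  /* the X's */
    W[8] = 0x80000000u;                      /* padding: the leading 1 bit */
    for (int i = 9; i < 15; i++) W[i] = 0;   /* padding zeros */
    W[15] = 256;                             /* message length in bits */
}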

Based on this, I'm getting
W16: 7  for S3(W[1]) and adding W[0], but skipping S2(W[14]) and W[9]
W17: 7  for S3(W[2]) and adding W[1], but skipping S2(W[15]) and W[10]
W18: 13  everything but skipping adding W[11]
W19: 13  same as W18
W20: 13
W21: 13
W22: 13
W23: 8  everything but skipping S3(W[8])
W24: 7  S2(W[22]) and adding W[17], but skipping S3(W[9]) and the W[8] add
W25: 7  same as W24
W26: 7
W27: 7
W28: 7
W29: 7
W30: 7
W31: 13  everything except adding W[15]
W32 to W63: 14

Total operations: 597 vs. naive 672 for a savings of 75


I'm going to have to rethink the round savings; your post highlighted an error in my optimization, so I have to re-do that part and I'll post specifics later.
member
Activity: 70
Merit: 18
August 03, 2011, 01:20:11 AM
#6
Quote
I have a 6750, which has 144 stream processors, and I clock it at 870 MHz.  That means it can perform 870M*144*5 = 626.4 billion ALU operations per second.
On AMD's website, the HD 6750 specs indicate that it has 720 stream processors.  In your calculations you indicate only 144 stream processors, but later you multiply 144 by 5 when calculating the theoretical ALU ops per second.

Could you elaborate a little on the difference in terminology?

Yes, I encountered this, and it confused me at first.  I should have said 144 stream cores.

Generally all the cards break down as follows:
1 GPU has multiple Compute Units
Each Compute Unit has 16 Stream Cores (appears to always be 16 on all their devices)
Each Stream Core has either 4 (69xx) or 5 (all the rest) processing elements, which enables it to execute multiple scalar operations in a single VLIW instruction.

Their site indeed says 720 stream processors but I think they mean processing elements, not stream cores.

In the programming guide in Appendix D they list the Juniper LE as having 144 Stream Cores and each Stream Core has 5 processing elements.  My card shows up as Juniper within GUI Miner.
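Putting numbers on that hierarchy (a sketch; the compute-unit count is inferred from 144 stream cores at 16 per compute unit):

Code:
/* Juniper LE in both vocabularies */
enum {
    COMPUTE_UNITS     = 9,                  /* inferred: 144 / 16 */
    STREAM_CORES      = COMPUTE_UNITS * 16, /* 144, per Appendix D */
    STREAM_PROCESSORS = STREAM_CORES * 5    /* 720: AMD's spec-sheet term,
                                               i.e. processing elements */
};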
newbie
Activity: 52
Merit: 0
August 02, 2011, 09:02:39 PM
#5
Wow, excellent post!  Your analysis has inspired me.

A couple of things to add to your analysis: ch() requires 2 instructions.
So, 1 round = 19 ops

As for the redundancy of the inputs:

For the first SHA, the following are completely redundant:
W16
W17
Round0
Round1
Round2
Total Savings = 85 instructions

Also, the following rounds have partial redundancy (numbers are required ops):
Round3 2 ops
W18 3.5 ops (2 nonces share all but 1 bit, so calculations can be shared)
W19 1 op
W20 6 ops
W21 6 ops
W22 6 ops
W23 7 ops
W24 7 ops
W25 11 ops
W26 11 ops
W27 11 ops
W28 11 ops
W29 11 ops
W30 11 ops
W31 11 ops
W32 11 ops
(1 round @ 19 and 15 W calcs @ 14 naive = 229 instructions)
Total Savings = 93.5 instructions

For the second SHA, the following have partial redundancy:
Round0 1 op
W16 2 ops
W17 2 ops
W18 8 ops
W19 8 ops
W20 8 ops
W21 8 ops
W22 5 ops
W23 13 ops
W24 11 ops
W25 11 ops
W26 11 ops
W27 11 ops
W28 11 ops
W29 11 ops
W30 11 ops
W31 13 ops
(1 round @ 19 and 16 W calcs at 14 naive = 243 instructions)
Total Savings = 106 instructions

The last 3 rounds and W calculations can be skipped at the end (if you only want to check whether the least significant DWORD is 0); also, the 4th-from-last e calc can be skipped.
Total Savings = 100 instructions

As you said, "adding the 8 values to the running totals at the end of the first hash could be merged into the K values for the second hash," and only the H value needs to be known for mining, so we can skip 7 of the additions after the second hash.
Total Savings = 15 instructions

Finally, the Ch() calculation can be reduced by 1 for each of rounds 4 and 5 of the first SHA and for rounds 1 and 2 of the second SHA since ch(x,y,z) = maj(x^y, z, y) and x and y are constant for those rounds

Total Savings = 4 instructions

Grand Total Savings due to redundancy = 404 instructions

So, doing the calcs again: 14x48 + 19x64 + 8 = 1896 per SHA.
1896 x 2 = 3792, and 3792 - 404 = 3388 instructions...

The latest version of phatk uses 1358 instruction groups (5 instructions each), 4 of which are only executed when a nonce is found, which leaves 1354 * 5 = 6770 instructions for 2 nonces, or 3385 instructions per nonce.

I know I must have left something out, since I haven't implemented all of this yet and I know my kernel is not 100% efficient, so if anyone sees any more possible redundancy, let me know and I will try to add it to the kernel.

The biggest part of your post that has made me think is the K values being added to the W values and H values (could be especially helpful when the W calculation already has a constant).

If this helps my kernel, expect a donation. :D


@Soros

AMD uses an architecture that executes instructions in groups of 5 (VLIW5).  It sounds better for them to say it has 720 stream processors that each execute a single instruction than to say it has 144 stream processors that execute groups of 5 instructions, which is technically more accurate.
donator
Activity: 1617
Merit: 1012
August 02, 2011, 08:42:57 PM
#4
Quote
I have a 6750, which has 144 stream processors, and I clock it at 870 MHz.  That means it can perform 870M*144*5 = 626.4 billion ALU operations per second.
On AMD's website, the HD 6750 specs indicate that it has 720 stream processors.  In your calculations you indicate only 144 stream processors, but later you multiply 144 by 5 when calculating the theoretical ALU ops per second.

Could you elaborate a little on the difference in terminology?
hero member
Activity: 658
Merit: 500
August 02, 2011, 08:19:17 PM
#3
You realize people would pay good money to go up even 5%?  Of course, it gets harder and harder to optimize as we get closer to the theoretical limit.
full member
Activity: 130
Merit: 100
August 02, 2011, 06:24:51 PM
#2
Nice post!
member
Activity: 70
Merit: 18
August 02, 2011, 03:39:34 PM
#1
I was originally hoping to improve on the latest kernels, but it seems they are extremely close to the theoretical limit that the hardware can support.

My calculations are as follows:
For a 'naive' implementation that doesn't exploit the redundancy of the input, computing W[16] through W[63] takes:
6 rot
4 xor
4 add (3 plus one if you include adding in the K value)
This amounts to 14 ops times 48 values or 672 operations
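As a reference point, here is one schedule word in C with that count annotated (a standalone sketch using the standard rotate-right formulation):

Code:
#include <stdint.h>

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

/* one W value: 6 rotate/shift + 4 xor + 3 add, plus 1 more add when K[i]
   is folded in, for the 14 ops counted above */
static uint32_t w_expand(const uint32_t W[64], int i) {
    uint32_t s0 = rotr(W[i-15], 7) ^ rotr(W[i-15], 18) ^ (W[i-15] >> 3);
    uint32_t s1 = rotr(W[i-2], 17) ^ rotr(W[i-2], 19)  ^ (W[i-2] >> 10);
    return W[i-16] + s0 + W[i-7] + s1;
}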

Then for the rounds themselves the 'naive' or general algorithm requires
6 rot
4 xor
1 maj()
1 ch()
6 add (assuming W and K have already been added)
That's 18 ops times 64 rounds, for a total of 1152 operations.
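And one round in the same style (a sketch; maj() and ch() are each counted as a single op here, matching the list above, and wk is W[i] + K[i] pre-merged):

Code:
#include <stdint.h>

static uint32_t rotr32(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

typedef struct { uint32_t a, b, c, d, e, f, g, h; } state;

/* 6 rot + 4 xor + 1 maj + 1 ch + 6 add = 18 ops */
static state round_step(state s, uint32_t wk) {
    uint32_t S1 = rotr32(s.e, 6) ^ rotr32(s.e, 11) ^ rotr32(s.e, 25); /* 3 rot, 2 xor */
    uint32_t S0 = rotr32(s.a, 2) ^ rotr32(s.a, 13) ^ rotr32(s.a, 22); /* 3 rot, 2 xor */
    uint32_t T1 = s.h + S1 + ((s.e & s.f) ^ (~s.e & s.g)) + wk;       /* ch + 3 add  */
    uint32_t T2 = S0 + ((s.a & s.b) ^ (s.a & s.c) ^ (s.b & s.c));     /* maj + 1 add */
    return (state){ T1 + T2, s.a, s.b, s.c, s.d + T1, s.e, s.f, s.g };/* 2 add */
}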

Then the 8 values have to be added to the running totals, which takes 8 more operations

The grand total for one SHA256 is 672 + 1152 + 8 = 1832 operations.  For a double-SHA256, it requires twice this, or 3664 operations.

On all but a few cards, each stream processor can perform up to 5 operations per clock cycle.  Depending on how smart the compiler is, it might get fewer than 5 operations per clock cycle.  On the 6950 and 6970, each stream processor can perform at most 4 operations per clock cycle.

I have a 6750, which has 144 stream processors, and I clock it at 870 MHz.  That means it can perform 870M*144*5 = 626.4 billion ALU operations per second.

If I divide this throughput of 626.4 G by the 3664 operations required, I get 170.96 MH/s.  Right now I'm getting 165, using a kernel that's not particularly sophisticated.
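The same arithmetic as a two-line snippet:

Code:
/* peak ALU throughput and the implied ceiling for the naive kernel */
static const double ops_per_sec = 870e6 * 144 * 5;          /* 626.4e9 ops/s  */
static const double naive_mhps  = 870e6 * 144 * 5 / 3664e6; /* ~170.96 MH/s   */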

On paper, using the redundancy of the input, I think I can find savings of 150 operations in the first hash's W calculation, 75 operations in the second hash's W calculation, and 46 more operations in the first few rounds of the first hash.  The additions of the 8 values to the running totals at the end of the first hash can be merged into the K values for the second hash, and the ones at the end of the second hash can be eliminated as well.  And the final e value can be skipped.

The total savings for both hashes is 150 + 75 + 46 + 8 + 8 + 1 = 288, which brings the double-SHA256 from 3664 down to 3376, about a 7.9 percent savings.  I think the modern advanced kernels take advantage of all or nearly all of these savings.

This would mean my theoretical 626.4 billion operations per second would be limited to 185.5 MH/s.

This is good because it means the utilization is very good, and it's bad in that there is barely any room for improvement without going to different hardware.