
Topic: -- Optimized poclbm kernel! Another 5 Mhash/s -- (Read 2112 times)

legendary
Activity: 892
Merit: 1002
1 BTC =1 BTC
What were your values with the stock kernel, if I may ask?

Dia

I use this:

phoenix.exe -u ... -k poclbm VECTORS BFI_INT FASTLOOP=false AGGRESSION=11 DEVICE=0

(long term rejection rate is between 1-1.5%)

hero member
Activity: 769
Merit: 500
Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on 5870 and Mhash/s went down from 392 (Dia's kernel) to 374.

What were your values with the stock kernel, if I may ask?

Dia
newbie
Activity: 7
Merit: 0
interesting
legendary
Activity: 892
Merit: 1002
1 BTC =1 BTC
Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on 5870 and Mhash/s went down from 392 (Dia's kernel) to 374.
newbie
Activity: 38
Merit: 0
Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!
hero member
Activity: 769
Merit: 500
I'm confused by change #4.

The else if should save a few operations just in itself, and if the code already runs correctly with the else if in it, then a double assignment into arrays would be costly if allowed to run.

If/else statements (control flow) in OpenCL kernels always slow down computation. Both paths need to be evaluated, so it should make little or no difference whether you use if/else or if/if.

Dia
hero member
Activity: 769
Merit: 500
Seems like you used some ideas similar to the ones I had for phatk :)

Look here: http://forum.bitcoin.org/index.php?topic=25135.msg314520#msg314520

Dia
newbie
Activity: 28
Merit: 0
Well, I meant I haven't tested my min() for long enough... and I probably won't test it on its own, because the difference is not worth the effort (I'll test it together with other kernel mods, if I have any).

I haven't tried your changes yet. I honestly don't understand why it works with the local for anyone, but I like the constant thing you've done :)
newbie
Activity: 38
Merit: 0
I tested this one on the pools (even with __local) without any problems - and generated a block on testnet.
newbie
Activity: 28
Merit: 0
Yeah, the min() seems to help, but it helps so little that I can't see the difference without a profiler :) And since I haven't tested the change nearly enough...
newbie
Activity: 38
Merit: 0
Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)

For the local, search the board.

EDIT - I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)

Seen the issues on __local, removed it from the zip.

The min() has exactly the same speed here - maybe you can get it faster?
newbie
Activity: 56
Merit: 0
If H.x == 0 is almost always false, flip the if...else statement:

Code:
if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}
else if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}

It'd be faster than a double if statement since it'd be a single comparison in most cases.
newbie
Activity: 28
Merit: 0
Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)

For the local, search the board.

EDIT - I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)
newbie
Activity: 38
Merit: 0
Code:
if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}
else if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}

The first condition, H.x == 0, is almost always false, so it's 2 comparisons almost every cycle: exactly the same speed as without the "else".

The assignments are only done when a hash is found, so they do not affect speed at all.

It's not a double assignment: the second one goes to output[nonce.y & OUTPUT_MASK]. That's just to get around race conditions.
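To make the difference between the two variants concrete, here is a small host-side simulation in Python (not kernel code). It is only a sketch: the buffer size of 256 is an assumption for illustration, and the helper names `report`, `check_else_if`, `check_if_if` and `found` are made up here. It mimics the `output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce` pattern for one two-wide work item.

```python
OUTPUT_SIZE = 256               # assumed buffer size, not taken from the thread
OUTPUT_MASK = OUTPUT_SIZE - 1


def report(output, nonce):
    # Mirrors: output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce;
    output[nonce & OUTPUT_MASK] = nonce
    output[OUTPUT_SIZE] = nonce


def check_else_if(h, nonce):
    """Stock variant: the second lane is only checked if the first missed."""
    output = [0] * (OUTPUT_SIZE + 1)
    if h[0] == 0:
        report(output, nonce[0])
    elif h[1] == 0:
        report(output, nonce[1])
    return output


def check_if_if(h, nonce):
    """Change #4: both lanes are always checked."""
    output = [0] * (OUTPUT_SIZE + 1)
    if h[0] == 0:
        report(output, nonce[0])
    if h[1] == 0:
        report(output, nonce[1])
    return output


def found(output):
    # Nonces stored in the masked slots (slot OUTPUT_SIZE is only a flag).
    return sorted(n for n in output[:OUTPUT_SIZE] if n != 0)


# Both lanes hit at once (rare): only the if/if variant keeps both nonces.
print(found(check_else_if((0, 0), (1000, 1001))))
print(found(check_if_if((0, 0), (1000, 1001))))
```

In every other case (no hit, or a hit in only one lane) the two variants store the same nonces, which is why the change only matters when both lanes hit together.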

The __local patch is invalid? First time I've heard of it... Sure, I'll remove it then. The other ones produce valid hashes; I see no reason why they should not be valid. It's all calculated on paper, step by step, and it looks equal to me.

Thanks for the donation!
newbie
Activity: 56
Merit: 0
I'm confused by change #4.

The else if should save a few operations just in itself, and if the code already runs correctly with the else if in it, then a double assignment into arrays would be costly if allowed to run.
newbie
Activity: 28
Merit: 0
I've sent you a small donation for your hard work, but...

Do pools accept any hashes generated with your kernel? Really? For instance, the 'local' optimization was declared invalid (it messes up the calculation, so the thread got locked by the moderator), etc.

EDIT - As to why exit early on the if()-s... well, if you found a solution already, why do you need a second solution? Doing branches on the GPU is very expensive (threads may diverge, etc.), so two branches may and most likely will end up being worse than one. May I suggest if(min(x,y)==0) { output x; }? Assuming min can be done without branching, this is one branch if you don't have a solution in either x or y (if min is not 1 instruction, find another function to replace the min...), then try both x and (x+1) on the CPU side to figure out which one of these is the real solution.
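A rough Python sketch of the min() idea, under stated assumptions: the hash words are unsigned 32-bit values, and `is_valid` is a hypothetical callback standing in for re-hashing the block header on the CPU (the thread does not show that code). The function names are made up for this example.

```python
def either_is_zero(hx, hy):
    # For unsigned values, min(hx, hy) is 0 exactly when at least one is 0,
    # so a single test replaces the two separate comparisons.
    return min(hx, hy) == 0


def resolve_on_host(reported_nonce, is_valid):
    """Try the reported nonce and its neighbour, keep whichever passes.

    is_valid is a hypothetical host-side check (e.g. a full re-hash).
    """
    candidates = (reported_nonce, (reported_nonce + 1) & 0xFFFFFFFF)
    return [n for n in candidates if is_valid(n)]


# Usage: the kernel reported x, but the real solution was in the y lane.
print(resolve_on_host(1000, lambda n: n == 1001))
```

The trade-off is that the kernel no longer says which lane hit, so the host does one extra check per reported nonce. Since hits are rare, that cost should be negligible.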



newbie
Activity: 38
Merit: 0
Want some more Mhash/s? Try this optimized kernel!

Tested with Phoenix miner - got my HD6950, stock speed, locked shaders, from 343Mhash/s to 349Mhash/s!

This kernel also contains the optimization already posted on this forum, namely "Ma z^x". That one is not mine and I'm not taking credit for it! The 343Mhash/s baseline already included this patch.

What's new:
Lots of small changes, some only save a single addition.

Code:
#1:
Before:
H = 0xb0edbdd0 + K[ 0] +  W0; D = 0xa54ff53a + H; H = H + 0x08909ae5U;

After:
H = W0 + 4228417613; D = W0 + 2563236514;

#2:
Before:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1) + K[ 4] +  0x80000000;

After:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1);
+ Put Constant K[ 4] + 0x80000000 into python pre-calculation
-> self.state2[3] = np.uint32(self.state2[3]+3109470811);

#3:
Before:
H = ....   K[60] + W12;
H+=0x5be0cd19U;
if (H == 0)

After:
if (H == 325071597)

#4:
Before:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        else if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

After:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

Why stop checking once we have found a result? It is unlikely, but we could have found two. This adds almost no overhead.

#5:
Lots of small changes (some of them were optimized by the compiler before, but anyway)


For #2 I changed the precalculation in __init__.py. Take a look at it! You can use diff; it's just 2 lines.
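The folded constants in #1 to #3 can be re-derived with 32-bit wrap-around arithmetic. The Python sketch below assumes K[] holds the standard SHA-256 round constants (K[0], K[4] and K[60] are written out), and the function names are made up for illustration. For #3 it also assumes that K[60] was folded into the comparison together with 0x5be0cd19, since that is the reading under which the posted value 325071597 comes out.

```python
# Standard SHA-256 round constants used by the folds in changes #1-#3.
K0, K4, K60 = 0x428A2F98, 0x3956C25B, 0x90BEFFFA
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned wrap-around


def u32(x):
    return x & MASK


def fold_change_1():
    h_const = u32(0xB0EDBDD0 + K0)        # stock: H = 0xb0edbdd0 + K[0] + W0
    d_const = u32(0xA54FF53A + h_const)   # stock: D = 0xa54ff53a + H
    h_const = u32(h_const + 0x08909AE5)   # stock: H = H + 0x08909ae5U
    return h_const, d_const               # H = W0 + h_const; D = W0 + d_const


def fold_change_2():
    # Constant moved into the Python pre-calculation of state2[3].
    return u32(K4 + 0x80000000)


def fold_change_3():
    # (S + K[60] + 0x5be0cd19) mod 2^32 == 0
    #   <=>  S == -(K[60] + 0x5be0cd19) mod 2^32
    return u32(-(K60 + 0x5BE0CD19))


print(fold_change_1())
print(fold_change_2())
print(fold_change_3())
```

These reproduce the decimal literals from the post: 4228417613 and 2563236514 for #1, 3109470811 for #2, and 325071597 for #3.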

Please note: This is part of the result of >100 hours of hard work. If you want me to keep posting patches, say thank you in the form of a small donation. Anything above 0.01 is just fine ;)
-> 1Dsxro7GvNDaxWkvMgkraEttAA4xqagxVp

Btw, I already got some more - minor - optimizations.

Here it is:
http://www.filesonic.com/file/1348177284/poclbm_kernel.zip


Please post some results!