-- Optimized poclbm kernel! Another 5 Mhash/s --

server

Quote from: Diapolo on July 03, 2011, 02:39:53 AM

What were your values with the stock kernel, if I may ask?

Dia

I use this:

phoenix.exe -u ... -k poclbm VECTORS BFI_INT FASTLOOP=false AGGRESSION=11 DEVICE=0

(The long-term rejection rate is between 1% and 1.5%.)

Diapolo

Quote from: server on July 02, 2011, 06:38:10 PM

Quote from: Vince on July 02, 2011, 06:17:47 PM

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on a 5870 and the hash rate went down from 392 Mhash/s (Dia's kernel) to 374 Mhash/s.

What were your values with the stock kernel, if I may ask?

Dia

Anibalayl

interesting

server

Quote from: Vince on July 02, 2011, 06:17:47 PM

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on a 5870 and the hash rate went down from 392 Mhash/s (Dia's kernel) to 374 Mhash/s.

Vince

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Diapolo

Quote from: fascistmuffin on July 02, 2011, 12:45:39 AM

I'm confused by change #4.

The else if should save a few operations by itself, and if the code already runs correctly with the else if in place, a double assignment into the arrays would be costly if it were allowed to run.

If/else statements (control flow) in OpenCL kernels always slow down computation. Both paths need to be evaluated, so it should make little or no difference whether you use if / else if or two separate ifs.

Dia

Diapolo

Seems like you used some ideas similar to the ones I had for phatk.

Look here: http://forum.bitcoin.org/index.php?topic=25135.msg314520#msg314520

Dia

bitless

Well, I meant I haven't tested my min() for long enough... and I probably won't test it at all, because the difference is not worth the effort (well, I'll test it together with other kernel mods, if I have any).

I haven't tried your changes yet. I honestly don't understand why it works with the __local change for anyone, but I like the constant thing you've done.

Vince

I tested this one on the pools (even with __local) without any problems - and generated a block on testnet.

bitless

Yeah, the min() seems to help, but it helps so little that I can't see the difference without a profiler.

And since I haven't tested the change nearly enough...

Vince

Quote from: bitless on July 02, 2011, 12:58:50 AM

Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it.

For the local, search the board.

EDIT - I'll claim the donations for the min() anyway, if (and only if) it works and helps anyone.

I've seen the issues with __local and removed it from the zip.

The min() has exactly the same speed here - maybe you can get it faster?

fascistmuffin

If H.x == 0 is almost always false, flip the if...else statement:

Code:

if (H.y == 0)
{
    output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}
else if (H.x == 0)
{
    output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}

It'd be faster than a double if statement since it'd be a single comparison in most cases.

bitless

Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it.

For the local, search the board.

EDIT - I'll claim the donations for the min() anyway, if (and only if) it works and helps anyone.

Vince

Code:

if (H.x == 0)
{
    output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}
else if (H.y == 0)
{
    output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}

The first condition, H.x == 0, is almost always false, so it's 2 comparisons on almost every cycle: exactly the same speed as without the "else".

The assignments are only done when a hash is found, so they do not affect speed at all.

It's not a double assignment: the second one goes to output[nonce.y & OUTPUT_MASK]. That's just to get around race conditions.
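
To make the comparison count and the both-found case concrete, here is a plain Python model of the two variants. It is a scalar sketch only, not a GPU model, so it says nothing about what divergence costs on real hardware. OUTPUT_SIZE = 256 is an assumption for illustration (the thread does not give the value), treating the last buffer element as a "something was found" marker is also an assumption, and all function names are made up.

```python
OUTPUT_SIZE = 256               # assumed buffer size, not taken from the thread
OUTPUT_MASK = OUTPUT_SIZE - 1   # assumes the size is a power of two


def store(output, nonce):
    """Mirror of: output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce;"""
    output[nonce & OUTPUT_MASK] = nonce
    output[OUTPUT_SIZE] = nonce


def check_else_if(output, hx, hy, nx, ny):
    """'if ... else if' form. Returns the number of comparisons performed."""
    if hx == 0:
        store(output, nx)
        return 1
    if hy == 0:
        store(output, ny)
    return 2


def check_if_if(output, hx, hy, nx, ny):
    """Two independent 'if's. Always performs both comparisons."""
    if hx == 0:
        store(output, nx)
    if hy == 0:
        store(output, ny)
    return 2


def found(output):
    """Non-zero nonces in the slots (the extra last element is skipped)."""
    return sorted(n for n in output[:OUTPUT_SIZE] if n)


# Common case: H.x != 0, so both forms do 2 comparisons.
buf_a, buf_b = [0] * (OUTPUT_SIZE + 1), [0] * (OUTPUT_SIZE + 1)
common = (check_else_if(buf_a, 5, 9, 10, 11), check_if_if(buf_b, 5, 9, 10, 11))

# Rare case: both hashes are zero. Only the if/if form records both nonces.
buf_c, buf_d = [0] * (OUTPUT_SIZE + 1), [0] * (OUTPUT_SIZE + 1)
check_else_if(buf_c, 0, 0, 10, 11)
check_if_if(buf_d, 0, 0, 10, 11)

print(common, found(buf_c), found(buf_d))
```

In this model the two forms differ only when both hashes are zero at once, which is the case change #4 is meant to cover.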

The __local patch is invalid? That's the first I've heard of it... Sure, I'll remove it then. The other ones produce valid hashes, and I see no reason why they should not be valid. It's all calculated on paper, step by step, and it looks equivalent to me.

Thanks for the donation!

fascistmuffin

I'm confused by change #4.

The else if should save a few operations by itself, and if the code already runs correctly with the else if in place, a double assignment into the arrays would be costly if it were allowed to run.

bitless

I've sent you a small donation for your hard work, but...

Do pools accept any hashes generated with your kernel? Really? For instance, the 'local' optimization was declared invalid (it messes up the calculation, so the thread got locked by the moderator), etc.

EDIT - As to why exit early on the if()-s... well, if you found a solution already, why do you need a second solution? Doing branches on the GPU is very expensive (threads may diverge, etc.), so two branches may, and most likely will, end up being worse than one. May I suggest if(min(x,y)==0) { output x; }? Assuming min can be done without branching, this is a single branch when there is no solution in either x or y (if min is not 1 instruction, find another function to replace it...). Then try both x and (x+1) on the CPU side to figure out which one of these is the real solution.
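
A minimal sketch of that idea in Python, just to show the logic (it is not OpenCL and has not been run against a real miner). For unsigned values, min(x, y) == 0 is true exactly when x or y is zero, so one test covers both hashes; the host then has to work out which nonce was the real one. The names either_hash_is_zero, pick_real_nonce and is_solution are hypothetical, and the x+1 step assumes the second nonce of the pair is always the first plus one.

```python
def either_hash_is_zero(hx, hy):
    """One comparison for both lanes; valid only for unsigned (non-negative) values."""
    return min(hx, hy) == 0


def pick_real_nonce(nonce_x, is_solution):
    """Host-side follow-up: the kernel reported nonce_x, so try it and its neighbour.

    is_solution is a hypothetical callback that re-hashes a nonce on the CPU.
    Returns the first nonce that checks out, or None if neither does.
    """
    for candidate in (nonce_x, nonce_x + 1):
        if is_solution(candidate):
            return candidate
    return None


# The min() test agrees with the two separate comparisons on every sample pair.
samples = [(0, 0), (0, 7), (7, 0), (7, 9), (0xFFFFFFFF, 0), (1, 1)]
for hx, hy in samples:
    assert either_hash_is_zero(hx, hy) == (hx == 0 or hy == 0)

# Example: pretend only nonce 1001 hashes to a valid share.
print(pick_real_nonce(1000, lambda n: n == 1001))
```

The trade-off is extra CPU work on the rare hit in exchange for one less test in the kernel's common path; whether that actually helps would have to be measured.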

Vince

Want some more Mhash/s? Try this optimized kernel!

Tested with Phoenix miner - got my HD6950 (stock speed, locked shaders) from 343 Mhash/s to 349 Mhash/s!

This kernel also contains the optimization already posted on this forum, namely "Ma z^x". This is not mine and I'm not taking credit for it! The 343 Mhash/s figure already included this patch.

What's new:
Lots of small changes; some only save a single addition.

Code:

#1:
Before:
H = 0xb0edbdd0 + K[ 0] + W0; D = 0xa54ff53a + H; H = H + 0x08909ae5U;

After:
H = W0 + 4228417613; D = W0 + 2563236514;

#2:
Before:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1) + K[ 4] + 0x80000000;

After:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1);
+ Put Constant K[ 4] + 0x80000000 into python pre-calculation
-> self.state2[3] = np.uint32(self.state2[3]+3109470811);

#3:
Before:
H = .... K[60] + W12;
H+=0x5be0cd19U;
if (H == 0)

After:
if (H == 325071597)

#4:
Before:
   if (H.x == 0)
   {
      output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
   }
   else if (H.y == 0)
   {
      output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
   }

After:
   if (H.x == 0)
   {
      output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
   }
   if (H.y == 0)
   {
      output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
   }

Why stop checking once we have found a result? It is unlikely, but we could have found two, and the second check adds almost no overhead.

#5:
Lots of small changes (some of them were optimized by the compiler before, but anyway)

For #2 I changed the precalculation in __init__.py. Take a look at the changes! You can use diff - it's just 2 lines.
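
The folded constants in #1 to #3 can be re-derived with 32-bit wrap-around arithmetic. The sketch below is only a sanity check, not part of the kernel or of __init__.py. It assumes the standard SHA-256 round constants for K[0], K[4] and K[60], and it assumes that change #3 also drops K[60] from the elided H line (that is what makes the number come out). The helper name add32 is made up for illustration.

```python
# All kernel arithmetic is on 32-bit unsigned words, so every sum wraps mod 2**32.
MASK32 = 0xFFFFFFFF

# Standard SHA-256 round constants referenced above: K[0], K[4], K[60].
K0, K4, K60 = 0x428A2F98, 0x3956C25B, 0x90BEFFFA


def add32(*terms):
    """Add 32-bit words with wrap-around, like uint arithmetic in the kernel."""
    return sum(terms) & MASK32


# Change #1: H = 0xb0edbdd0 + K[0] + W0; D = 0xa54ff53a + H; H = H + 0x08909ae5.
h_const = add32(0xB0EDBDD0, K0, 0x08909AE5)   # constant part of the final H
d_const = add32(0xA54FF53A, 0xB0EDBDD0, K0)   # constant part of D

# Change #2: K[4] + 0x80000000 moves into the Python pre-calculation.
state_const = add32(K4, 0x80000000)

# Change #3: (rest + K[60]) + 0x5be0cd19 == 0  <=>  rest == -(K[60] + 0x5be0cd19) mod 2**32.
target = (-add32(K60, 0x5BE0CD19)) & MASK32

print(h_const, d_const, state_const, target)
```

This prints 4228417613 2563236514 3109470811 325071597, matching the literals used in the "After" lines.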

Please note: This is part of the result of >100 hours of hard work. If you want me to keep posting patches, say thank you in the form of a small donation. Everything above 0.01 is just fine.

-> 1Dsxro7GvNDaxWkvMgkraEttAA4xqagxVp

Btw, I already have some more (minor) optimizations.

Here it is:
http://www.filesonic.com/file/1348177284/poclbm_kernel.zip

Please post some results!
