
Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 10. (Read 51245 times)

legendary
Activity: 1344
Merit: 1004
Came across this topic when browsing https://en.bitcoin.it/wiki/Mining_hardware_comparison and the comment on the last 5830 entry (not the crossfire one). Here are my results:

Baseline: 11.6 drivers. Not sure which SDK that is (I'm going to assume 2.4; I haven't knowingly done anything to change the SDK, and 2.4 seems to be what ships with 11.6). 5870 @ 1015 core, 300 memory, with the original phatk on Phoenix 1.5, VECTORS BFI_INT WORKSIZE=128 FASTLOOP=false AGGRESSION=13:
441 MHash/s
7-07 build: immediate gain to 450 MHash/s
Increased memory to 350: 459 MHash/s
Increased WORKSIZE to 256: 463 MHash/s
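
For reference, the full invocation would look something like this (assuming the stock Phoenix 1.5 command-line syntax; the pool URL and device index are placeholders, and everything after -k phatk is just the kernel arguments listed above):

Code:
phoenix.exe -u http://worker:password@pool.example.com:8332 -k phatk DEVICE=0 VECTORS BFI_INT WORKSIZE=128 FASTLOOP=false AGGRESSION=13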

And I found that any increase to memory beyond 360 causes weird things to happen (an almost certain crash; I panicked and brought it back to 350 before it crashed), but I see people posting about 500 MHz, so I'm going to try that out and hopefully not crash.

Edit: 500 memory speed causes a -decrease- to 452 MHash/s. It leads me to think that certain dividers/timings get changed at certain thresholds. That would explain why 500 MHz and 350 MHz appear to be stable, but 360+ and 600 MHz are unstable (I was limited to 600 MHz for a while because I didn't know how to push memory lower than 600; I was using MSI Afterburner).

Now I use MSI Afterburner to alter voltage and AMD GPU Clock Tool to set frequencies. I don't suppose anyone knows an all-in-one solution? AMD GPU Clock Tool doesn't seem to want to set custom voltages; it only offers 0.9500, 1.0630, and Max VDDC and won't accept custom numbers.

member
Activity: 98
Merit: 10
I've found 500 MHz to be the best memory clock for GDDR5 cards and 800 MHz for GDDR3 cards. Runs a bit hotter than 350 but is worth it imo.
newbie
Activity: 54
Merit: 0
I've tested your patch, and there are indeed some great improvements over the original phoenix 1.50 miner.
My 6950 gets improvement.
My 5870 gets improvement.
My 5850 gets improvement.

But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem)

Configuration:
-Using Windows 7
-Using Catalyst 11.6
-Using the aoclbf 1.74 frontend for phoenix 1.50
phatk
Vector
BFI_INT
Aggression 13
Work size 128

2011-07-03 kernel: 317 mhash/s (all 3 numbers are peak values)
2011-07-06 kernel: 283
2011-07-07 kernel: 283

...Needless to say, something bad happened between 07-03 and 07-06. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.

Try setting worksize to 256 and upping your memory to 350 MHz. I get the most MHash/s using those settings.
newbie
Activity: 28
Merit: 0
Yeah, I was thinking the same, but then I thought there may be some smart way to rearrange parts of the SHA-256 algorithm to turn simple bit shifts, exclusive-ORs and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA-256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the S-domain, solve with addition and subtraction, and then transform back for the answer.

While not a math guru, I am certain this can't be done. The algorithm uses these kinds of operations:

- rotate or shift bits right by three different amounts, then XOR the results together
- select bits from one of two values, depending on bits in a third
- take the majority vote of the bits of three values
- add the results of these operations together with a per-round constant

You can build multiplication out of these operations if they are combined in a certain way, but the SHA-256 algorithm does not use them like this. If (parts of) SHA were equivalent to something as simple as multiplication, I'd say it could be broken in no time.

Also, SHA-256 uses 32-bit values for everything. You could of course implement it on an 8-bit machine, but this would make it much slower. And having 24-bit wide registers does not even mean you could run three 8-bit ops at the same time.
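
For reference, here's roughly what those primitives look like in plain OpenCL C (generic macros on 32-bit uints; the macro names are just the usual SHA-256 spec names, and the real phatk kernel uses its own vector type plus the BFI_INT/bitalign tricks instead of these):

Code:
// rotate() is the OpenCL built-in left-rotate, so rotating by (32 - n) gives a right-rotate
#define rotr(x, n)   rotate((uint)(x), (uint)(32 - (n)))
// select bits of y or z depending on the bits of x
#define Ch(x, y, z)  (((x) & (y)) ^ (~(x) & (z)))
// majority of the bits of the three inputs
#define Maj(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
// three right-rotates XORed together
#define S0(x)        (rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22))
#define S1(x)        (rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25))
// a round then just adds these results, the per-round constant K[i] and the message word W[i]:
// t1 = h + S1(e) + Ch(e, f, g) + K[i] + W[i];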

Another thought I had:
Is aggressive loop unrolling really helping performance? At least for FPGAs, I guess that lots of very small units that maybe do one hash every 64 clock cycles could be better than a much bigger unrolled design, and the same could be true for GPUs. Was this already tested or did everyone start with the assumption that unrolling is the way to go?
full member
Activity: 126
Merit: 100
... snip ...
But here lies the problem: AFAIK only additions, bit shifts and other bit-wise operations are used in the current kernel (no multiplications), so at first sight there seems to be no use for it.
Yeah, I was thinking the same, but then I thought there may be some smart way to rearrange parts of the SHA-256 algorithm to turn simple bit shifts, exclusive-ORs and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA-256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the S-domain, solve with addition and subtraction, and then transform back for the answer.
newbie
Activity: 50
Merit: 0
I've tested your patch, and there are indeed some great improvements over the original phoenix 1.50 miner.
My 6950 gets improvement.
My 5870 gets improvement.
My 5850 gets improvement.

But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem)

Configuration:
-Using Windows 7
-Using Catalyst 11.6
-Using the aoclbf 1.74 frontend for phoenix 1.50
phatk
Vector
BFI_INT
Aggression 13
Work size 128

2011-07-03 kernel: 317 mhash/s (all 3 numbers are peak values)
2011-07-06 kernel: 283
2011-07-07 kernel: 283

...Needless to say, something bad happened between 07-03 and 07-06. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.
hero member
Activity: 769
Merit: 500
I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA-256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf
Quote
41. What is the difference between 24-bit and 32-bit integer operations?

    24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only once per cycle.

43. Do 24-bit integers exist in hardware?

    No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32-bits.

75. Is it possible to use all 256 registers in a thread?

    No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.
http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 4-62
Quote
24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Island family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722
Quote
On the 5800 series, signed mul24(a,b) is turned into
Code:
(((a<<8)>>8)*((b<<8)>>8))
This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s

So anyhow what I was thinking was the following

Current kernel: 1 x 256-bit hash / 32-bit ints = 8 32-bit operations (speed 100%)
Possible kernel: 3 x 256-bit hashes / 24-bit ints = 32 24-bit operations (speed at most 166% [5 times faster divided by 3 SHA-256 operations in parallel])*

* It may actually end up being slower than the current kernel.cl if 32bit and 24bit operations are sent as wavefronts at the same time.

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA-256 operations faster than one SHA-256 operation using 8 32-bit integers.

But not everything can be carried out as 24-bit operations, only mul24(a,b) and mad24(a,b,c), so the 166% speed-up would only be achieved if every SHA-256 operation were covered by these two operations. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx, 67xx-69xx), which is generally what miners are using.

But to be honest I haven't looked into the SHA-256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b,c) or mul24(a,b). But I like thinking outside the box.

Hi Bert, sorry for not directly answering you. I checked the OpenCL 1.1 specs and yes, the faster 24-bit integer operations are mentioned there, too.

mul24    (Fast integer function.) Multiplies the 24-bit integer values a and b.
mad24    (Fast integer function.) Multiplies the 24-bit integers a and b, then adds the 32-bit result to the 32-bit integer c.
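
Just to illustrate how the two built-ins are called (the variable names here are mine):

Code:
uint p = mul24(a, b);     // multiplies the low 24 bits of a and b (the fast path)
uint q = mad24(a, b, c);  // mul24(a, b) followed by a full 32-bit add of c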

But here lies the problem: AFAIK only additions, bit shifts and other bit-wise operations are used in the current kernel (no multiplications), so at first sight there seems to be no use for it.

Dia
newbie
Activity: 17
Merit: 0
newbie
Activity: 28
Merit: 0
Thank you for these improvements!
newbie
Activity: 22
Merit: 0
Thanks man, this is awesome!  Hashing rate increased by 10 for each 5830 that I have.

I'm at work right now, but I will definitely donate when I get home.
legendary
Activity: 1288
Merit: 1227
Away on an extended break
Thanks for the updates. Hashing rate increased by like 1-2 MH/s Grin
hero member
Activity: 769
Merit: 500
Guys, I introduced a small glitch, which produces an OpenCL compiler warning in version 07-07. For stability reasons please change line 77:

old:
u W[123];

new:
u W[124];

I missed sharound(123), which writes to W[123]; with the old declaration that index is out of range, so the behaviour is undefined. Sorry for that!
I will upload a fixed version shortly (it only includes the change above and keeps the 07-07 version name).
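
To make the off-by-one concrete (u, W and sharound are the kernel's own identifiers, the rest is just illustration):

Code:
u W[124];        // reserves indices 0..123
// ...
sharound(123);   // writes W[123]; with the old "u W[123]" that index was one past the end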

Edit:
Download 07-07 fixed: http://www.mediafire.com/?o7jfp60s7xefrg4

Dia
newbie
Activity: 20
Merit: 0
nice work, I'll try it  Wink
newbie
Activity: 4
Merit: 0
Could you raise your mem clock to ~350 MHz and report back? What about a worksize of 256? For 5830 cards this helps a lot.

Played with mem clock speeds. 350 saw no improvement, but going from 600 to 1050 gave a ~0.5 MH/s improvement, still not enough to get back the 2 MH/s I lost.

A work size of 256 dropped another few MH/s, so that definitely didn't help. It seems like 07/03/2011 is the winner for me! =)
Thanks for your efforts though, I'll definitely keep testing and see if the newer kernels can return to the 07/03/2011 level.
full member
Activity: 126
Merit: 100
... snip ...
Any ideas and hints are welcome.

Dia

I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA-256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf
Quote
41. What is the difference between 24-bit and 32-bit integer operations?

    24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only once per cycle.

43. Do 24-bit integers exist in hardware?

    No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32-bits.

75. Is it possible to use all 256 registers in a thread?

    No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.
http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 4-62
Quote
24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Island family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722
Quote
On the 5800 series, signed mul24(a,b) is turned into
Code:
(((a<<8)>>8)*((b<<8)>>8))
This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s
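
As a side note on the quoted expression (my own reading, not part of the quote): the shift pair just sign-extends the low 24 bits before a normal 32-bit multiply, roughly:

Code:
int a24 = (a << 8) >> 8;   // keep the low 24 bits of a and sign-extend them back to 32 bits
int b24 = (b << 8) >> 8;
int r   = a24 * b24;       // still a full 32-bit multiply, hence the extra cost for signed mul24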

So anyhow what I was thinking was the following

Current kernel: 1 x 256-bit hash / 32-bit ints = 8 32-bit operations (speed 100%)
Possible kernel: 3 x 256-bit hashes / 24-bit ints = 32 24-bit operations (speed at most 166% [5 times faster divided by 3 SHA-256 operations in parallel])*

* It may actually end up being slower than the current kernel.cl if 32bit and 24bit operations are sent as wavefronts at the same time.
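
Spelling out the arithmetic behind that ceiling:
1 hash state  = 256 bits = 8 x 32-bit words (baseline, 100%)
3 hash states = 768 bits = 32 x 24-bit chunks
best case     = 5x throughput / 3 parallel hashes = ~166%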

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA-256 operations faster than one SHA-256 operation using 8 32-bit integers.

But not everything can be carried out as 24-bit operations, only mul24(a,b) and mad24(a,b,c), so the 166% speed-up would only be achieved if every SHA-256 operation were covered by these two operations. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx, 67xx-69xx), which is generally what miners are using.

But to be honest I haven't looked into the SHA-256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b,c) or mul24(a,b). But I like thinking outside the box.
hero member
Activity: 769
Merit: 500
Asus EAH5850, core 840, mem 180, volt 1080:

version 2011-07-01: 338mh/s
version 2011-07-03: 336mh/s
version 2011-07-06: 301mh/s
version 2011-07-07: 301mh/s

I'll try to play with core/mem clocks again, because these values were the optimal ones for the old phatk. Thanks.

I bet 0.1 BTC that you will reach higher values with raised mem clocks Cheesy. Deal?

Dia
hero member
Activity: 938
Merit: 501
Asus EAH5850, core 840, mem 180, volt 1080:

version 2011-07-01: 338mh/s
version 2011-07-03: 336mh/s
version 2011-07-06: 301mh/s
version 2011-07-07: 301mh/s

I'll try to play with core/mem clocks again, because these values were the optimal ones for the old phatk. Thanks.
hero member
Activity: 769
Merit: 500
Tested the latest 2011-07-07 kernel on my 6990 @ 880 MHz core using the latest 7/1 version of GUIMiner without any additional kernel tweaks and saw roughly a 15 MH/s increase. Thanks and hope to see further improvements in hash rate to come Smiley.

It gets harder after each new version, so I guess the next version could take some time Smiley. Any ideas and hints are welcome.

Dia
newbie
Activity: 55
Merit: 0
Tested the latest 2011-07-07 kernel on my 6990 @ 880 MHz core using the latest 7/1 version of GUIMiner without any additional kernel tweaks and saw roughly a 15 MH/s increase. Thanks and hope to see further improvements in hash rate to come Smiley.
hero member
Activity: 769
Merit: 500
New version 2011-07-07 works on SDK 2.1 w/ VECTORS.

Thanks

So how does it work for you? Compared to other kernels? Which cards do you use?

Dia