further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 10.

ssateneth

legendary

Activity: 1344

Merit: 1004

Came across this topic when browsing https://en.bitcoin.it/wiki/Mining_hardware_comparison and the comment on the last 5830 entry (not the crossfire one). Here are my results:

Baseline: 11.6 drivers. I'm not sure which SDK this is (I'm going to assume 2.4; to my knowledge I haven't done anything to change the SDK, and 2.4 seems to be what comes with 11.6). 5870 @ 1015 core, 300 memory, original phatk with Phoenix 1.5, VECTORS BFI_INT WORKSIZE=128 FASTLOOP=false AGGRESSION=13: 441 MHash/sec
7-07 build: immediate gain to 450 MHash/sec
Increased memory to 350: 459 MHash/sec
Increased WORKSIZE to 256: 463 MHash/sec
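
To put the four figures above in proportion, here is a minimal sketch that turns them into relative gains. The hash rates are the ones reported in this post; the labels and the helper name `percent_gain` are only illustrative.

```python
# Hash rates (MHash/sec) as reported in the post, in the order the changes were applied.
steps = [
    ("baseline: original phatk, memory 300, WORKSIZE=128", 441),
    ("7-07 kernel build", 450),
    ("memory raised to 350", 459),
    ("WORKSIZE raised to 256", 463),
]


def percent_gain(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


baseline = steps[0][1]
for (_, prev), (label, rate) in zip(steps, steps[1:]):
    print(
        f"{label}: {rate} MHash/sec "
        f"(+{percent_gain(prev, rate):.1f}% over previous step, "
        f"+{percent_gain(baseline, rate):.1f}% over baseline)"
    )
```

Taken together, the three changes add up to roughly a 5% gain over the baseline on this one card; other cards in the thread behave differently.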

I also found that any memory increase past 360 causes weird things to happen (an almost certain crash; I panicked to get it back to 350 before it crashed), but I see people posting about 500 MHz, so I'm going to try that out and hopefully not crash.

Edit: 500 memory speed causes a -decrease- to 452 MHash/sec. This leads me to think that certain dividers/timings change at certain thresholds. That would explain why 500 MHz and 350 MHz appear to be stable, but 360+ and 600 MHz are unstable (I was limited to 600 MHz for a while because I didn't know how to push memory lower than 600. I used MSI Afterburner).

Now I use MSI Afterburner to alter voltage and AMD GPU Clock Tool to set frequencies. I don't suppose anyone knows an all-in-one solution? AMD GPU Clock Tool doesn't seem to want to set custom voltages. It just has 0.9500, 1.0630, and Max VDCC. It won't accept custom numbers.

bmgjet

member

Activity: 98

Merit: 10

I've found 500 MHz to be the best memory clock for GDDR5 cards and 800 MHz for GDDR3 cards. It runs a bit hotter than 350 but is worth it IMO.

SeriousWorm

newbie

Activity: 54

Merit: 0

Quote from: KKAtan on July 09, 2011, 12:26:21 PM

I've tested your patch, and there are some great improvements from the original phoenix 1.50 miner indeed.
My 6950 gets improvement.
My 5870 gets improvement.
My 5850 gets improvement.

But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem)

Configuration:
-Using Windows 7
-Using Catalyst 11.6
-Using the aoclbf 1.74 frontend for phoenix 1.50
phatk
Vector
BFI_INT
Aggression 13
Work size 128

2011-07-03 kernel: 317 mhash/s (all 3 numbers are peak values)
2011-07-06 kernel: 283
2011-07-07 kernel: 283

...Needless to say, something bad happened between 07-03 and 07-06. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.

Try setting worksize to 256 and upping your memory to 350mhz. I get the most mhash/sec using that.

1MLyg5WVFSMifFjkrZiyGW2nw

newbie

Activity: 28

Merit: 0

Quote from: Bert on July 09, 2011, 01:18:15 PM

Yea, I was thinking the same, but then I thought that there may be some smart way to rearrange some of the SHA-256 algorithm to change simple bit shifts, exclusive-ors and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA-256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the s-domain, solve with addition and subtraction, and then transfer back for the answer.

While not a math guru, I am certain this can't be done. The algorithm uses these kinds of operations:

- rotate or shift bits right by three different numbers of places, then XOR together
- select bits from one of two values, depending on bits in a third
- majority of bits set/clear in three values
- addition of the result of these operations and constant values for each round

You can build multiplication out of these operations if they are combined in a certain way, but the SHA-256 algorithm does not use them like this. If (parts of) SHA were equivalent to something as simple as multiplication, I'd say it could be broken in no time.

Also, SHA-256 uses 32-bit values for everything. You could of course implement it on an 8-bit machine, but this would make it much slower. And having 24-bit-wide registers does not even mean you could run three 8-bit ops at the same time.
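
The four kinds of operations listed above can be sketched in a few lines. This is a plain Python model of the standard SHA-256 building blocks on 32-bit words, not the phatk kernel code; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF  # SHA-256 works on 32-bit words throughout


def rotr(x, n):
    """Rotate a 32-bit word right by n places."""
    return ((x >> n) | (x << (32 - n))) & MASK32


# "rotate or shift bits right by three different numbers of places, then XOR together"
def big_sigma0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def big_sigma1(x):
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def small_sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def small_sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


# "select bits from one of two values, depending on bits in a third"
def ch(x, y, z):
    return ((x & y) ^ (~x & z)) & MASK32


# "majority of bits set/clear in three values"
def maj(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)


# "addition of the result of these operations and constant values" (modulo 2**32)
def add32(*values):
    return sum(values) & MASK32
```

None of these steps is a multiplication, which is why the mul24/mad24 instructions discussed elsewhere in the thread have nothing obvious to attach to.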

Another thought I had:
Is aggressive loop unrolling really helping performance? At least for FPGAs, I guess that lots of very small units that maybe do one hash every 64 clock cycles could be better than a much bigger unrolled design, and the same could be true for GPUs. Was this already tested or did everyone start with the assumption that unrolling is the way to go?

Bert

full member

Activity: 126

Merit: 100

Quote from: Diapolo on July 09, 2011, 09:19:50 AM

... snip ...
But here lies the problem: AFAIK only additions, bit-shifting and other bit-wise operations are used in the current kernel (no multiplications). So at first sight there should be no use for it.

Yea, I was thinking the same, but then I thought that there may be some smart way to rearrange some of the SHA-256 algorithm to change simple bit shifts, exclusive-ors and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA-256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the s-domain, solve with addition and subtraction, and then transfer back for the answer.

KKAtan

newbie

Activity: 50

Merit: 0

I've tested your patch, and there are some great improvements from the original phoenix 1.50 miner indeed.
My 6950 gets improvement.
My 5870 gets improvement.
My 5850 gets improvement.

But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem)

Configuration:
-Using Windows 7
-Using Catalyst 11.6
-Using the aoclbf 1.74 frontend for phoenix 1.50
phatk
Vector
BFI_INT
Aggression 13
Work size 128

2011-07-03 kernel: 317 mhash/s (all 3 numbers are peak values)
2011-07-06 kernel: 283
2011-07-07 kernel: 283

...Needless to say, something bad happened between 07-03 and 07-06. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: Bert on July 08, 2011, 03:18:32 AM

I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA-256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

Quote

41. What is the difference between 24-bit and 32-bit integer operations?

   24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only one per cycle.

43. Do 24-bit integers exist in hardware?

   No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32-bits.

75. Is it possible to use all 256 registers in a thread?

   No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.

http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 4-62

Quote

24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Islands family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722

Quote

On the 5800 series, signed mul24(a,b) is turned into

Code:

(((a<<8)>>8)*((b<<8)>>8))

This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s

So anyhow what I was thinking was the following

Current kernel: 1 * 256-bit hash / 32-bit int = 8 32-bit operations (speed 100%)
Possible kernel: 3 * 256-bit hash / 24-bit int = 32 24-bit operations (speed a maximum of 166% [5 times faster divided by 3 SHA-256 operations in parallel])^*

^* It may actually end up being slower than the current kernel.cl if 32bit and 24bit operations are sent as wavefronts at the same time.

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA-256 operations at once, faster than one SHA-256 operation using 8 32-bit integers.

But not everything can be carried out as 24bit operations, only mul24(a,b) and mad24(a,b), so the 166% speed up would only be achieved if every SHA-256 operation was covered by these two operations. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx,67xx-69xx), which is generally what miners are using.

But to be honest I haven't looked into the SHA-256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b) or mul24(a,b). But I like thinking outside the box.

Hi Bert, sorry for not answering you directly. I checked the OpenCL 1.1 specs and yes, the faster 24-bit integer operations are mentioned there, too.

mul24 (Fast integer function.) Multiply 24-bit integer values a and b
mad24 (Fast integer function.) Multiply 24-bit integer then add the 32-bit result to 32-bit integer

But here lies the problem: AFAIK only additions, bit-shifting and other bit-wise operations are used in the current kernel (no multiplications). So at first sight there should be no use for it.
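
For reference, the two built-ins described above can be modelled roughly as follows. This is a sketch of the unsigned variants only, and the names `mul24_u` and `mad24_u` are made up for illustration. Masking the inputs to 24 bits is an assumption of this model: for inputs outside the 24-bit range, the OpenCL specification leaves the result implementation-defined.

```python
MASK24 = 0xFFFFFF
MASK32 = 0xFFFFFFFF


def mul24_u(a, b):
    """Model of unsigned mul24: multiply the low 24 bits of a and b, keep the low 32 bits.

    Masking out-of-range inputs is an assumption of this sketch, not guaranteed behaviour.
    """
    return ((a & MASK24) * (b & MASK24)) & MASK32


def mad24_u(a, b, c):
    """Model of unsigned mad24: mul24 result plus a 32-bit addend, modulo 2**32."""
    return (mul24_u(a, b) + c) & MASK32
```

Both are multiply-based, so they only help where the kernel actually multiplies, for example in index arithmetic; the SHA-256 rounds themselves contain no multiplications.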

Dia

gominoa

newbie

Activity: 17

Merit: 0

Works great!

Maxim Gladkov

newbie

Activity: 28

Merit: 0

Thank you for these improvements!

mitchel

newbie

Activity: 22

Merit: 0

Thanks man, this is awesome! Hashing rate increased by 10 for each 5830 that I have.

I'm at work right now, but I will definitely donate when I get home.

John (John K.)

legendary

Activity: 1288

Merit: 1227

Away on an extended break

Thanks for the updates. Hashing rate increased by about 1-2 MH/s.

Diapolo

hero member

Activity: 772

Merit: 500

Guys, I introduced a small glitch, which produces an OpenCL compiler warning in version 07-07. For stability reasons please change line 77:

old:
u W[123];

new:
u W[124];

I missed sharound(123), which writes to W[123], which is undefined, because it's out of range. Sorry for that!
Will upload a fixed version shortly (only includes the change above and stays 07-07).
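
Why the array needs 124 elements can be shown with a small Python stand-in. An array declared with 123 elements has valid indices 0 to 122, so a write to index 123 falls outside it. The helper names below are hypothetical and `sharound` itself is not modelled. Note that Python raises an error here, whereas an out-of-range write in an OpenCL C kernel is undefined behaviour and may go unnoticed.

```python
def make_w(size):
    """Stand-in for the kernel's `u W[size];` declaration."""
    return [0] * size


def write_at(w, index):
    """Store a marker at w[index]; Python raises IndexError if index is out of range."""
    w[index] = 1
    return w


# Old declaration: W[123] has valid indices 0..122, so writing W[123] is out of range.
try:
    write_at(make_w(123), 123)
    old_ok = True
except IndexError:
    old_ok = False

# Fixed declaration: W[124] has valid indices 0..123, so the same write succeeds.
new_ok = write_at(make_w(124), 123)[123] == 1

print("W[123] accepts a write to index 123:", old_ok)
print("W[124] accepts a write to index 123:", new_ok)
```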

Edit:
Download 07-07 fixed: http://www.mediafire.com/?o7jfp60s7xefrg4

Dia

makiet

newbie

Activity: 20

Merit: 0

Nice work, I'll try it.

zmcgrew

newbie

Activity: 4

Merit: 0

Quote from: Diapolo on July 07, 2011, 12:59:43 AM

Could you raise your Mem clock to ~350 MHz and report back. What about Worksize of 256, for 5830 cards this helps a lot.

Played with mem clock speeds. 350 saw no improvement, but 600 to 1050 saw a ~.5 Mh/s improvement, but still not enough to get me back the 2 Mh/s I lost.

Work size of 256 dropped off another few Mh/s, so that definitely didn't help. It seems like 07/03/2011 is the winner for me! =)
Thanks for your efforts though, I'll definitely keep testing and see if the newer kernels can return to the 07/03/2011 level.

Bert

full member

Activity: 126

Merit: 100

Quote from: Diapolo on July 08, 2011, 12:30:47 AM

... snip ...
Any ideas and hints are welcome.

Dia

I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA-256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

Quote

41. What is the difference between 24-bit and 32-bit integer operations?

   24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only one per cycle.

43. Do 24-bit integers exist in hardware?

   No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32-bits.

75. Is it possible to use all 256 registers in a thread?

   No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.

http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 4-62

Quote

24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Islands family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722

Quote

On the 5800 series, signed mul24(a,b) is turned into

Code:

(((a<<8)>>8)*((b<<8)>>8))

This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s
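
What the quoted expansion does can be modelled in Python. Shifting a 32-bit signed value left by 8 and then arithmetically right by 8 sign-extends its low 24 bits, so the emulated signed mul24 costs four extra shifts on top of the multiply. That is a plausible reason for the slower timing above, though the quoted post does not give a breakdown. The helper names are illustrative.

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def sign_extend24(a):
    """Model of (a << 8) >> 8 on a 32-bit signed int with an arithmetic right shift."""
    return to_int32(a << 8) >> 8


def emulated_signed_mul24(a, b):
    """Model of (((a<<8)>>8)*((b<<8)>>8)) with the product wrapped to 32 bits."""
    return to_int32(sign_extend24(a) * sign_extend24(b))
```

In this model, values that already fit in 24 signed bits pass through `sign_extend24` unchanged, while bit 23 of a larger value is treated as the sign bit.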

So anyhow what I was thinking was the following

Current kernel: 1 * 256-bit hash / 32-bit int = 8 32-bit operations (speed 100%)
Possible kernel: 3 * 256-bit hash / 24-bit int = 32 24-bit operations (speed a maximum of 166% [5 times faster divided by 3 SHA-256 operations in parallel])^*

^* It may actually end up being slower than the current kernel.cl if 32bit and 24bit operations are sent as wavefronts at the same time.
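
The arithmetic behind these two lines can be checked with a short sketch. It only reproduces the post's own estimate and does not model the hardware: the 5x figure is the quoted throughput of 24-bit MUL/MAD relative to 32-bit multiplies, and it is an assumption of the estimate that this would carry over to a whole SHA-256 round. The names are illustrative.

```python
HASH_BITS = 256  # size of one SHA-256 state


def words_needed(parallel_hashes, word_bits):
    """How many word_bits-wide values hold parallel_hashes hash states (rounded up)."""
    total_bits = parallel_hashes * HASH_BITS
    return -(-total_bits // word_bits)  # ceiling division


current_words = words_needed(1, 32)   # one hash in 32-bit words
proposed_words = words_needed(3, 24)  # three hashes in 24-bit words

# The post's upper bound: 5x faster operations spread over 3 parallel hashes.
upper_bound = 5 / 3

print(current_words, proposed_words, f"{upper_bound:.0%}")
```

The word counts match the 8 and 32 in the post, and 5/3 is the source of the "maximum of 166%" figure.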

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA-256 operations at once, faster than one SHA-256 operation using 8 32-bit integers.

But not everything can be carried out as 24bit operations, only mul24(a,b) and mad24(a,b), so the 166% speed up would only be achieved if every SHA-256 operation was covered by these two operations. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx,67xx-69xx), which is generally what miners are using.

But to be honest I haven't looked into the SHA-256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b) or mul24(a,b). But I like thinking outside the box.

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: kr105 on July 08, 2011, 01:44:36 AM

Asus EAH5850, core 840, mem 180, volt 1080:

version 2011-07-01: 338mh/s
version 2011-07-03: 336mh/s
version 2011-07-06: 301mh/s
version 2011-07-07: 301mh/s

I'll try to play with core/mem clocks again, because these values were the optimum for the old phatk. Thanks.

I bet 0.1 BTC that you will reach higher values with raised mem clocks. Deal?

Dia

kr105

hero member

Activity: 938

Merit: 501

Asus EAH5850, core 840, mem 180, volt 1080:

version 2011-07-01: 338mh/s
version 2011-07-03: 336mh/s
version 2011-07-06: 301mh/s
version 2011-07-07: 301mh/s

I'll try to play with core/mem clocks again, because these values were the optimum for the old phatk. Thanks.

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: burningrave101 on July 08, 2011, 12:28:39 AM

Tested the latest 2011-07-07 kernel on my 6990 @ 880Mhz core using the latest 7/1 version of GUIMiner without any additional kernel tweaks and saw roughly a 15 Mh/s increase. Thanks and hope to see further improvements in hash rate to come.

It gets harder with each new version, so I guess the next version could take some time. Any ideas and hints are welcome.

Dia

burningrave101

newbie

Activity: 55

Merit: 0

Tested the latest 2011-07-07 kernel on my 6990 @ 880Mhz core using the latest 7/1 version of GUIMiner without any additional kernel tweaks and saw roughly a 15 Mh/s increase. Thanks and hope to see further improvements in hash rate to come.

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: gominoa on July 07, 2011, 05:58:10 PM

New version 2011-07-07 works on SDK 2.1 w/ VECTORS.

Thanks

So how does it work for you? Compared to other kernels? Which cards do you use?

Dia
