Modified Kernel for Phoenix 1.5 - page 9.

Phateus

newbie

Activity: 52

Merit: 0

pennytrader

sr. member

Activity: 254

Merit: 250

Quote from: Phateus on August 02, 2011, 01:15:52 AM

Quote from: pennytrader on August 02, 2011, 12:57:59 AM

With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

I changed it again. If it still doesn't work at all, can you give me some details on the settings you are using?

Now worked! 316 Mhash/sec!

-k phatk DEVICE=1 VECTORS BFI_INT AGGRESSION=11 WORKSIZE=256

And it uses 0% CPU as usual.

Excellent work!

Phateus

newbie

Activity: 52

Merit: 0

Quote from: pennytrader on August 02, 2011, 12:57:59 AM

With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

I changed it again. If it still doesn't work at all, can you give me some details on the settings you are using?

Phateus

newbie

Activity: 52

Merit: 0

Quote from: pennytrader on August 02, 2011, 12:57:59 AM

With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

AAAH! I think OpenCL is going to make my head explode.. lol

Quote from: Diapolo on August 02, 2011, 12:56:07 AM

You are using the OpenCL rotate() instead of amd_bitalign(), what's the benefit here (is it the same under the hood)?

Dia

No, just cleaner.. since it is the same code (well... for SDK 2.4 at least)... it looks like 2.1 does not realize that they are the same and I will have to change it back...

Quote from: joulesbeef on August 02, 2011, 12:55:32 AM

Quote

And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

I'm still getting miner idle errors in guiminer with VECTORS BFI_INT -k phatk FASTLOOP=false WORKSIZE=256 AGGRESSION=12 -q2

is it just guiminer?

edit:works fine with aoclbf 1.75.. i wonder why guiminer has such trouble

speed 318 over 315 with diablo 7-17

No clue, I have not used or downloaded GUIMiner; I use aoclbf. I might be able to take a look at it after figuring out how to make it work for SDK 2.1

pennytrader

sr. member

Activity: 254

Merit: 250

With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

Diapolo

hero member

Activity: 772

Merit: 500

You are using the OpenCL rotate() instead of amd_bitalign(), what's the benefit here (is it the same under the hood)?

Dia

joulesbeef

sr. member

Activity: 476

Merit: 250

moOo

Quote

And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

I'm still getting miner idle errors in guiminer with VECTORS BFI_INT -k phatk FASTLOOP=false WORKSIZE=256 AGGRESSION=12 -q2

is it just guiminer?

edit:works fine with aoclbf 1.75.. i wonder why guiminer has such trouble

speed 318 over 315 with diablo 7-17

Phateus

newbie

Activity: 52

Merit: 0

oooh, I will have to try that out... boo for AMD

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: Phateus on August 02, 2011, 12:41:36 AM

Quote from: Diapolo on August 02, 2011, 12:34:27 AM

Quote from: dishwara on August 02, 2011, 12:13:43 AM

Quote from: Diapolo on August 02, 2011, 12:10:57 AM

By the way is there any official Download link for the KernelAnalyzer 1.9?
Dia

http://developer.amd.com/TOOLS/AMDAPPKERNELANALYZER/Pages/default.aspx
http://developer.amd.com/Downloads/AMDAPPKernelAnalyzer-v1.9.1016.msi

Thank you very much! But bad news: I checked phatk 2.0, my old and my new kernel version, and all of them use fewer GPRs but 1-2 ALU OPs more ... SDK 2.5 is a sucker until (again) some optimisations have been done. Phat, how do you order the commands to achieve the best performance? Are you using the ASM code from KernelAnalyzer, or is it trial and error?

Dia

lol.... mostly trial and error. Initially, for version 1.1, I looked at filling the gaps in the VLIW assembly (finding the VLIW5 bundles that held only 4 instructions, and using barrier(0) calls to see where in the assembly the OpenCL code ends up), but that took a LONG time, and I think I am done with that... (it turned out it only gave me like 3 ALU ops anyway).

Quote from: joulesbeef on August 02, 2011, 12:36:16 AM

Quote

Seems you are wrong (at least for now):

Read it again.. he is asking if it is in 2.4. He says "I read here it will be in 2.5; isn't it already in the current one?", meaning 2.4. They answer: no, it was disabled in the current one (meaning 2.4), as it wasn't fixed in time.

at least that is how i read it... note the dates of the posts.. they have to be talking about 2.4

Yeah, I said that KernelAnalyzer 1.9 was out today saying that it supports 2.5, but 2.5 isn't out yet... probably tomorrow.

And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

Cat 11.8 preview and Cat 11.7 have the SDK 2.5 runtime, so my tests are real :-/.

Dia

Phateus

newbie

Activity: 52

Merit: 0

Quote from: Diapolo on August 02, 2011, 12:34:27 AM

Quote from: dishwara on August 02, 2011, 12:13:43 AM

Quote from: Diapolo on August 02, 2011, 12:10:57 AM

By the way is there any official Download link for the KernelAnalyzer 1.9?
Dia

http://developer.amd.com/TOOLS/AMDAPPKERNELANALYZER/Pages/default.aspx
http://developer.amd.com/Downloads/AMDAPPKernelAnalyzer-v1.9.1016.msi

Thank you very much! But bad news: I checked phatk 2.0, my old and my new kernel version, and all of them use fewer GPRs but 1-2 ALU OPs more ... SDK 2.5 is a sucker until (again) some optimisations have been done. Phat, how do you order the commands to achieve the best performance? Are you using the ASM code from KernelAnalyzer, or is it trial and error?

Dia

edit: BTW, I always thought your numbers were a couple lower than mine because you defined OUTPUT_MASK as something like "0x10"... doing that makes all my numbers match the ones on your thread

lol.... mostly trial and error. Initially, for version 1.1, I looked at filling the gaps in the VLIW assembly (finding the VLIW5 bundles that held only 4 instructions, and using barrier(0) calls to see where in the assembly the OpenCL code ends up), but that took a LONG time, and I think I am done with that... (it turned out it only gave me like 3 ALU ops anyway).

Quote from: joulesbeef on August 02, 2011, 12:36:16 AM

Quote

Seems you are wrong (at least for now):

Read it again.. he is asking if it is in 2.4. He says "I read here it will be in 2.5; isn't it already in the current one?", meaning 2.4. They answer: no, it was disabled in the current one (meaning 2.4), as it wasn't fixed in time.

at least that is how i read it... note the dates of the posts.. they have to be talking about 2.4

Yeah, I said that KernelAnalyzer 1.9 was out today saying that it supports 2.5, but 2.5 isn't out yet... probably tomorrow.

And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

joulesbeef

sr. member

Activity: 476

Merit: 250

moOo

Quote

Seems you are wrong (at least for now):

Read it again.. he is asking if it is in 2.4. He says "I read here it will be in 2.5; isn't it already in the current one?", meaning 2.4. They answer: no, it was disabled in the current one (meaning 2.4), as it wasn't fixed in time.

at least that is how i read it... note the dates of the posts.. they have to be talking about 2.4

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: dishwara on August 02, 2011, 12:13:43 AM

Quote from: Diapolo on August 02, 2011, 12:10:57 AM

By the way is there any official Download link for the KernelAnalyzer 1.9?
Dia

http://developer.amd.com/TOOLS/AMDAPPKERNELANALYZER/Pages/default.aspx
http://developer.amd.com/Downloads/AMDAPPKernelAnalyzer-v1.9.1016.msi

Thank you very much! But bad news: I checked phatk 2.0, my old and my new kernel version, and all of them use fewer GPRs but 1-2 ALU OPs more ... SDK 2.5 is a sucker until (again) some optimisations have been done. Phat, how do you order the commands to achieve the best performance? Are you using the ASM code from KernelAnalyzer, or is it trial and error?

Dia

dishwara

legendary

Activity: 1855

Merit: 1016

Quote from: Diapolo on August 02, 2011, 12:10:57 AM

By the way is there any official Download link for the KernelAnalyzer 1.9?
Dia

http://developer.amd.com/TOOLS/AMDAPPKERNELANALYZER/Pages/default.aspx
http://developer.amd.com/Downloads/AMDAPPKernelAnalyzer-v1.9.1016.msi

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: joulesbeef on August 01, 2011, 08:21:35 PM

Quote

I think someone said that SDK 2.5 is supposed to support BFI_INT natively,

sounds like it

Quote

"In SDK 2.5 we are expanding that, along with other optimizations, to generate BFI instructions."

Seems you are wrong (at least for now):

Quote

The optimization has been disabled in the current SDK due to a bug in the implementation that didn't get fixed in time.

By the way is there any official Download link for the KernelAnalyzer 1.9?

Dia

joulesbeef

sr. member

Activity: 476

Merit: 250

moOo

Quote

I think someone said that SDK 2.5 is supposed to support BFI_INT natively,

sounds like it

Quote

"In SDK 2.5 we are expanding that, along with other optimizations, to generate BFI instructions."

Phateus

newbie

Activity: 52

Merit: 0

Quote from: iopq on August 01, 2011, 04:59:10 PM

Quote from: Phateus on August 01, 2011, 12:14:36 PM

change

Code:

#ifdef BITALIGN
#define rot(x, y) amd_bitalign(x, x, (32-y))
#else
#define rot(x, y) rotate(x, y)
#endif

to

Code:

#ifdef BITALIGN
#define rot(x, y) amd_bitalign(x, x, (uint)(32-y))
#else
#define rot(x, y) rotate(x, (uint)(y))
#endif

and

Code:

#define rot2(x, y) rotate(x, y)

to

Code:

#define rot2(x, y) rotate(x, (uint)(y))

If anyone tries this out, let me know if it changes anything.

this works on 2.1 SDK

Awesome, Thanks. I'll implement the changes and release soon.

On another note, I just was searching through AMD's downloads and the KernelAnalyzer 1.9 just came out today with "Support for AMD APP SDK 2.5."... I think someone said that SDK 2.5 is supposed to support BFI_INT natively, so, maybe we can get some better performance with 2.5 *crosses fingers*
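For anyone wondering why BFI_INT keeps coming up: as far as I understand the instruction, it is a bitwise select, which is exactly the shape of SHA-256's Ch function and can also be used to build Maj. Below is a small Python model of that idea. It is only a sketch of the semantics as I understand them, not kernel code, and the function names (`bfi_int`, `ch_bfi`, `maj_bfi`, `check`) are made up for illustration.

```python
import random

MASK = 0xFFFFFFFF  # keep everything in 32 bits, like a uint


def bfi_int(mask, a, b):
    """Model of a bit select: bits of a where mask is 1, bits of b where it is 0."""
    return ((mask & a) | (~mask & b)) & MASK


def ch(e, f, g):
    """Textbook SHA-256 Ch: several separate logic operations."""
    return ((e & f) ^ (~e & g)) & MASK


def maj(a, b, c):
    """Textbook SHA-256 Maj."""
    return (a & b) ^ (a & c) ^ (b & c)


def ch_bfi(e, f, g):
    """Ch expressed as a single bit select."""
    return bfi_int(e, f, g)


def maj_bfi(a, b, c):
    """Maj as one XOR plus one bit select: where a and c differ take b, else take a."""
    return bfi_int(a ^ c, b, a)


def check(samples=1000, seed=0):
    """Compare both formulations on random 32-bit inputs."""
    rng = random.Random(seed)
    for _ in range(samples):
        x, y, z = (rng.getrandbits(32) for _ in range(3))
        if ch(x, y, z) != ch_bfi(x, y, z) or maj(x, y, z) != maj_bfi(x, y, z):
            return False
    return True
```

If the compiler emitted this select natively, the kernel would not need the BFI_INT patching option to get the shorter form.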

iopq

hero member

Activity: 658

Merit: 500

Quote from: Phateus on August 01, 2011, 12:14:36 PM

change

Code:

#ifdef BITALIGN
#define rot(x, y) amd_bitalign(x, x, (32-y))
#else
#define rot(x, y) rotate(x, y)
#endif

to

Code:

#ifdef BITALIGN
#define rot(x, y) amd_bitalign(x, x, (uint)(32-y))
#else
#define rot(x, y) rotate(x, (uint)(y))
#endif

and

Code:

#define rot2(x, y) rotate(x, y)

to

Code:

#define rot2(x, y) rotate(x, (uint)(y))

If anyone tries this out, let me know if it changes anything.

this works on 2.1 SDK

Phateus

newbie

Activity: 52

Merit: 0

Quote from: Diapolo on August 01, 2011, 02:28:40 PM

Quote from: Phateus on August 01, 2011, 12:14:36 PM

I've done a few things over the weekend (increased performance another ~.5%) and cleaned up my code a lot, so I will release another version when I figure out what is causing some of the issues that people are having...

Diapolo, I know you made some modifications to my kernel to make it compatible with 2.1, are they basically type casting issues like the one above? If I can't figure it out, I may just make all of the constants uint.

I really don't understand why the compiler needs so much help and why one has to use such ugly code to get the best performance ... I hope AMD can optimize the compiler, so that we can use clean and straightforward code. I tried to reorder the commands without changing the code itself, and it saved 3 ALU OPs ... for nothing. That sucks so bad!

The SDK 2.1 compatibility was achieved via type-casts in front of hex values in the code. Simply add (u) in front wherever you use such values.

Dia

OMG yeah, I know... They really need to work on the compiler...

I actually work at the US Patent Office and work in instruction processing... VLIW is a fairly new area and there is a lot of new work coming out.. so give it a couple years (sigh)... What you have to remember is that compiling VLIW code is extremely complicated (the kernel itself only uses 21 registers) and most of the instructions have to be based solely on the previous instruction.

from Wikipedia [http://en.wikipedia.org/wiki/Very_long_instruction_word]

Quote

As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.

As is the case with any novel architectural approach, the concept is only as useful as code generation makes it. That is, the fact that a number of special-purpose instructions are available to facilitate certain complicated operations... is useless if compilers are unable to spot relevant source code constructs and generate target code that duly utilizes the CPU's advanced offerings. Therefore, programmers must be able to express their algorithms in a manner that makes the compiler's task easier.

With all of that said, it would be amazing if you could just write:

Code:

Init1();
for (int n = 0; n != 64; n++)
{
    SHARound();
}
Init2();
for (int n = 0; n != 64; n++)
{
    SHARound();
}

and let the compiler sort it out...
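To make that wish concrete, here is roughly what the "clean" version looks like when written as plain loops. This is a Python sketch rather than OpenCL, and it is not phatk's code: the function names are made up, the constants are derived from the first 64 primes as the SHA-256 specification describes, and the two passes stand in for Init1/Init2 on the assumption that they correspond to the two SHA-256 passes of Bitcoin's double hash.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Floor of the cube root of n, found by binary search (exact integers)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Fractional bits of the square roots (initial state) and cube roots (round constants).
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha_round(state, k, w):
    """One SHA-256 round: the body a kernel would repeat 64 times."""
    a, b, c, d, e, f, g, h = state
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + k + w) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return [(t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g]


def compress(state, block):
    """Expand one 64-byte block into 64 words, then run the 64 rounds as a loop."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    work = list(state)
    for n in range(64):
        work = sha_round(work, K[n], w[n])
    return [(s + x) & MASK for s, x in zip(state, work)]


def sha256(message):
    """Pad the message and feed it through compress() block by block."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", len(message) * 8)
    state = H0
    for off in range(0, len(padded), 64):
        state = compress(state, padded[off:off + 64])
    return struct.pack(">8I", *state)


def double_sha256(data):
    """Two passes, as in Bitcoin's block hash."""
    return sha256(sha256(data))


def matches_hashlib(data):
    """Cross-check the loop version against the standard library."""
    return double_sha256(data) == hashlib.sha256(hashlib.sha256(data).digest()).digest()
```

The hand-unrolled kernels compute the same thing; the loops are just flattened and reordered by hand because the compiler will not do it well on its own.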

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: Phateus on August 01, 2011, 12:14:36 PM

I've done a few things over the weekend (increased performance another ~.5%) and cleaned up my code a lot, so I will release another version when I figure out what is causing some of the issues that people are having...

Diapolo, I know you made some modifications to my kernel to make it compatible with 2.1, are they basically type casting issues like the one above? If I can't figure it out, I may just make all of the constants uint.

I really don't understand why the compiler needs so much help and why one has to use such ugly code to get the best performance ... I hope AMD can optimize the compiler, so that we can use clean and straightforward code. I tried to reorder the commands without changing the code itself, and it saved 3 ALU OPs ... for nothing. That sucks so bad!

The SDK 2.1 compatibility was achieved via type-casts in front of hex values in the code. Simply add (u) in front wherever you use such values.

Dia

Phateus

newbie

Activity: 52

Merit: 0

Quote from: Diapolo on July 30, 2011, 08:00:59 PM

Phat, what is the effect of "LLLL" instead of "IIII" in the .py file? It seems to work even with IIII.

Thanks,
Dia

Nothing, I was trying to fix a bug with low WORKSIZE numbers which results in duplicate hashes (not sure if it is solved yet). Technically, the values are 32-bit which are "L" values instead of 16-bit "I" values, but python seems to handle both the same.
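A quick way to see why both format strings behave the same, assuming the "LLLL"/"IIII" strings in the .py file are Python `struct` format codes (my reading of the question, not something confirmed in this thread). In standard-size mode, `I` and `L` are both 4-byte unsigned integers, and the 16-bit unsigned code is `H`. Without a byte-order prefix, the size of `L` follows the platform's C `long`, so it can differ between systems. The sample values are arbitrary.

```python
import struct

# Four arbitrary 32-bit values standing in for whatever the .py file packs.
values = (0x00000001, 0x89ABCDEF, 0xFFFFFFFF, 0x12345678)

# "<" selects little-endian with standard sizes, so results do not depend on the platform.
as_I = struct.pack("<IIII", *values)
as_L = struct.pack("<LLLL", *values)

size_I = struct.calcsize("<IIII")  # 4 values x 4 bytes
size_L = struct.calcsize("<LLLL")  # 4 values x 4 bytes
size_H = struct.calcsize("<HHHH")  # 4 values x 2 bytes: H is the 16-bit code


def fits_in_h(value):
    """Return True if value can be packed as a 16-bit unsigned integer."""
    try:
        struct.pack("<H", value)
        return True
    except struct.error:
        return False
```

With the `<` prefix the two packed byte strings are identical, which would explain why swapping the letters has no visible effect.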

As for all of the other issues, I think there is an issue with SDK 2.1 with my kernel. I will try explicitly declaring the rotation constant as uint instead of int (that may fix the problem).

If anyone with SDK 2.1 wants to help out, change

Code:

#ifdef BITALIGN
#define rot(x, y) amd_bitalign(x, x, (32-y))
#else
#define rot(x, y) rotate(x, y)
#endif

to

Code:

#ifdef BITALIGN
#define rot(x, y) amd_bitalign(x, x, (uint)(32-y))
#else
#define rot(x, y) rotate(x, (uint)(y))
#endif

and

Code:

#define rot2(x, y) rotate(x, y)

to

Code:

#define rot2(x, y) rotate(x, (uint)(y))

If anyone tries this out, let me know if it changes anything.

I've done a few things over the weekend (increased performance another ~.5%) and cleaned up my code a lot, so I will release another version when I figure out what is causing some of the issues that people are having...

Diapolo, I know you made some modifications to my kernel to make it compatible with 2.1, are they basically type casting issues like the one above? If I can't figure it out, I may just make all of the constants uint.

Also, one more thing, does "rotate(x, y)" compile to 1 instruction in SDK 2.1? Running 2.4, explicitly using amd_bitalign does not improve performance (might be cleaner if I can just use rotate(x, y) regardless of whether BITALIGN is defined).
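On the question of whether the two forms are interchangeable: functionally they should be, even if a given SDK compiles them differently. Here is a Python model of both macros. It assumes `amd_bitalign(hi, lo, shift)` returns the low 32 bits of the 64-bit value `hi:lo` shifted right by `shift & 31`, which is my understanding of the cl_amd_media_ops extension, and that OpenCL's `rotate` is a left rotate. It says nothing about how many instructions either one compiles to.

```python
MASK = 0xFFFFFFFF


def rotate_left(x, y):
    """Model of OpenCL rotate(x, y) on a 32-bit value: rotate left by y bits."""
    y &= 31
    return ((x << y) | (x >> (32 - y))) & MASK


def amd_bitalign(hi, lo, shift):
    """Model of amd_bitalign: low 32 bits of (hi:lo) >> (shift & 31)."""
    return (((hi << 32) | lo) >> (shift & 31)) & MASK


def rot(x, y):
    """The BITALIGN branch of the macro: rot(x, y) = amd_bitalign(x, x, 32 - y)."""
    return amd_bitalign(x, x, 32 - y)


def forms_agree(samples=(0x00000001, 0x80000000, 0x89ABCDEF, 0xFFFFFFFF)):
    """Check that both forms give the same result for every rotation count."""
    return all(rot(x, y) == rotate_left(x, y) for x in samples for y in range(32))
```

So any difference between the two on SDK 2.1 would come from the compiler, not from the arithmetic.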

I was also thinking of possibly just precompiling different versions of the kernel and using them, so that you'd be able to use the faster 2.4 kernel even if you use SDK 2.1. I'm not sure if this is possible, but I will look into it.
