
Topic: Modified Kernel for Phoenix 1.5 - page 9. (Read 96775 times)

newbie
Activity: 52
Merit: 0
August 02, 2011, 01:30:23 AM
sr. member
Activity: 254
Merit: 250
August 02, 2011, 01:21:31 AM
With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

I changed it again, if it still doesn't work at all, can you give me some details on the settings you are using?

Now worked! 316 Mhash/sec!

-k phatk DEVICE=1 VECTORS BFI_INT AGGRESSION=11 WORKSIZE=256

And it uses 0% CPU as usual.

Excellent work!
newbie
Activity: 52
Merit: 0
August 02, 2011, 01:15:52 AM
With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

I changed it again, if it still doesn't work at all, can you give me some details on the settings you are using?
newbie
Activity: 52
Merit: 0
August 02, 2011, 01:08:02 AM
With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs

AAAH! I think OpenCL is going to make my head explode.. lol

You are using the OpenCL rotate() instead of amd_bitalign(), what's the benefit here (is it the same under the hood)?

Dia

No, just cleaner.. since it is the same code (well... for SDK 2.4 at least)... it looks like 2.1 does not realize that they are the same and I will have to change it back...

Quote
And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

I'm still getting miner idle errors in guiminer  with  VECTORS BFI_INT -k phatk FASTLOOP=false WORKSIZE=256 AGGRESSION=12 -q2

is it just guiminer?

edit:works fine with aoclbf 1.75.. i wonder why guiminer has such trouble

speed 318 over 315 with diablo 7-17

No clue, I have not used or downloaded GUIMiner; I use aoclbf. I might be able to take a look at it after figuring out how to make it work for SDK 2.1
sr. member
Activity: 254
Merit: 250
August 02, 2011, 12:57:59 AM
With catalyst 11.6 + SDK 2.1, 975/300 setting, I'm only getting 176 mhs with phatk 2.1

With Diapolo's kernel, I was able to get 314 mhs
hero member
Activity: 772
Merit: 500
August 02, 2011, 12:56:07 AM
You are using the OpenCL rotate() instead of amd_bitalign(), what's the benefit here (is it the same under the hood)?

Dia
sr. member
Activity: 476
Merit: 250
moOo
August 02, 2011, 12:55:32 AM
Quote
And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

I'm still getting miner idle errors in guiminer  with  VECTORS BFI_INT -k phatk FASTLOOP=false WORKSIZE=256 AGGRESSION=12 -q2

is it just guiminer?

edit:works fine with aoclbf 1.75.. i wonder why guiminer has such trouble

speed 318 over 315 with diablo 7-17
newbie
Activity: 52
Merit: 0
August 02, 2011, 12:46:45 AM
oooh, I will have to try that out... boo for AMD
hero member
Activity: 772
Merit: 500
August 02, 2011, 12:45:26 AM

Thank you very much! But bad news: I checked phatk 2.0, my old and my new kernel version, and all of them use fewer GPRs but 1 - 2 more ALU OPs ... SDK 2.5 is a sucker until (again) some optimisations have been done. Phat, how do you order the commands to achieve the best performance? Are you using the ASM code from KernelAnalyzer, or is it trial and error?

Dia

lol.... mostly trial and error. Initially, for version 1.1, I looked at filling the gaps in the VLIW assembly (finding which VLIW5 bundles only had 4 instructions, and using barrier(0) instructions as markers to see where in the assembly the OpenCL code is), but that took a LONG time, and I think I am done with that... (it turned out it only gave me like 3 ALU ops anyway).


Quote
Seems you are wrong (at least for now):

Read it again.. he is asking if it is in 2.4. He says "I read here it will be in 2.5.. isn't it already in the current one?", meaning 2.4. They answer: no, it was disabled in the current one, meaning 2.4, as it wasn't fixed in time.

at least that is how i read it... note the dates of the posts.. they have to be talking about 2.4

Yeah, I said that KernelAnalyzer 1.9 was out today saying that it supports 2.5, but 2.5 isn't out yet... probably tomorrow.


And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link

Cat 11.8 preview and Cat 11.7 have the SDK 2.5 runtime, so my tests are real :-/.

Dia
newbie
Activity: 52
Merit: 0
August 02, 2011, 12:41:36 AM

Thank you very much! But bad news: I checked phatk 2.0, my old and my new kernel version, and all of them use fewer GPRs but 1 - 2 more ALU OPs ... SDK 2.5 is a sucker until (again) some optimisations have been done. Phat, how do you order the commands to achieve the best performance? Are you using the ASM code from KernelAnalyzer, or is it trial and error?

Dia

edit:  BTW, I always thought your numbers were a couple lower than mine because you defined OUTPUT_MASK as something like "0x10" or something... doing that makes all my numbers match the ones on your thread
lol.... mostly trial and error. Initially, for version 1.1, I looked at filling the gaps in the VLIW assembly (finding which VLIW5 bundles only had 4 instructions, and using barrier(0) instructions as markers to see where in the assembly the OpenCL code is), but that took a LONG time, and I think I am done with that... (it turned out it only gave me like 3 ALU ops anyway).


Quote
Seems you are wrong (at least for now):

Read it again.. he is asking if it is in 2.4. He says "I read here it will be in 2.5.. isn't it already in the current one?", meaning 2.4. They answer: no, it was disabled in the current one, meaning 2.4, as it wasn't fixed in time.

at least that is how i read it... note the dates of the posts.. they have to be talking about 2.4

Yeah, I said that KernelAnalyzer 1.9 was out today saying that it supports 2.5, but 2.5 isn't out yet... probably tomorrow.


And, I just posted another kernel... this one is much better to look at than 2.0... I got rid of all but 3 of the SHARound #defines... Check the first page for the link
sr. member
Activity: 476
Merit: 250
moOo
August 02, 2011, 12:36:16 AM
Quote
Seems you are wrong (at least for now):

Read it again.. he is asking if it is in 2.4. He says "I read here it will be in 2.5.. isn't it already in the current one?", meaning 2.4. They answer: no, it was disabled in the current one, meaning 2.4, as it wasn't fixed in time.

at least that is how i read it... note the dates of the posts.. they have to be talking about 2.4
hero member
Activity: 772
Merit: 500
August 02, 2011, 12:34:27 AM

Thank you very much! But bad news: I checked phatk 2.0, my old and my new kernel version, and all of them use fewer GPRs but 1 - 2 more ALU OPs ... SDK 2.5 is a sucker until (again) some optimisations have been done. Phat, how do you order the commands to achieve the best performance? Are you using the ASM code from KernelAnalyzer, or is it trial and error?

Dia
hero member
Activity: 772
Merit: 500
August 02, 2011, 12:10:57 AM
Quote
I think someone said that SDK 2.5 is supposed to support BFI_INT natively,

sounds like it

Quote
"In SDK 2.5 we are expanding that, along with other optimizations, to generate BFI instructions."

Seems you are wrong (at least for now):

Quote
The optimization has been disabled in the current SDK due to a bug in the implementation that didn't get fixed in time.

By the way is there any official Download link for the KernelAnalyzer 1.9?

Dia
sr. member
Activity: 476
Merit: 250
moOo
August 01, 2011, 08:21:35 PM
Quote
I think someone said that SDK 2.5 is supposed to support BFI_INT natively,

sounds like it

Quote
"In SDK 2.5 we are expanding that, along with other optimizations, to generate BFI instructions."
newbie
Activity: 52
Merit: 0
August 01, 2011, 06:16:49 PM

change
Code:
#define rot(x, y) amd_bitalign(x, x, (32-y))
#else
#define rot(x, y) rotate(x, y)
#endif
to
Code:
#define rot(x, y) amd_bitalign(x, x, (uint)(32-y))
#else
#define rot(x, y) rotate(x, (uint)(y))
#endif
and
Code:
#define rot2(x, y) rotate(x, y)
to
Code:
#define rot2(x, y) rotate(x, (uint)(y))
If anyone tries this out, let me know if it changes anything.
this works on 2.1 SDK


Awesome, Thanks.  I'll implement the changes and release soon.

On another note, I was just searching through AMD's downloads and the KernelAnalyzer 1.9 just came out today with "Support for AMD APP SDK 2.5."... I think someone said that SDK 2.5 is supposed to support BFI_INT natively, so maybe we can get some better performance with 2.5 *crosses fingers*
hero member
Activity: 658
Merit: 500
August 01, 2011, 04:59:10 PM

change
Code:
#define rot(x, y) amd_bitalign(x, x, (32-y))
#else
#define rot(x, y) rotate(x, y)
#endif
to
Code:
#define rot(x, y) amd_bitalign(x, x, (uint)(32-y))
#else
#define rot(x, y) rotate(x, (uint)(y))
#endif
and
Code:
#define rot2(x, y) rotate(x, y)
to
Code:
#define rot2(x, y) rotate(x, (uint)(y))
If anyone tries this out, let me know if it changes anything.
this works on 2.1 SDK
newbie
Activity: 52
Merit: 0
August 01, 2011, 03:16:15 PM
I've done a few things over the weekend (increased performance another ~.5%) and cleaned up my code a lot, so I will release another version when I figure out what is causing some of the issues that people are having...

Diapolo, I know you made some modifications to my kernel to make it compatible with 2.1, are they basically type casting issues like the one above?  If I can't figure it out, I may just make all of the constants uint.

I really don't understand why the compiler needs so much help and why one has to use such ugly code to get the best performance ... I hope AMD can optimize the compiler, so that we can use clean and straightforward code. I tried reordering the commands without changing the code itself, and it saved 3 ALU OPs ... for nothing. That sucks so bad!

The SDK 2.1 compatibility was achieved via type-casts in front of hex-values in the code. Simply add (u) in front, where you use such values.

Dia

OMG yeah, I know... They really need to work on the compiler...

I actually work at the US Patent Office and work in instruction processing... VLIW is a fairly new area and there is a lot of new work coming out.. so give it a couple years (sigh)... What you have to remember is that compiling VLIW code is extremely complicated (the kernel itself only uses 21 registers) and most of the instructions have to be based solely on the previous instruction.


from Wikipedia [http://en.wikipedia.org/wiki/Very_long_instruction_word]
Quote
As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.

As is the case with any novel architectural approach, the concept is only as useful as code generation makes it. That is, the fact that a number of special-purpose instructions are available to facilitate certain complicated operations... is useless if compilers are unable to spot relevant source code constructs and generate target code that duly utilizes the CPU's advanced offerings. Therefore, programmers must be able to express their algorithms in a manner that makes the compiler's task easier.

With all of that said, it would be amazing if you could just write:
Code:
Init1();
for (int n = 0; n != 64; n++)
{
SHARound();
}
Init2();
for (int n = 0; n != 64; n++)
{
SHARound();
}
and let the compiler sort it out...
hero member
Activity: 772
Merit: 500
August 01, 2011, 02:28:40 PM
I've done a few things over the weekend (increased performance another ~.5%) and cleaned up my code a lot, so I will release another version when I figure out what is causing some of the issues that people are having...

Diapolo, I know you made some modifications to my kernel to make it compatible with 2.1, are they basically type casting issues like the one above?  If I can't figure it out, I may just make all of the constants uint.

I really don't understand why the compiler needs so much help and why one has to use such ugly code to get the best performance ... I hope AMD can optimize the compiler, so that we can use clean and straightforward code. I tried reordering the commands without changing the code itself, and it saved 3 ALU OPs ... for nothing. That sucks so bad!

The SDK 2.1 compatibility was achieved via type-casts in front of hex-values in the code. Simply add (u) in front, where you use such values.

Dia
newbie
Activity: 52
Merit: 0
August 01, 2011, 12:14:36 PM
Phat, what is the effect of "LLLL" instead of "IIII" in the .py file? It seems to work even with IIII.

Thanks,
Dia

Nothing, I was trying to fix a bug with low WORKSIZE numbers which results in duplicate hashes (not sure if it is solved yet).  Technically, the values are 32-bit, and in Python's struct format codes both "L" and "I" are 4 bytes in standard-size mode (16-bit would be "H"), so Python handles both the same.
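
For anyone who wants to check the "IIII" vs "LLLL" point themselves, here is a minimal sketch. It assumes the strings in the .py file are format codes for Python's struct module (the file itself isn't shown in this thread), and the sample values and the "<" prefix are made up for illustration:

```python
import struct

# Standard-size mode ("<" prefix): sizes do not depend on the platform.
size_I = struct.calcsize("<I")   # unsigned int
size_L = struct.calcsize("<L")   # unsigned long
size_H = struct.calcsize("<H")   # unsigned short (the 16-bit code)

# Hypothetical sample values, just to compare the packed bytes.
values = (1, 2, 3, 0xFFFFFFFF)
packed_I = struct.pack("<IIII", *values)
packed_L = struct.pack("<LLLL", *values)

# Both format strings produce the same 16 bytes in standard-size mode.
same_bytes = packed_I == packed_L
print(size_I, size_L, size_H, same_bytes)
```

Note that without a prefix such as "<" or "=", struct uses native sizes, and native "L" can be 8 bytes on some 64-bit platforms, so the two codes are only guaranteed to match in standard-size mode.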

As for all of the other issues, I think there is an issue with SDK 2.1 with my kernel.  I will try explicitly declaring the rotation constant as uint instead of int (that may fix the problem)
if anyone with SDK 2.1 wants to help out:
change
Code:
#define rot(x, y) amd_bitalign(x, x, (32-y))
#else
#define rot(x, y) rotate(x, y)
#endif
to
Code:
#define rot(x, y) amd_bitalign(x, x, (uint)(32-y))
#else
#define rot(x, y) rotate(x, (uint)(y))
#endif
and
Code:
#define rot2(x, y) rotate(x, y)
to
Code:
#define rot2(x, y) rotate(x, (uint)(y))
If anyone tries this out, let me know if it changes anything.


I've done a few things over the weekend (increased performance another ~.5%) and cleaned up my code a lot, so I will release another version when I figure out what is causing some of the issues that people are having...

Diapolo, I know you made some modifications to my kernel to make it compatible with 2.1, are they basically type casting issues like the one above?  If I can't figure it out, I may just make all of the constants uint.

Also, one more thing, does "rotate(x, y)" compile to 1 instruction in SDK 2.1?  Running 2.4, explicitly using amd_bitalign does not improve performance (might be cleaner if I can just use rotate(x, y) regardless of whether BITALIGN is defined).
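
A rough host-side sketch of why the two forms give the same value, written in Python rather than OpenCL so it can be run anywhere. The function names are made up for this example; amd_bitalign is modelled from its description in AMD's cl_amd_media_ops extension (the 64-bit value src0:src1 shifted right by src2 & 31, low 32 bits kept). It only checks the arithmetic and says nothing about how many instructions a given SDK emits:

```python
MASK32 = 0xFFFFFFFF

def rotate32(x, y):
    """Model of OpenCL rotate() on a 32-bit uint: rotate left by y bits."""
    x &= MASK32
    y &= 31
    return ((x << y) | (x >> (32 - y))) & MASK32

def amd_bitalign32(src0, src1, src2):
    """Model of amd_bitalign: low 32 bits of (src0:src1) >> (src2 & 31)."""
    wide = ((src0 & MASK32) << 32) | (src1 & MASK32)
    return (wide >> (src2 & 31)) & MASK32

def mismatches(x):
    """Rotation amounts y in 0..31 where the two rot() definitions disagree."""
    return [y for y in range(32)
            if rotate32(x, y) != amd_bitalign32(x, x, 32 - y)]

# Check a few arbitrary 32-bit test words.
for word in (0x12345678, 0x80000001, 0xFFFFFFFF, 0x00000000):
    print(hex(word), mismatches(word))
```

So in this model rot(x, y) comes out the same whichever branch of the #if is taken; whether the compiler turns rotate() into a single instruction is a separate question that depends on the SDK.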

I was also thinking of possibly just precompiling different versions of the kernel and using them, therefore, you'd be able to use the faster 2.4 kernel even if you use SDK 2.1.  I'm not sure if this is possible, but I will look into it.