[ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 15.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 05, 2015, 12:12:13 PM

First 2 optimizations are done, I wrote a blog post about them. I'm at 2.65x now.

Thanks very much!
Unfortunately it appears the two optimisations are hard to implement in opencl: minimum code size I was able to achieve was 50K, far from 32k, and reducing the number of variables as much as possible didn't provide any speed up. Maybe the number of vregs is still higher than 128...

realhet

newbie

Activity: 32

Merit: 0

First 2 optimizations are done, I wrote a blog post about them. I'm at 2.65x now.

realhet

newbie

Activity: 32

Merit: 0

Hi,

The Groestl asm code is opensource (I just uploaded it). My compiler and IDE is closed source though, but once you compiled the kernel with it into an .ELF binary, you can use it even on Linux, not just Win.

The first asm version is documented on my blog. Check it out here -> http://realhet.wordpress.com/
It's only a development version, and the kernel parameters are incompatible with Pallas's OpenCL kernel. I have a hard time reverse engineering how params are passed through registers, not mentioning that it can be different in every catalyst version so I keep parameters simple. One buffer with pinned memory for everything data IO is the fastest anyways.
I'm planning to post about many optimizations. Let's see how far can I go. With using only 128 VGPRS it is already at 2.3x speedup and I'm expecting more. Grin

I believe that OCL is so generalized and is kinda far from the actual GCN hardware that it is worth for some projects to go low level. (Not all projects: For example I have failed with LiteCoin. It's better for it to stay in maintainable OCL code.)

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 03, 2015, 02:34:53 PM

Hi again,

Finally I'm at the point that it first time ever produced a correct result.
The speed test was surprisingly good: HD7770 1000MHz (640 streams, GCN1.0 chip), Cat:14.9(the 20% slower driver), total workitems: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x where the baseline is the opencl implementation (found on amd.com at Wolf0's post) which is 4.00MH/s.

And the first optimization was really a cheap shot Grin

. Unlike ocl, I was able to made it under 128 VGPRS (I use 120 currently, it was kinda close). So as each Vector ALU can choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms 9.365 MH/s gain: 2.34x

And I'm full of ideas to try Cheesy

Next will be to shrink the code to fit into the 32KB instruction cache. Now it is 300kb, it's a massive macro unroll at the moment. The original pallas' ocl version is 110kb, wonder why 3x the multiplier though. Anyways, on GCN we can have loops with only 1 clycle overheads, or even I can write subroutines with call/ret instructions, so I gotta try that fast it is when the instruction cache has no misses at all.

OpenCL thing: While I simplify the code (I chopped down the first/last round optimizations because they would be hard to implement in asm atm) I noticed that I knew already from the past: The OpenCL->llvm-> amd_il -> gcn_asm toolchain will eliminate all the constant calculations and all the calculations whose results is not used at all. I watched the times while making these modifications and it stayed around 4MH/s. Sometimes it dropped below 3.7 when I put measurement code at various places to compare the original kernel with my kernel: if(gid==1234 && flag==1) for(int i=0; i<16; ++i) output = g;

Great progress, very interesting!
The first improvement, 1.17x, is about the same as the 20% that is lost on 14.9 compared to 14.6 beta, so the two implementations are equivalent.
The second, 2.34x, is really impressive: I have tried multiple times to reduce the number of variables as much as possible (down to 3x16 ulong arrays, 2 ulong and 2 uint), but the results were always worse, so probably that improvement can't be implemented in opencl, or at least I don't know how to.
The same for code size and instruction cache: I was able to squeeze it to about 50K, but at a speed loss.
About the compiler than can eliminate the constant calculations: I noticed that, but doing it by hand works best both in terms of speed and kernel size.
Finally, a question about your work: do you plan to opensource it?

realhet

newbie

Activity: 32

Merit: 0

Hi again,

Finally I'm at the point that it first time ever produced a correct result.
The speed test was surprisingly good: HD7770 1000MHz (640 streams, GCN1.0 chip), Cat:14.9(the 20% slower driver), total workitems: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x where the baseline is the opencl implementation (found on amd.com at Wolf0's post) which is 4.00MH/s.

And the first optimization was really a cheap shot Grin

. Unlike ocl, I was able to made it under 128 VGPRS (I use 120 currently, it was kinda close). So as each Vector ALU can choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms 9.365 MH/s gain: 2.34x

And I'm full of ideas to try Cheesy

Next will be to shrink the code to fit into the 32KB instruction cache. Now it is 300kb, it's a massive macro unroll at the moment. The original pallas' ocl version is 110kb, wonder why 3x the multiplier though. Anyways, on GCN we can have loops with only 1 clycle overheads, or even I can write subroutines with call/ret instructions, so I gotta try that fast it is when the instruction cache has no misses at all.

OpenCL thing: While I simplify the code (I chopped down the first/last round optimizations because they would be hard to implement in asm atm) I noticed that I knew already from the past: The OpenCL->llvm-> amd_il -> gcn_asm toolchain will eliminate all the constant calculations and all the calculations whose results is not used at all. I watched the times while making these modifications and it stayed around 4MH/s. Sometimes it dropped below 3.7 when I put measurement code at various places to compare the original kernel with my kernel: if(gid==1234 && flag==1) for(int i=0; i<16; ++i) output = g;

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: realhet on December 31, 2014, 05:59:08 PM

Hi All,

I registered here because I need a little help from you, who develops this OpenCL kernel.
A month ago I've found the Groestl algo on the amd dev forums, thanks to Wolf0 who mentioned it on there. I thought it will be a good algo to test my skills in GCN asm, and I'd like to play with it, maybe I can optimize it better than the OCL compiler (or maybe not, but at least I can learn from it anyways).

So the help I'm seeking is this:
- Please send me the latest version of this kernel (I see everyone altering it a bit, just don't know which is which)
- And pls give me a test vector with these things:
- global kernel dimensions, workgroup size(I guess it's 256)
- kernel parameters: dump "char *block", and the "target" value
- And of course the above testcase must find a GroestlCoin hash.

Thank you in advance

(I already sent it to Wolf0 on the amd dev forums, but the moderation there can take more time there and later I found this more appropriate place for my question)

And have a Happy New Year, btw

I don't check there often - how exactly do you do GCN ASM? I'm interested.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Is anyone willing to donate or lend a 285 so I can optimise for Tonga?

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Yes they are two chained iterations of groestl.
But they run a bit different code: the first is optimised because part of the input is known in advance and the second because the whole hash is not needed.

utahjohn

hero member

Activity: 630

Merit: 500

@realhet
please share your work with the rest of us if it works out that assembly optimization works out. Looked at r9-285 review today, looks promising as long as smaller memory bus (256 vs 384) doesn't bottleneck. Should be faster than 280x and on par or maybe even better than 290 with lower power requirement ...
May need tweaks for each architecture ... can it be written to detect which card it's running on and auto select best?

A quote from AnandTech review :

Quote

A complete Tonga configuration will contain 2048 SPs, just like its Tahiti predecessor, with 1792 of those SPs active on R9 285. This is paired with the card’s 32 ROPs attached to a 256-bit memory bus, and a 4-wide (4 geometry processor) frontend. Compared to Tahiti the most visible change is the memory bus size, which has gone from 384-bit to 256-bit. In our look at GCN 1.2 we’ll see why AMD is able to get away with this – the short answer is compression – but it’s notable since at an architectural level Tahiti had to use a memory crossbar between the ROPs and memory bus due to their mismatched size (each block of 4 ROPs wants to be paired with a 32bit memory channel). The crossbar on Tahiti exposes the cards to more memory bandwidth, but it also introduces some inefficiencies of its own that make the subject a tradeoff.

Meanwhile Tonga’s geometry frontend has received an upgrade similar to Hawaii’s, expanding the number of geometry units (and number of polygons per clock) from 2 to 4. And there are actually some additional architectural efficiency improvements in here that should further push performance per clock beyond what Hawaii can do in the real world.

realhet

newbie

Activity: 32

Merit: 0

"T0 and T1 are not in gpu ram: it would be much slower if they were."

Thanks for the ideas!

Actually I knew it from the disasm, that it uses ram instead of LDS for T0, T1. (Note that there is no such thing as constant memory in GCN. It can read a single value with the Scalar ALU and broadcast it across all the wavefront's workitems or it can read 64 values for a whole wavefront by the Vector ALU. Because T0 is addressed by data, it must be read by the VALU using L1 cache (there is a scalar cache too)).
And from there I had the idea of balancing the two sources (LDS and L1).

I did a simple test: renamed T0 and T1, and allocated a new T0 and a T1 from __local. And then initialized them properly. Result: all tbuffer memory read instructions disappeared from the disasm, and the hash rate is dropped from 3.99 MH/s down to 3.841. Don't know how much is the penalty of copying T0, and T1 into the LDS, though.
By the 'textbook': L1 cache can read 4bytes/cycle, LDS: 8bytes/cycle

And yes, the OpenCL compiler is totally unpredictable.

Important question: In the MH/s calculation 1 kernel thread execution means 2 Hashes, right?

(I have a HD7770 @1000MHz, and it's at 4MH/s which looks similar to Wolf0's report on dev.amd.com: R9 290 @1200 20MH/s. Using 14.9 where the compiler generates slower code.)

Now I have to convert all the math into asm. That's painful Cheesy

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on November 23, 2014, 11:15:49 AM

Quote from: Wolf0 on November 23, 2014, 10:06:50 AM

Quote from: pallas on November 23, 2014, 09:39:19 AM

Quote from: cryptonit on November 23, 2014, 07:23:11 AM

@pallas could u find the actual state of the art mining software for DMD Groestl and post links in DMD ANN we then will update software on website

it would be great if it include ur performance boost tricks already.....

i think no one from our core team runs AMD cards any longer so ur help would be welcome

the problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on 14.9 drivers (only 20 Mh/s vs 25 with 14.6), so it's not enough to just replace diamond.cl on sgminer 4.1 or 5.
that's why I still prefer people visit this post, with all the info and troubleshooting, for best performance.
the only way to make it clean is creating a fork of sgminer, for tahiti and hawaii cards only, with the precompiled binary; some changes are needed in order for it to always use the binary and not compile the cl sources.
not sure I like it but it might work for many... what do you think?

Just an aside - I've gotten the same results - 21MH/s vs. 25MH/s. It's frustrating - but all I've tried is the lookup table implementation, so far.

Well, that means there is probably little room for improvements on that kind of implementation.
I'm curious to see if a bitslice version can be faster on AMD gpus, but I have no time (and no interest because of negative revenue) to try it myself.

I think it might be - 14.9 killed my X11 hashrate at first, down from 10MH/s on 290X to 2 point something. After redesigning Groestl, still based on lookup tables, I got it back up to 6.5MH/s or so. Still dismal...

Atomicat

legendary

Activity: 952

Merit: 1002

Learn something new every day. My instinct is to push it till it moves, crank it to 11, but that doesn't work with the R9-290. Doesn't work because it's throttling for power considerations long before you're hitting 1150. Just dropped my voltages right down and finally got 23.5 at 1125, I-20. New understanding of how to handle this card will make for better benchmarks, for true.

Oh, nice price jump today, from 60k to 70k. Yeah, I'll take credit for that. Put some orders up last night, woke up to find that I basically owned it on Cryptsy! Drop a line with your DMD wallet address, I'll give you well earned reward from my ill gotten gains.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 01, 2015, 04:22:12 AM

Quote from: Wolf0 on December 31, 2014, 07:17:47 PM

how exactly do you do GCN ASM? I'm interested.

I wrote an assembler for it. You can try it at realhet.wordpress.com. (Use Cat 13.4 or older, otherwise examples will crash.)

My first thoughts compiling the OCL kernel (on a 7770):
- Its 2.5 times bigger than the instruction cache. (and there are no loops in it, so I guess it often reads from ram.)
- T0 and T1 is located in the gpu ram.
- VReg count is above 128. -> that allows only the minimum no of 4 wavefronts/CU. So there are no
latency hiding via parallel wavefronts.
- too short kernel with too much initialization: Ideally I'd let every workgroup run for a minimum of 0.5 sec. So kernel launch and LDS table initialization would take no time compared to the actual work.
- better instructions: BitFieldExtract for 64bit rotate, ds_read2_b64 for 128 bit LDS read.
- balancing load between LDS and L1 cache

I don't know which of the above is an actual bottleneck or will be usefull, but I wanna find out.

I'm going to try your assembler, very interesting projects!
About your observations, first of all keep in mind that the compiler is pretty unpredictable: many optimizations just do not make sense but they work. Also I only tested it with Tahiti and Hawaii cards.
Kernel size: it can easily be made smaller (for example by including a single table instead of 2), but in all my tests it doesn't bring any advantage.
T0 and T1 are not in gpu ram: it would be much slower if they were. They are in constant ram, I believe.
Short kernel: even though you might design it in order to process multiple hashes in a single run, I think it's not worth. Simple proof: algos which are tens of times faster than groestl, like keccak, still do a single hash per kernel run. Another reason is that making the kernel last longer will result in more rejected shares.
Balancing load between local ram and cache (or whatever balancing of memory reads): I believe that many optimizations that do not make sense, work because they intrudoduce little delays that permit better memory reads between the threads. They sort of better fit together. In fact, modifying the code on other parts of the code may make the same optimization worthless. Interesting speed variations may be brought by switching instructions or grouping local ram reads differently, for example.

Hope that helps.

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on November 23, 2014, 09:39:19 AM

Quote from: cryptonit on November 23, 2014, 07:23:11 AM

@pallas could u find the actual state of the art mining software for DMD Groestl and post links in DMD ANN we then will update software on website

it would be great if it include ur performance boost tricks already.....

i think no one from our core team runs AMD cards any longer so ur help would be welcome

the problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on 14.9 drivers (only 20 Mh/s vs 25 with 14.6), so it's not enough to just replace diamond.cl on sgminer 4.1 or 5.
that's why I still prefer people visit this post, with all the info and troubleshooting, for best performance.
the only way to make it clean is creating a fork of sgminer, for tahiti and hawaii cards only, with the precompiled binary; some changes are needed in order for it to always use the binary and not compile the cl sources.
not sure I like it but it might work for many... what do you think?

Just an aside - I've gotten the same results - 21MH/s vs. 25MH/s. It's frustrating - but all I've tried is the lookup table implementation, so far.

realhet

newbie

Activity: 32

Merit: 0

Quote from: Wolf0 on December 31, 2014, 07:17:47 PM

how exactly do you do GCN ASM? I'm interested.

I wrote an assembler for it. You can try it at realhet.wordpress.com. (Use Cat 13.4 or older, otherwise examples will crash.)

My first thoughts compiling the OCL kernel (on a 7770):
- Its 2.5 times bigger than the instruction cache. (and there are no loops in it, so I guess it often reads from ram.)
- T0 and T1 is located in the gpu ram.
- VReg count is above 128. -> that allows only the minimum no of 4 wavefronts/CU. So there are no
latency hiding via parallel wavefronts.
- too short kernel with too much initialization: Ideally I'd let every workgroup run for a minimum of 0.5 sec. So kernel launch and LDS table initialization would take no time compared to the actual work.
- better instructions: BitFieldExtract for 64bit rotate, ds_read2_b64 for 128 bit LDS read.
- balancing load between LDS and L1 cache

I don't know which of the above is an actual bottleneck or will be usefull, but I wanna find out.

realhet

newbie

Activity: 32

Merit: 0

Hi All,

I registered here because I need a little help from you, who develops this OpenCL kernel.
A month ago I've found the Groestl algo on the amd dev forums, thanks to Wolf0 who mentioned it on there. I thought it will be a good algo to test my skills in GCN asm, and I'd like to play with it, maybe I can optimize it better than the OCL compiler (or maybe not, but at least I can learn from it anyways).

So the help I'm seeking is this:
- Please send me the latest version of this kernel (I see everyone altering it a bit, just don't know which is which)
- And pls give me a test vector with these things:
- global kernel dimensions, workgroup size(I guess it's 256)
- kernel parameters: dump "char *block", and the "target" value
- And of course the above testcase must find a GroestlCoin hash.

Thank you in advance

(I already sent it to Wolf0 on the amd dev forums, but the moderation there can take more time there and later I found this more appropriate place for my question)

And have a Happy New Year, btw

lpedretti

full member

Activity: 152

Merit: 100

I was having issues using the optimized cl and precompiled binaries, no HW but there were very ocassional shares and pools reported me a very low hashrate, however the problem was the sgminer version i was using, i'm now using the sgminer-develop that has neoscrypt optimized kernels and with that version it works like a charm!
Running Lubuntu 14.04 with 14.x (don't remember which one)
Clock at 930, 0.95v, 13.5 Mh/s each XFX-7970DD and Gigabyte 280x windforce

Great job!

Best regards!

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Could someone please share their hashrate with r9 285? I'm curious to see if it outperforms the 280 and how much power it uses.

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on November 04, 2014, 08:27:57 AM

Quote from: Wolf0 on November 04, 2014, 08:09:20 AM

Quote from: pallas on November 04, 2014, 07:50:23 AM

Thanks, but I've already tried any combination of bitwise operations and vectors (as_uchar...): I could make it work but hashrate is about 20 Mh/s vs 25 Mh/s of 14.6 beta.

Ah, I see - I just saw it go from 7MH/s to... I think 20, on 14.9, so I figured it worked; never mind, then.

It's funny how some little changes lead to huge hashrate drops (depending on compiler version); but it's true for memory intensive algos only, as far as I can see.
Maybe your own version doesn't have this problem, then ;-)

Not true for only memory intensive algos - one little screwup and the idiot compiler will double the size of your code, it won't fit in the code cache, and be slow lol

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: Wolf0 on November 23, 2014, 10:06:50 AM

Quote from: pallas on November 23, 2014, 09:39:19 AM

Quote from: cryptonit on November 23, 2014, 07:23:11 AM

@pallas could u find the actual state of the art mining software for DMD Groestl and post links in DMD ANN we then will update software on website

it would be great if it include ur performance boost tricks already.....

i think no one from our core team runs AMD cards any longer so ur help would be welcome

the problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on 14.9 drivers (only 20 Mh/s vs 25 with 14.6), so it's not enough to just replace diamond.cl on sgminer 4.1 or 5.
that's why I still prefer people visit this post, with all the info and troubleshooting, for best performance.
the only way to make it clean is creating a fork of sgminer, for tahiti and hawaii cards only, with the precompiled binary; some changes are needed in order for it to always use the binary and not compile the cl sources.
not sure I like it but it might work for many... what do you think?

Just an aside - I've gotten the same results - 21MH/s vs. 25MH/s. It's frustrating - but all I've tried is the lookup table implementation, so far.

Well, that means there is probably little room for improvements on that kind of implementation.
I'm curious to see if a bitslice version can be faster on AMD gpus, but I have no time (and no interest because of negative revenue) to try it myself.

Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 15. (Read 61261 times)