Hi again,
Finally I'm at the point where, for the first time ever, it produced a correct result.
The speed test was surprisingly good: HD7770 @ 1000MHz (640 streams, GCN 1.0 chip), Cat 14.9 (the 20% slower driver), total work-items: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x, where the baseline is the OpenCL implementation (found on amd.com in Wolf0's post), which does 4.00 MH/s.
And the first optimization was really a cheap shot. Unlike OCL, I was able to keep it under 128 VGPRs (I use 120 currently, it was kinda close). So, as each vector ALU can now choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms, 9.365 MH/s, gain: 2.34x.
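For reference, the arithmetic behind that jump, as a tiny C sketch (the helper name is made up; it assumes the usual GCN 1.0 figures of a 256-entry VGPR file per SIMD, allocation in granules of 4 registers, and a cap of 10 waves per SIMD):

/* Tiny sketch of the GCN 1.0 occupancy arithmetic (helper name made up). */
static int waves_per_simd(int vgprs_per_wave)
{
    int alloc = ((vgprs_per_wave + 3) / 4) * 4;   /* round up to the allocation granularity */
    int waves = 256 / alloc;                      /* how many waves fit in the VGPR file */
    return waves > 10 ? 10 : waves;               /* hardware cap of 10 waves per SIMD */
}
/* waves_per_simd(129..256) == 1 -> no latency hiding at all,
   waves_per_simd(120)      == 2 -> two wavefronts to pick from. */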
And I'm full of ideas to try.
Next will be to shrink the code to fit into the 32KB instruction cache. Right now it is 300KB; it's a massive macro unroll at the moment. The original pallas' OCL version is 110KB, so I wonder why mine is 3x bigger. Anyway, on GCN we can have loops with only 1 cycle of overhead, or I can even write subroutines with call/ret instructions, so I've got to try how fast it is when the instruction cache has no misses at all.
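To make the size trade-off concrete, here is an illustrative C sketch with a made-up toy round (the real kernel isn't shown in the post): the unrolled shape pastes the round body once per round, while the rolled shape keeps one copy plus a cheap backward branch, which is what has to happen for the code to fit a 32KB instruction cache.

#include <stdint.h>

#define N_ROUNDS 16

/* Toy round macro standing in for one round of the real kernel; as a macro
 * it gets pasted at every use, like the asm macro unroll. */
#define TOY_ROUND(s, r) \
    ((s)[(r) & 15] = ((s)[(r) & 15] ^ (s)[((r) + 1) & 15]) * 0x9E3779B97F4A7C15ULL + (uint64_t)(r))

/* Unrolled shape: N_ROUNDS pasted copies of the body -> big code, no branches. */
static void rounds_unrolled(uint64_t s[16])
{
    TOY_ROUND(s, 0);  TOY_ROUND(s, 1);  TOY_ROUND(s, 2);  TOY_ROUND(s, 3);
    TOY_ROUND(s, 4);  TOY_ROUND(s, 5);  TOY_ROUND(s, 6);  TOY_ROUND(s, 7);
    TOY_ROUND(s, 8);  TOY_ROUND(s, 9);  TOY_ROUND(s, 10); TOY_ROUND(s, 11);
    TOY_ROUND(s, 12); TOY_ROUND(s, 13); TOY_ROUND(s, 14); TOY_ROUND(s, 15);
}

/* Rolled shape: one copy of the body plus a backward branch per pass.
 * On GCN that branch is cheap, so the point of rolling up is fitting the
 * instruction cache, not saving ALU work. */
static void rounds_rolled(uint64_t s[16])
{
    for (int r = 0; r < N_ROUNDS; ++r)
        TOY_ROUND(s, r);
}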
An OpenCL thing: while I simplified the code (I chopped out the first/last round optimizations because they would be hard to implement in asm atm), I noticed something I already knew from the past: the OpenCL -> LLVM -> AMD_IL -> GCN_ASM toolchain eliminates all the constant calculations and all the calculations whose results are not used at all. I watched the timings while making these modifications and they stayed around 4 MH/s. Sometimes it dropped below 3.7 when I put measurement code at various places to compare the original kernel with my kernel: if(gid==1234 && flag==1) for(int i=0; i<16; ++i) output[i] = g[i];
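Spelled out, that trick looks roughly like this minimal OpenCL sketch (gid, flag, g and output follow the fragment above; the real hashing work is replaced by a placeholder so the kernel stands on its own):

__kernel void measured(__global ulong *output, const uint flag)
{
    const uint gid = get_global_id(0);
    ulong g[16];

    /* placeholder for the real work that fills g[] */
    for (int i = 0; i < 16; ++i)
        g[i] = (ulong)gid * (ulong)(i + 1);

    /* keep the result live for exactly one work-item, otherwise the
       OpenCL -> LLVM -> AMD_IL -> GCN toolchain throws the whole thing away */
    if (gid == 1234 && flag == 1)
        for (int i = 0; i < 16; ++i)
            output[i] = g[i];
}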
Great progress, very interesting!
The first improvement, 1.17x, is about the same as the 20% that is lost on 14.9 compared to 14.6 beta, so the two implementations are equivalent.
The second, 2.34x, is really impressive: I have tried multiple times to reduce the number of variables as much as possible (down to 3x16 ulong arrays, 2 ulongs and 2 uints), but the results were always worse, so probably that improvement can't be done in OpenCL, or at least I don't know how.
The same goes for code size and the instruction cache: I was able to squeeze it down to about 50KB, but at a speed loss.
About the compiler that can eliminate the constant calculations: I noticed that, but doing it by hand works best, both in terms of speed and kernel size.
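As a generic illustration of what "by hand" can mean here (made-up names, not the actual kernel): compute whatever is constant for every work-item on the host and pass it in, so nothing is left for the compiler to eliminate:

/* Generic sketch of hand-done constant folding: the midstate is the same
 * for all work-items, so it is computed once on the host and passed in,
 * instead of being recomputed (or compiler-eliminated) inside the kernel. */
__kernel void search(__global ulong *output, const ulong precomputed_midstate)
{
    const uint gid = get_global_id(0);
    /* only the per-work-item part remains in the kernel body */
    output[gid] = precomputed_midstate ^ (ulong)gid;
}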
Finally, a question about your work: do you plan to open-source it?