Wait wait wait. Are we sure uint16 is such a good idea? Last time I tried >4 (which was before 2.6, btw, I haven't tested with 2.6), it would crash in the compiler. Also, does anyone have a count on the number of registers per CU? There might not be enough registers to handle that.
I'm not sure if it's a good idea or not so I wanted to measure it
GCN has 64KB worth of registers per CU, and like you said I'm not sure if that's enough. The reason for my curiosity was because GCN's compute units each contain 4 x SIMD units with a width of 16 elements (same size as Larrabee & Intel's MIC, coincidentally), and I recall reading somewhere that each of these SIMD units can retire one 16-way instruction every 4 cycles, so those 16element vectors kind of rang out at me. I also wanted to get familiar with the OpenCL bitcoin mining code and thought it would be a neat exercise (which it was!). Nice code by the way.
I can say for sure that 16element vectors DO compile with the drivers that came with the card.
The -ds code dump for 16 element vectors came out nice and clean, although the last few lines where the result is stored in output seem a bit branchy. It looks something like this:
if(XG2.s0 == 0x136032ED) { output[Xnonce.s0 & 0xF] = Xnonce.s0; }
if(XG2.s1 == 0x136032ED) { output[Xnonce.s1 & 0xF] = Xnonce.s1; }
if(XG2.s2 == 0x136032ED) { output[Xnonce.s2 & 0xF] = Xnonce.s2; }
...
...
if(XG2.sd == 0x136032ED) { output[Xnonce.sd & 0xF] = Xnonce.sd; }
if(XG2.se == 0x136032ED) { output[Xnonce.se & 0xF] = Xnonce.se; }
if(XG2.sf == 0x136032ED) { output[Xnonce.sf & 0xF] = Xnonce.sf; }
I tried replacing it with a branch-less expression using shuffle() and vstore16() but haven't managed to get it working. What I've come up with looks something like this:
x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);
Anyhow I'm sure that my code modifications are doing all sorts of dumb things. I'm still learning how it all works so please ignore.
Also, check some of the larger -vs, -v 40 is two sets of uint4 and -v 44 does three uint4s (unlike cgminer, -v 4 does two uint2s).
I've tried all of the different -v settings available (according to the source) but haven't been able to get any higher than the 666MH/s with the default settings and 3 compute threads.