https://forum.bitcoin.org/index.php?topic=21275.0
First of all, minerd supports up to 4 vectors, and when I add this change to my kernel, it actually _slows down_ the 4 vector version. But when I override it to set 2 vectors, it speeds it up. However, once it's sped up, I then get runs of rejected shares. I tried it multiple times with and without and it does appear to be just this change that causes it, so I'm not sure what's going on.
I honestly don't know what's up with that. I saw ATI asm for the first time yesterday and can't tell you exactly what's wrong yet. All I know is that the truth tables for the Ma() function match with and without my modifications. Still, here are a couple of ideas:
1. Glancing through the docs, Radeons are VLIW5 = 4+1, with 4 'normal' pipelines and one transcendental pipe, which can do a restricted set of instructions. I don't know where BFI_INT gets executed, but if it is only in the transcendental pipe, then issuing too many BFIs can hurt performance by making that pipe the bottleneck. Check the docs and let us know, if you don't mind.
2. If (z^x) isn't already used in other places in your code, then it may be pushing up the register usage, so you're running fewer threads in parallel. Again, I don't know much about ATI, but it would be the first thing I'd check if we were on nVidia/CUDA.
3. Something else altogether...
Not sure. Sorry. If I think of anything else, I'll post it. In the meantime, it would also suck if people started getting more rejected shares... hmmm. I don't get any myself, it does work for me, but I encourage everyone to check their results (the actual number of accepted shares that they get).
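For anyone who wants to repeat the truth-table check on Ma() themselves, here is a rough sketch in Python. It rests on an assumption: that the modified version is a per-bit select driven by (z^x), i.e. take y where z and x differ and x where they agree. The function names are made up for illustration, and bitselect() here only mimics the semantics of OpenCL's bitselect(); it says nothing about how the hardware schedules BFI_INT.

```python
import random

MASK32 = 0xFFFFFFFF

def ma_reference(x, y, z):
    """Textbook SHA-256 majority: each result bit is the majority of the three input bits."""
    return (x & y) ^ (x & z) ^ (y & z)

def bitselect(a, b, c):
    """Per-bit select with OpenCL bitselect() semantics: bit of b where c is 1, else bit of a."""
    return ((a & ~c) | (b & c)) & MASK32

def ma_bitselect(x, y, z):
    """Assumed form of the modified Ma(): where z and x differ take y, otherwise take x."""
    return bitselect(x, y, z ^ x)

def truth_tables_match():
    """Compare both forms over all 8 combinations of single-bit inputs."""
    return all(
        ma_reference(x, y, z) == ma_bitselect(x, y, z)
        for x in (0, 1) for y in (0, 1) for z in (0, 1)
    )

def random_words_match(trials=1000, seed=0):
    """Spot-check both forms on random 32-bit words (fixed seed for repeatability)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = (rng.getrandbits(32) for _ in range(3))
        if ma_reference(x, y, z) != ma_bitselect(x, y, z):
            return False
    return True

if __name__ == "__main__":
    print("truth tables match:", truth_tables_match())
    print("random 32-bit words match:", random_words_match())
```

If both checks pass, the two forms are logically the same, so any rejected shares would have to come from somewhere other than the boolean identity itself (how the kernel was patched, vector width handling, and so on).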