Thanks for the information. Please keep us updated on the progress!
Please tell me if I can be of any help.
I don't know. I do know one thing for certain now, though: Kachur's Blake was about the same speed as mine. His BMW, however, needed a bit of work. Mine improved the overall X11 hashrate by 1.35% (remember that BMW itself is quite a small part of X11, so the raw improvement of my BMW over his is much larger).
EDIT: search2, originally Groestl-512, did not take to a simple kernel replacement and will have to be investigated further (manual study of the disassembly). Skipping for now.
EDIT2: search2 may have been fucked not because of a difference in output, but because of where the bloody constants sit in global memory. For JH, I'm going to make an all-new test kernel which takes a read-only buffer for JH's constants rather than trying to reference constant memory right now. Simpler. I should then be able to put that in place of the Kachur JH and modify SGMiner to pass a constant buffer to that kernel.
Is search2 faster than yours? Or is it just SIMD?
Maybe Kachur has found a way to make AES-like algos better...
BTW I wouldn't mind a frankenbin if it's faster and stable ;-)
I can't tell - without a straight-up replacement of a kernel working, I dunno if he's done some kind of fuckery with part of a hash in one kernel and part in another, for example. What I suspect is that SIMD has been cut into two parts (at least).
Now, even if his Groestl is faster than mine, my current Groestl is outdated anyway. My R&D area has a bitsliced Groestl that I have not yet played with much - parallelization using 4 work-items, like it's done in CUDA, should be possible. I can drop to GCN ASM for ds_swizzle_b32 - that limits me to a 4-way, as it's not a 32-way shuffle like CUDA's, but it's enough for me. I've just got a lot to do atm - maybe there is something we could work on together... a Groestl, perhaps? If you could look at the code and see if you could split it over multiple work-items and use LDS for the data sharing, I could probably remove said LDS usage by disassembling and modifying the kernel before reassembling it.
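For anyone following along, here is a rough host-side model of what a 4-way swizzle does: each lane in a group of 4 consecutive lanes reads a value from another lane of the same group, with no LDS round trip. This is only a sketch in Python, not kernel code. The helper names quad_swizzle and quad_pattern are made up for illustration, and the pattern encoding (bit 15 set, four 2-bit selectors in the low byte) is my reading of the quad-permute mode in the GCN3 ISA manual, so treat it as an assumption and check it against the manual before relying on it.

```python
WAVE_SIZE = 64  # lanes in one GCN wavefront

def quad_swizzle(values, sel):
    """Model a swizzle within each group of 4 consecutive lanes.

    values: one value per lane (len == WAVE_SIZE).
    sel:    four source selectors (0..3); lane i of every quad reads
            from lane sel[i] of the same quad.
    """
    assert len(values) == WAVE_SIZE and len(sel) == 4
    assert all(0 <= s <= 3 for s in sel)
    out = []
    for lane in range(WAVE_SIZE):
        base = lane & ~3                      # first lane of this quad
        out.append(values[base + sel[lane & 3]])
    return out

def quad_pattern(sel):
    """ASSUMED encoding: quad-mode bit 15 plus four 2-bit selectors."""
    return 0x8000 | sel[0] | (sel[1] << 2) | (sel[2] << 4) | (sel[3] << 6)

if __name__ == "__main__":
    lanes = list(range(WAVE_SIZE))
    # Rotate by one within each quad: lane 0 reads lane 1, and so on.
    print(quad_swizzle(lanes, (1, 2, 3, 0))[:8])
    print(hex(quad_pattern((1, 2, 3, 0))))
```

The point is that data never crosses a quad boundary, which is why 4 work-items per hash is the natural split with this instruction.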
SIMD: tonight I was thinking about it and slicing into two parts is the natural way of doing it; I think I could try that. The only little annoyance is that the data to be passed between the (at least two) parts won't just be a hash but a bigger set of data, so the standard sgminer searchX() system wouldn't work.
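To make the buffer problem concrete, here is a small Python model of a hash split across two "kernels". It is a sketch under loud assumptions: STATE_WORDS is a hypothetical size for the data handed from the first half to the second (the real figure depends on where SIMD is cut), and part1/part2 are placeholder arithmetic, not the SIMD hash. The only thing it shows is that the hand-off needs its own per-thread scratch buffer, larger than the 16-word hash slot the chained kernels normally share.

```python
HASH_WORDS = 16    # 512-bit hash = 16 x 32-bit words
STATE_WORDS = 64   # HYPOTHETICAL size of the state passed between halves
MASK = 0xFFFFFFFF

def buffer_bytes(threads, words_per_thread):
    """Size in bytes of a buffer holding words_per_thread uint32 per thread."""
    return threads * words_per_thread * 4

def part1(hash_words):
    """Placeholder first half: expands 16 words into STATE_WORDS words."""
    return [(w + r) & MASK for r in range(STATE_WORDS // HASH_WORDS)
            for w in hash_words]

def part2(state):
    """Placeholder second half: folds the state back into 16 words."""
    out = [0] * HASH_WORDS
    for i, w in enumerate(state):
        out[i % HASH_WORDS] ^= w
    return out

def run_split(hashes):
    threads = len(hashes)
    scratch = [0] * (threads * STATE_WORDS)   # extra buffer, one slot per thread
    for gid in range(threads):                # "kernel" 1 writes its slot
        scratch[gid * STATE_WORDS:(gid + 1) * STATE_WORDS] = part1(hashes[gid])
    return [part2(scratch[gid * STATE_WORDS:(gid + 1) * STATE_WORDS])
            for gid in range(threads)]        # "kernel" 2 reads it back

if __name__ == "__main__":
    print(buffer_bytes(1024, HASH_WORDS), buffer_bytes(1024, STATE_WORDS))
```

With these made-up sizes the scratch buffer is four times the hash buffer, and the host has to allocate it and pass it to both halves.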
GROESTL (and similar): I always had the idea that nvidia had to do the bitslice thing because its shared memory was slower than LDS on GCN; in fact nvidia bitsliced is on par with GCN LDS. So if bitslice on GCN is presumed to be slower than on nvidia, I wouldn't even try it.
You might not be looking at the big picture with Groestl - look at that fucking shitty number of waves in flight you get due to LDS (ab)use.
That's an issue with <= Tahiti only, hence why I hate optimizing for those chips ;-)
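A rough way to put numbers on the LDS point: a GCN compute unit has 64 KB of LDS shared by every workgroup resident on it, so the LDS footprint of one workgroup caps how many waves can be in flight. The Python sketch below models only that one limit and ignores VGPR, SGPR and workgroup-count limits. TABLE_BYTES is a hypothetical lookup-table footprint chosen for illustration, not a measurement of any kernel discussed here.

```python
# Figures for one GCN compute unit (CU); only the LDS limit is modelled.
LDS_BYTES_PER_CU = 64 * 1024
SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 10
WAVE_SIZE = 64

def waves_per_cu(lds_bytes_per_group, local_size):
    """Upper bound on resident waves per CU from LDS usage alone."""
    waves_per_group = (local_size + WAVE_SIZE - 1) // WAVE_SIZE
    hw_cap = SIMDS_PER_CU * MAX_WAVES_PER_SIMD
    if lds_bytes_per_group == 0:
        return hw_cap
    groups = LDS_BYTES_PER_CU // lds_bytes_per_group
    return min(groups * waves_per_group, hw_cap)

# HYPOTHETICAL footprint: 8 tables x 256 entries x 8 bytes = 16 KiB.
TABLE_BYTES = 8 * 256 * 8

if __name__ == "__main__":
    for local_size in (64, 256):
        print(local_size, waves_per_cu(TABLE_BYTES, local_size))
```

With that made-up 16 KiB footprint, a 64-wide workgroup allows at most 4 waves per CU, while a 256-wide workgroup sharing the same tables allows up to 16.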
Not the case - two waves in flight, and your kernel is STILL not actually using the GPU's parallelism like it's supposed to be. One Groestl-512 hash is a big job, and it's parallelizable. If you're doing a throughput of 64 hashes per local workgroup, then use 256 for Groestl, and do 4 work-items per actual hash. Tune to taste.
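The indexing for that layout can be sketched as follows. This is a host-side Python model of the work-item mapping only, with no hashing in it, and map_work_item is a name made up for illustration. It assumes the 4 work-items that share a hash sit at consecutive local IDs, which is one possible arrangement, not necessarily the one a tuned kernel would use.

```python
LANES_PER_HASH = 4      # work-items cooperating on one Groestl-512 hash
HASHES_PER_GROUP = 64   # hashes per local workgroup, as in the post
LOCAL_SIZE = LANES_PER_HASH * HASHES_PER_GROUP   # 256 work-items

def map_work_item(group_id, local_id):
    """Return (hash_index, lane) for one work-item.

    ASSUMES the lanes of a hash occupy consecutive local IDs.
    """
    hash_in_group = local_id // LANES_PER_HASH
    lane = local_id % LANES_PER_HASH
    hash_index = group_id * HASHES_PER_GROUP + hash_in_group
    return hash_index, lane

if __name__ == "__main__":
    for local_id in (0, 3, 4, 255):
        print(local_id, map_work_item(0, local_id))
```

Keeping the 4 lanes adjacent means each hash stays inside one group of 4 consecutive lanes of a wavefront, and changing LANES_PER_HASH is the "tune to taste" knob.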