That shows the importance of SIMD. Now you know where to work ;-)
Thanks for the information, please keep us updated on the progress!
Please tell me if I can be of any help.
I don't know - I know one thing for certain now, though: Kachur's Blake was about the same speed as mine; his BMW, however, needed a bit of work. Mine improved the overall X11 hash by 1.35% (remember that BMW itself is quite a small part of X11, so the raw improvement of my BMW over his is much larger.)
EDIT: search2, originally Groestl-512, did not take to a simple kernel replacement and will have to be investigated further (manual study of the disassembly.) Skipping for now.
EDIT2: search2 may not have been fucked by a difference in output, but by where the bloody constants sit in global memory. For JH, I'm going to make an all-new test kernel that takes a read-only buffer for JH's constants rather than trying to reference constant memory right now. Simpler. I should then be able to put that in place of the Kachur JH and modify SGMiner to pass a constant buffer to that kernel.
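Roughly what I have in mind - a minimal sketch only, with made-up names, not Kachur's actual interface:

// Hypothetical JH test kernel: the round constants come in as a plain
// read-only global buffer, so the host controls exactly where they live,
// instead of the kernel referencing __constant memory.
__kernel void jh_test(__global const ulong *restrict jh_constants,
                      __global const uint *restrict hashes,
                      __global uint *restrict output)
{
    const uint gid = get_global_id(0);

    // Constants become ordinary reads from the caller-supplied buffer.
    const ulong c0 = jh_constants[0];
    const ulong c1 = jh_constants[1];

    // ... the actual JH rounds would go here ...
}

Host side, that's just a clCreateBuffer with CL_MEM_READ_ONLY plus one extra clSetKernelArg when queuing the kernel.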
Is search2 faster than yours, or is it just SIMD?
Maybe Kachur has found a way to make AES-like algos better...
BTW I wouldn't mind a frankenbin if it's faster and stable ;-)
I can't tell - without a straight-up replacement of a kernel, I dunno if he's done some kind of fuckery with part of a hash in one kernel and part in another, for example. What I suspect is that SIMD has been cut into (at least) two parts.
Now, even if his Groestl is faster than mine, my current Groestl is outdated anyway. My R&D area has a bitsliced Groestl that I have not yet played with too much - parallelization using 4 work-items, like it's done in CUDA, should be possible. I can drop to GCN ASM for ds_swizzle_b32 - that limits me to a 4-way split, as it's not a 32-way shuffle like CUDA's, but it's enough for me. I've just got a lot to do atm - maybe there is something we could work on together... a Groestl, perhaps? If you could look at the code and see if you could split it over multiple work-items, using LDS for the data sharing (roughly the shape sketched below), I could probably remove said LDS usage by disassembling and modifying the kernel before reassembling it.
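To be clear, this is only the rough shape I mean - a hypothetical skeleton with invented names, not my actual R&D code:

// Sketch of a 4-way Groestl-512 split: each hash is handled by 4
// cooperating work-items that trade state quarters through LDS.
#define HASHES_PER_GROUP 64

__kernel __attribute__((reqd_work_group_size(4, HASHES_PER_GROUP, 1)))
void groestl_4way(__global const uint *restrict input,
                  __global uint *restrict output)
{
    const uint lane = get_local_id(0);  // which quarter of the state (0-3)
    const uint hash = get_local_id(1);  // which hash in this group

    // lane is dimension 0, so the 4 lanes of one hash are consecutive
    // work-items and land in the same GCN wavefront - that's what would
    // let ds_swizzle_b32 stand in for this LDS exchange after patching
    // the disassembly.
    __local ulong lds[HASHES_PER_GROUP][4][4];

    ulong q[4];                         // this lane's quarter of the state
    // ... load input and initialize q for this lane ...

    for (int r = 0; r < 14; ++r)        // Groestl-512 permutations: 14 rounds
    {
        // Publish our quarter, then gather what the round function needs
        // from the other three lanes.
        for (int i = 0; i < 4; ++i)
            lds[hash][lane][i] = q[i];
        barrier(CLK_LOCAL_MEM_FENCE);
        // ... round function: mix columns gathered from lds[hash][0..3] ...
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // ... lane 0 writes out the finished hash ...
}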
SIMD: tonight I was thinking about it, and slicing it into two parts is the natural way of doing it; I think I could try that. The only little annoyance is that the data to be passed between the (at least two) parts won't just be a hash but a bigger set of data, so the standard sgminer searchX() system wouldn't work - it would take something like the sketch below instead.
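Just a sketch, with an invented midstate layout (MIDSTATE_UINTS is a placeholder, not the real size of SIMD's intermediate data):

// Hypothetical two-part SIMD-512. The data handed from part 1 to part 2
// is larger than a 64-byte hash, so it gets its own buffer rather than
// the hash buffer the usual sgminer searchX() chain passes along.
#define MIDSTATE_UINTS 96   // placeholder: whatever part 1 must hand over

__kernel void simd_part1(__global const uint *restrict hashes,
                         __global uint *restrict midstate)
{
    const uint gid = get_global_id(0);
    // ... first half of SIMD: message expansion etc., then dump the
    // intermediate state:
    // for (uint i = 0; i < MIDSTATE_UINTS; ++i)
    //     midstate[gid * MIDSTATE_UINTS + i] = ...;
}

__kernel void simd_part2(__global const uint *restrict midstate,
                         __global uint *restrict hashes)
{
    const uint gid = get_global_id(0);
    // ... second half of SIMD: finish and write the 64-byte hash back ...
}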
GROESTL (and similar): I always had the idea that NVIDIA had to do the bitslice thing because its shared memory was slower than GCN's; in fact, NVIDIA's bitsliced version is merely on par with the GCN LDS version. As a logical consequence, if bitslicing on GCN is presumed to be slower than on NVIDIA, I wouldn't even try it.
You might not be looking at the big picture with Groestl - look at the fucking shitty number of waves in flight you get due to LDS (ab)use.
That's an issue on <= Tahiti only, which is why I hate optimizing for those chips ;-)
Not the case - two waves in flight, and your kernel is STILL not actually using the GPU's parallelism the way it's supposed to. One Groestl-512 hash is a big job, and it's parallelizable. If you're doing a throughput of 64 hashes per local workgroup, then use a local size of 256 for Groestl and do 4 work-items per actual hash (see the host-side sketch below). Tune to taste.
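On the host side that's nothing exotic - hypothetical handles (queue, groestl_4way, throughput), illustrative numbers:

/* 64 hashes per workgroup, 4 cooperating work-items each = 256 per group.
 * Assumes throughput is a multiple of 64; lanes are dimension 0, matching
 * the groestl_4way sketch above. */
size_t global_size[2] = { 4, throughput };  // 4 lanes per hash
size_t local_size[2]  = { 4, 64 };          // 4 * 64 = 256 work-items/group
cl_int err = clEnqueueNDRangeKernel(queue, groestl_4way, 2, NULL,
                                    global_size, local_size,
                                    0, NULL, NULL);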
I understand what you mean: it's like the good old cgminer "vector size". I will think about it.
Besides, I haven't worked on Groestl for a long while, but on Whirlpool and variants I can easily get 3 waves on >= Hawaii.
It's a lighter job, I know, but I haven't had any interest in developing Groestl recently.
No, it is the OPPOSITE of vector size. I don't think you get how the GPU is ACTUALLY supposed to solve these problems - it really doesn't fucking like large code size, or very complex problems in one work-item - you know this.
Vectors were profitable before because of the old, VLIW-based architectures. GCN abolished hardware vectors and instead made VGPRs 4 bytes wide. Why, you may ask? Occupancy! This way, if you need to work on a problem that can't be efficiently vectorized like that, you don't waste most of your VGPRs.
But, but, but... mah parallelism! GCN has you covered - you just need to think of the shit differently. Instead of parallelizing in vectors, do it in work-items. To give you the cleanest example I've worked with demonstrating this (in X11), take Echo-512.
You have a 256-byte state, which I'll now refer to as W. W can be represented as an array of 16 uint4s. If you're looking at the shitty darkcoin-mod.cl trying to visualize this, just look at the 64-bit W vars and imagine them as 32-bit, and an array. Now, if I were going to demonstrate this technique with Echo - each work-item holds an array of 4 uint4s; this is its W. To figure out which part of the hash you are, you can go one of two ways: launch the kernel with a local size of (throughput * 4, 1, 1), or (throughput, 4, 1). Since the latter is cleaner, I'll assume that notation: lid = get_local_id(0), and hashid = get_local_id(1).
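In kernel-skeleton form (hypothetical names, just to pin the indexing down):

#define WORKSIZE 64  // hashes per workgroup - tune to taste

__kernel __attribute__((reqd_work_group_size(WORKSIZE, 4, 1)))
void echo_4way(__global const uint *restrict input,
               __global uint *restrict output)
{
    const uint gid    = get_global_id(0);  // which hash overall
    const uint lid    = get_local_id(0);   // which hash within the group
    const uint hashid = get_local_id(1);   // which quarter of W this lane owns (0-3)

    // Each lane carries 4 of the 16 uint4s making up the 256-byte state W;
    // the per-lane initialization is spelled out in the next snippet.
}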
If hashid is < 2 (i.e. 0 or 1), we fill up W with (512, 0, 0, 0) (uint4, remember) over all four array indices. If hashid == 2, W becomes the input (the input being 16 uints, it may be represented as 4 uint4s as well), and if hashid == 3, we fill up W with the odds & ends - for X11, these are (0x80, 0, 0, 0) for W[0], (0, 0, 0, 0) for W[1], (0, 0, 0, 0x02000000) for W[2], and (512, 0, 0, 0) for W[3]. Now, go pull up darkcoin-mod.cl, and look at it until this and the previous paragraph make sense.
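In code, that initialization is just this (continuing the hypothetical skeleton above):

    uint4 W[4];  // this lane's 4 of the 16 uint4s

    if (hashid < 2)
    {
        // Lanes 0 and 1: all four entries are (512, 0, 0, 0).
        for (int i = 0; i < 4; ++i)
            W[i] = (uint4)(512, 0, 0, 0);
    }
    else if (hashid == 2)
    {
        // Lane 2: the input - 16 uints read as 4 uint4s (assuming a
        // contiguous 16-uint-per-hash input layout).
        for (int i = 0; i < 4; ++i)
            W[i] = vload4(i, input + gid * 16);
    }
    else
    {
        // Lane 3: the odds & ends for X11.
        W[0] = (uint4)(0x80, 0, 0, 0);
        W[1] = (uint4)(0, 0, 0, 0);
        W[2] = (uint4)(0, 0, 0, 0x02000000);
        W[3] = (uint4)(512, 0, 0, 0);
    }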
I'll continue with rounds and output calculation in another post in just a bit.