Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 5. (Read 61261 times)

hero member
Activity: 672
Merit: 500
That's some truly slick updates!

I was indeed planning to do a full AES round without T-tables, as the number of masks is nonsensical.
I had the impression the SALU was immensely updated for Tonga, given it takes many more VGPRs on the analyzer.

I wonder how to trick the CL compiler into emitting this code.

But most importantly, what are they waiting for to just make an AMD_GCN_swizzle extension!?
newbie
Activity: 32
Merit: 0
Hi,

Have you checked the new GCN3 ISA manual? http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf

It has some really useful things like:

- Byte permute (no more shifts and masks)
- VOP_DPP: it effectively does 2 ds_swizzle operations within the instruction at no extra cost, so optimizing a single thread for 4 lanes costs no more cycles.
- VOP_SDWA: access a word or a byte in the 32-bit inputs, and in the output too (again: no more shifts and masks).
- The SALU can write to memory

No 3-operand add or 3-operand bitwise ops, though.

And they altered some instruction encodings, so I guess my asm will crash on GCN3 immediately. :D
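
To make the "no more shifts and masks" point concrete, here is a minimal Python sketch of the idea. `byte_permute` is a simplified software model written only for this illustration: its name, argument order and selector encoding are assumptions, not the exact semantics of the GCN3 instruction. It just shows how one selector word can stand in for several shift-and-mask steps.

```python
def extract_byte_shift_mask(word, k):
    """Classic way: one shift and one mask per extracted byte."""
    return (word >> (8 * k)) & 0xFF


def byte_permute(src_hi, src_lo, selector):
    """Simplified model of a byte permute (hypothetical encoding).

    Selector byte i (value 0..7) picks which byte of the 64-bit value
    hi:lo lands in byte i of the 32-bit result. Any other selector
    values a real instruction may define are not modelled here.
    """
    combined = ((src_hi & 0xFFFFFFFF) << 32) | (src_lo & 0xFFFFFFFF)
    out = 0
    for i in range(4):
        sel = (selector >> (8 * i)) & 0xFF
        if sel > 7:
            raise ValueError("selector byte out of range for this model")
        out |= ((combined >> (8 * sel)) & 0xFF) << (8 * i)
    return out


w = 0xAABBCCDD
# Zero-extended byte 2 of w: pass 0 as the high source, fill from it.
one_byte = byte_permute(0, w, 0x04040402)
# Full byte swap of w with a single selector.
swapped = byte_permute(0, w, 0x00010203)
```

In this model a single call replaces a shift, a mask and (for the swap) several ORs; whether a compiler actually emits a permute for such patterns is a separate question.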

member
Activity: 81
Merit: 1002
It was only the wind.
Good work on the groestl. Smolen's quark miner does around 2 MH/s on the 280x.
My GTX 980 does 20 MH/s. The competition is sleeping...

I think that just applying some well-known tricks, already available in public kernels, will bring the quark hashrate to around 10.
Thing is, it's not fun. Optimizing single-kernel algos is much more interesting, imho.

I agree about the easy Quark speedups. Doing this was fun, doing the foundation for Quark would be simply boring. Sure, it would get fun later, when working on one algo at a time, but to get there...
hero member
Activity: 524
Merit: 500
Just wanted to say I've tried applying some of the tricks I learnt working on whirlpoolx to the groestl kernel, but it's not so simple.
This kernel is much bigger, so you can't just copy some good lines of code and have it run faster. Furthermore, some of the optimizations I made in the past make it more time-consuming to apply some apparently simple hacks. Wolf0, I'm sure you know what I mean ;-)
Still, there is room for improvement and I have some ideas, but the question is: when the profit is gone, and the fun is gone, is it still worth it?
Another trick, not for speed, but for cleaning up the code: when you want to postpone S-boxing of a byte, put the preimage of zero (0x81 in Whirlpool) there.
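
One possible reading of the preimage-of-zero trick, sketched in Python. AES is used here as a stand-in because its S-box is easy to generate; the Whirlpool value 0x81 quoted above is taken from the post and not checked here. The helper names are made up for this example. The point is that a combined "S-box plus linear mix" table entry is zero at the S-box's zero preimage, so a placeholder byte there contributes nothing when table outputs are XORed together.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf_inv(a):
    """Inverse as a^254; 0 maps to 0, as the AES S-box requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def aes_sbox(x):
    """AES S-box: field inverse followed by the affine transform."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [aes_sbox(x) for x in range(256)]
# Combined table: S-box output multiplied by the MixColumns column (2, 1, 1, 3).
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]

ZERO_PREIMAGE = SBOX.index(0)   # the byte whose S-box output is 0

acc = 0x12345678
acc_after = acc ^ T0[ZERO_PREIMAGE]   # XORing a zero entry leaves acc unchanged
```

With the placeholder in place, the lookup for that byte becomes a no-op in the XOR sum, and the real S-boxed value can be folded in later.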
hero member
Activity: 630
Merit: 500
I get 26 MH/s on a 280x mining groestl; however, I have quit groestl mining of DMD for the moment until diff drops back into the teens. For some reason the ASM kernel crashes the 7950 within a few minutes ... I am mining neoscrypt on yaamp at present and also selling neo on westhash :)
Buying more DMD than I used to mine directly ??? Will see what happens in the next week or so as miners drop like flies on DMD ...
hero member
Activity: 630
Merit: 500
@wolf0 do you have anything better than the neoscrypt kernel u leaked on the feathercoin thread? I am getting 278 KH/s on 7950 and 295 on 280x

I didn't leak that, I released it. Checking my records...

EDIT: Okay, most recent record of Neoscrypt I have is 12/23/2014 (NSFW): https://ottrbutt.com/miner/neoscryptwolf-12232014.png
Needless to say, but I will: I appreciate your work. I have no conception of wavefronts and such; I have tried, but I'm just too old to embrace new concepts. If you have something better for me, please do put it on Mega :) Same goes for groestl, Pallas :) U are my heroes :)
And realhet, who understands AMD GPU coding better than all of us :) realhet's hetpas assembly kernel is still best for 280x and other Tahiti cards AFAIK :)

Nope, I have 21MH/s out of a 7950 at 1125/1250, IIRC, using OpenCL.
Wow! May I have the new Neoscrypt kernel? The 7950 is working hard just doing 278 KH/s with your older kernel!

Looking ... 1160/1500 :) I have modded the card a bit for better cooling :)
member
Activity: 81
Merit: 1002
It was only the wind.
Any chance of getting your latest OCL source to try on 280x (Hawaii) :)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah, Tahiti. 2 wavefronts not possible?

Both Wolf0 and I tried that and (at least in my case) stopped trying after a while. No longer fun ;-)

Just did it. :D

2 waves in flight on Tahiti, for 21MH/s on a 7950 @ 1125/1250. Screenshot (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-04082015.png
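
For readers unfamiliar with the "waves in flight" talk: on GCN the number of wavefronts a SIMD can keep resident is limited, among other things, by how many VGPRs the kernel uses. The sketch below is a rough occupancy estimate under stated assumptions (256 VGPRs available per lane, a cap of 10 waves per SIMD, allocation in blocks of 4). These are commonly cited GCN figures, not values taken from this thread, and SGPR and LDS limits are ignored.

```python
VGPRS_PER_LANE = 256       # assumed VGPR budget per SIMD lane
MAX_WAVES_PER_SIMD = 10    # assumed hardware cap
ALLOC_GRANULARITY = 4      # assumed VGPR allocation granularity


def waves_per_simd(vgprs_used):
    """Estimate resident wavefronts per SIMD from VGPR usage alone."""
    if not 1 <= vgprs_used <= VGPRS_PER_LANE:
        raise ValueError("VGPR count out of range")
    # Round the request up to the allocation granularity.
    allocated = -(-vgprs_used // ALLOC_GRANULARITY) * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


two_waves = waves_per_simd(128)   # fits two waves under these assumptions
one_wave = waves_per_simd(130)    # a couple of registers over: one wave
```

Under these assumptions a kernel has to stay at or below 128 VGPRs to run two waves per SIMD, which is why shaving even a couple of registers can matter so much.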
hero member
Activity: 630
Merit: 500
Just wanted to say I've tried applying some of the tricks I learnt working on whirlpoolx to the groestl kernel, but it's not so simple.
This kernel is much bigger, so you can't just copy some good lines of code and have it run faster. Furthermore, some of the optimizations I made in the past make it more time-consuming to apply some apparently simple hacks. Wolf0, I'm sure you know what I mean ;-)
Still, there is room for improvement and I have some ideas, but the question is: when the profit is gone, and the fun is gone, is it still worth it?

I expect DMD to drop into low-teens difficulty after a week or so :) If it does not, mining is dead LOL. I have a direct interest in this as a partner on donkypool ... 12 miners, up from 6 a few weeks ago ... I am currently mining neoscrypt for sale on westhash lol, and selling at p=4.8 :) anything less goes to yaamp ...
member
Activity: 81
Merit: 1002
It was only the wind.
Any chance of getting your latest OCL source to try on 280x (Hawaii) :)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah, Tahiti. 2 wavefronts not possible?

Both Wolf0 and I tried that and (at least in my case) stopped trying after a while. No longer fun ;-)

I think I've beaten your ASM with pure OpenCL on 290X.

Some of your last tips (and smolen's) can be applied to this kernel as well; I think it can reach 38-40 MH/s ;-)

Possibly - but it turns out it didn't reach the speed I thought it did; it's still slightly under. Damn.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Just wanted to say I've tried applying some of the tricks I learnt working on whirlpoolx to the groestl kernel, but it's not so simple.
This kernel is much bigger, so you can't just copy some good lines of code and have it run faster. Furthermore, some of the optimizations I made in the past make it more time-consuming to apply some apparently simple hacks. Wolf0, I'm sure you know what I mean ;-)
Still, there is room for improvement and I have some ideas, but the question is: when the profit is gone, and the fun is gone, is it still worth it?
hero member
Activity: 630
Merit: 500
@wolf0 do you have anything better than the neoscrypt kernel u leaked on the feathercoin thread? I am getting 278 KH/s on 7950 and 295 on 280x

I didn't leak that, I released it. Checking my records...

EDIT: Okay, most recent record of Neoscrypt I have is 12/23/2014 (NSFW): https://ottrbutt.com/miner/neoscryptwolf-12232014.png
Needless to say, but I will: I appreciate your work. I have no conception of wavefronts and such; I have tried, but I'm just too old to embrace new concepts. If you have something better for me, please do put it on Mega :) Same goes for groestl, Pallas :) U are my heroes :)
And realhet, who understands AMD GPU coding better than all of us :) realhet's hetpas assembly kernel is still best for 280x and other Tahiti cards AFAIK :)
member
Activity: 81
Merit: 1002
It was only the wind.
Any chance of getting your latest OCL source to try on 280x (Hawaii) :)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah, Tahiti. 2 wavefronts not possible?

Both Wolf0 and I tried that and (at least in my case) stopped trying after a while. No longer fun ;-)

I think I've beaten your ASM with pure OpenCL on 290X.
hero member
Activity: 630
Merit: 500
@wolf0 do you have anything better than the neoscrypt kernel u leaked on the feathercoin thread? I am getting 278 KH/s on 7950 and 295 on 280x
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Pallas, can you rewrite this groestl-256 implementation as groestl-512 and add it to sgminer (x11, x13, x15)?

Sorry for the delay.
That would be nice, but everybody's using wolf0's binaries, so why? It would make sense if there were a plan to open-source optimized versions of most of the algos.
hero member
Activity: 524
Merit: 500
Wolf0 claims to know AES from the inside, backwards and forwards. Me too.

The answer is SEA :-)
Yes, that makes the game damn addictive.
Look, you told us about wide tables, great idea, but to skip S-boxing with them you need to go a couple more inches deeper inside AES :)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Some of the competitors are awake, doing exercises with pen and paper to get all the AES wannabes at once :D

Wolf0 claims to know AES from the inside, backwards and forwards. Me too.

The answer is SEA :-)

sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Pallas, can you rewrite this groestl-256 implementation as groestl-512 and add it to sgminer (x11, x13, x15)?

hero member
Activity: 630
Merit: 500
The need for an improved groestl kernel is now immediate ... please do what u can ... I am just a C/C++ coder and am not fully into multi-threaded GPU coding ...
member
Activity: 81
Merit: 1002
It was only the wind.
Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.
Have you rotated table values left by 3 bits? ;) Not sure it will help with register usage though...

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well-done code by Pallas. I'm not really used to seeing anyone with a semblance of a clue doing AMD miners. :P

Now I've put some parts of the code (e.g. the list of rbtts) in pragma-unrolled for loops and it looks much better ;-)

Nice - now, I haven't tried this, so the OpenCL compiler may mangle the shit out of it (unpredictable little fucker) - but you have vector types. They look a lot nicer than for loops. :P
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
That's about as far as my parse got before I went, "Is that a fucking NULL pointer dereference?"
Yes :) The indexed address is calculated in bitselect. LUT0 and LUT4 indexing is just a single AND operation.
EDIT: Oh, wait, UINT8 is a byte, not an int vector. I probably went too far redefining every type :)

Okay... I'm guessing that you've removed bits from the tables and are regenerating them on the fly, but I can't quite figure out how. Then again, bitwise ops aren't really my best subject...
Tables are constant, just prerotated left by 3 bits (the size of one uint2, 8 bytes, when used as an index). Well, this stuff needs comments if the kernel is to be published. The money is in X11 and Monero, there's not so much value in the Whirlpool code; I could just drop it somewhere, but it would give everyone a free boost in X11 :(
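
A small Python model of how the rotate-by-3 trick appears to work, assuming 64-bit state words stored as two 32-bit halves (like a uint2) and 8-byte table entries. Function names are invented for this sketch and the real kernel is not shown in the thread, so treat it as an interpretation. If the state is kept rotated left by 3 bits, each byte already sits at "index times 8", so the byte offset into a table needs no separate scaling, and bytes 0 and 4 fall out of the low and high halves with a single AND.

```python
MASK64 = (1 << 64) - 1
ENTRY_BYTES = 8            # one uint2 table entry is 8 bytes


def rotl64(x, n):
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def rotr64(x, n):
    return rotl64(x, (64 - n) % 64)


def offset_plain(state, k):
    """Byte offset for byte k of the state: shift, mask, then scale by 8."""
    return ((state >> (8 * k)) & 0xFF) * ENTRY_BYTES


def offset_prerotated(state_rot3, k):
    """Same offset when the state is already rotated left by 3 bits."""
    return rotr64(state_rot3, 8 * k) & 0x7F8


def split_halves(x):
    """Return (lo, hi) 32-bit halves, mimicking a uint2."""
    return x & 0xFFFFFFFF, x >> 32


state = 0x0123456789ABCDEF
rot = rotl64(state, 3)
lo, hi = split_halves(rot)
lut0_offset = lo & 0x7F8   # byte 0: one AND on the low half
lut4_offset = hi & 0x7F8   # byte 4: one AND on the high half
```

Since rotation commutes with XOR, XORing prerotated table entries yields a prerotated result, which is presumably why the tables can stay constant. Bytes 3 and 7 straddle a 32-bit half boundary in this layout, which may be where the bitselect mentioned above comes in.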

Maybe not: people are using wolf0's precompiled x11 binaries, just adding your trick to stock kernels will not come close to them speed-wise.