[ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 5.

MaxDZ8

hero member

Activity: 672

Merit: 500

Those are some truly slick updates!

I was indeed planning to do a full AES round without t-tables, as the number of masks is nonsensical.
I had the impression the SALU was immensely updated for Tonga, given it takes many more VGPRs on the analyzer.

I wonder how to trick the CL compiler into emitting this code.

But most importantly, what are they waiting for to just make an AMD_GCN_swizzle extension!?

realhet

newbie

Activity: 32

Merit: 0

Hi,

Have you checked the new GCN3 ISA manual? http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf

It has some really useful things like:

- Bytepermute (no more shifts and masks)
- VOP_DPP: It actually does 2 ds_swizzle in the instruction in no time, so optimizing a single thread for 4 lanes costs no more cycles.
- VOP_SDWA: access a word or a byte in the 32bit inputs and in the output too. (again: no more shifts and masks)
- The scalar ALU (SALU) can write to memory

No 3-operand add or 3-operand bitwise ops, though.

And they altered some instruction encodings, so I guess my asm will crash on GCN3 immediately. :D
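To make the "no more shifts and masks" point concrete, here is a minimal C sketch comparing classic byte extraction with a software model of a byte permute. The helper `byte_perm` and its selector encoding are hypothetical and only model the general idea; they are not the exact semantics of the GCN3 instruction, which the ISA manual defines.

```c
#include <stdint.h>

/* Classic extraction: one shift and one mask per byte. */
static uint32_t get_byte(uint32_t w, unsigned n)
{
    return (w >> (8u * n)) & 0xFFu;
}

/* Hypothetical software model of a byte permute: each selector byte
 * (only its low 3 bits are used here) picks one byte out of the
 * 64-bit pair {hi:lo}. Real hardware selectors have extra modes that
 * are not modeled. */
static uint32_t byte_perm(uint32_t hi, uint32_t lo, uint32_t sel)
{
    uint64_t src = ((uint64_t)hi << 32) | lo;
    uint32_t out = 0;
    for (unsigned i = 0; i < 4; ++i) {
        unsigned s = (sel >> (8u * i)) & 0x7u;
        out |= (uint32_t)((src >> (8u * s)) & 0xFFu) << (8u * i);
    }
    return out;
}
```

With shifts and masks, rearranging four bytes costs several operations per output word; in the model above a single call with the right selector does the whole rearrangement, which is the saving the post refers to.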

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on April 08, 2015, 06:10:51 AM

Quote from: sp_ on April 08, 2015, 05:55:23 AM

Good work on the groestl. Smolen's quark miner does around 2 mhash on the 280x.
My gtx 980 does 20mhash. The competition is sleeping...

I think that just applying some well known tricks, already available on public kernels, will bring quark hashrate to around 10.
Thing is, it's not fun. Optimizing single kernel algos is much more interesting, imho.

I agree about the easy Quark speedups. Doing this was fun, doing the foundation for Quark would be simply boring. Sure, it would get fun later, when working on one algo at a time, but to get there...

smolen

hero member

Activity: 524

Merit: 500

Quote from: pallas on April 16, 2015, 10:27:14 AM

Just wanted to say I've tried applying some of the tricks I learnt working on whirlpoolx to the groestl kernel, but it's not so simple.
This kernel is much bigger in size, so you can't just copy some good lines of code and have it run faster. Furthermore, some of the optimizations I made in the past make it more time consuming to apply some apparently simple hacks. Wolf0, I'm sure you know what I mean ;-)
Still, there is room for improvement and I have some ideas, but the question is: when the profit is gone, and the fun is gone, is it still worth it?

Another trick, not for speed but for cleaning the code: when you want to postpone the S-boxing of a byte, put the preimage of zero (0x81 in Whirlpool) there.
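One way to read the preimage-of-zero trick is sketched below in C. Everything here is a toy under stated assumptions: `toy_sbox` is not the Whirlpool S-box (it only shares the property that 0x81 maps to zero), and `toy_table`, `mix_all` and `mix_deferred` are hypothetical names. The point is that a placeholder byte whose table entry is zero lets the uniform lookup-and-XOR code run unchanged while the real lookup is added later.

```c
#include <stdint.h>

#define ZERO_PREIMAGE 0x81u  /* toy S-box maps this byte to 0 */

/* Toy S-box (assumption): only mimics "S(0x81) == 0". */
static uint8_t toy_sbox(uint8_t x)
{
    return (uint8_t)(x ^ ZERO_PREIMAGE);
}

/* Toy wide-table entry: the S-box output spread over a word. */
static uint32_t toy_table(uint8_t x)
{
    return 0x01010101u * toy_sbox(x);
}

/* Uniform lookup-and-XOR over all four bytes, no special cases. */
static uint32_t mix_all(const uint8_t b[4])
{
    uint32_t acc = 0;
    for (int i = 0; i < 4; ++i)
        acc ^= toy_table(b[i]);
    return acc;
}

/* Same result, but the lookup for byte k is postponed. */
static uint32_t mix_deferred(const uint8_t b[4], int k)
{
    uint8_t tmp[4] = { b[0], b[1], b[2], b[3] };
    tmp[k] = ZERO_PREIMAGE;            /* placeholder contributes zero */
    uint32_t partial = mix_all(tmp);   /* unchanged code path */
    return partial ^ toy_table(b[k]);  /* postponed lookup added later */
}
```

Because the placeholder's table entry is all zeros, no branch or special case is needed for the postponed byte, which is what keeps the code clean.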

utahjohn

hero member

Activity: 630

Merit: 500

I get 26MHs on a 280x mining groestl; however, I have quit groestl mining of DMD for the moment until diff drops back into the teens. For some reason the ASM kernel crashes the 7950 within a few minutes ... I am mining neoscrypt on yaamp at present and also selling neo on westhash

Buying more DMD than I used to mine direct ???

Will see what happens in next week or so as miners drop like flies on DMD ...

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: ?? on ??

Quote from: utahjohn on April 16, 2015, 10:03:22 AM

Quote from: Wolf0 on April 16, 2015, 12:49:22 AM

Quote from: utahjohn on April 16, 2015, 12:33:39 AM

@wolf0 do you have anything better than the neoscrypt kernel u leaked on feathercoin thread? I am getting 278KHs on 7950 and 295 on 280x

I didn't leak that, I released it. Checking my records...

EDIT: Okay, most recent record of Neoscrypt I have is 12/23/2014 (NSFW): https://ottrbutt.com/miner/neoscryptwolf-12232014.png

Needless to say, but I will: I appreciate your work. I have no conception of wavefronts and such; I have tried, but I'm just too old to embrace new concepts. If you have something better for me, please do put it on Mega

Same goes for groestl Pallas

U are my heroes

And realhet who understands AMD GPU coding better than all of us

realhet's hetpas assembly kernel is still the best for 280x and other Tahiti cards AFAIK

Nope, I have 21MH/s out of a 7950 at 1125/1250, IIRC, using OpenCL.

Wow! May I have the new Neoscrypt kernel? The 7950 is working hard just doing 278KHs with your older kernel!

Looking ... 1160/1500

I have modded card a bit for better cooling

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on April 03, 2015, 06:26:39 AM

Quote from: utahjohn on April 03, 2015, 06:20:07 AM

Quote from: pallas on April 03, 2015, 05:36:47 AM

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah Tahiti 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)

Just did it. :D

2 waves in flight on Tahiti, for 21MH/s on a 7950 @ 1125/1250. Screenshot (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-04082015.png
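For readers wondering why register count decides whether a second wave fits, here is a rough C sketch of the usual occupancy arithmetic. The figures are assumptions about GCN hardware (256 VGPRs per lane per SIMD, allocation in granules of 4, at most 10 waves per SIMD), and `waves_per_simd` is a hypothetical helper, not part of any AMD tool. The register counts in the usage note are illustrative, not measurements from this kernel.

```c
#include <stdint.h>

/* Assumed GCN figures: 256 VGPRs per lane per SIMD, allocation
 * granularity of 4 registers, at most 10 waves per SIMD. */
enum { VGPR_BUDGET = 256, VGPR_GRANULE = 4, MAX_WAVES = 10 };

/* Waves that fit on one SIMD when limited only by VGPR usage. */
static unsigned waves_per_simd(unsigned vgprs_per_thread)
{
    if (vgprs_per_thread == 0)
        return MAX_WAVES;
    /* Round the request up to the allocation granule. */
    unsigned alloc = (vgprs_per_thread + VGPR_GRANULE - 1)
                     / VGPR_GRANULE * VGPR_GRANULE;
    if (alloc > VGPR_BUDGET)
        return 0;  /* does not fit at all */
    unsigned waves = VGPR_BUDGET / alloc;
    return waves > MAX_WAVES ? MAX_WAVES : waves;
}
```

Under these assumptions a kernel at 130 VGPRs gets one wave per SIMD, while one at 128 gets two, so shaving just a couple of registers can double the waves in flight.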

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: pallas on April 16, 2015, 10:27:14 AM

Just wanted to say I've tried applying some of the tricks I learnt working on whirlpoolx to the groestl kernel, but it's not so simple.
This kernel is much bigger in size, so you can't just copy some good lines of code and have it run faster. Furthermore, some of the optimizations I made in the past make it more time consuming to apply some apparently simple hacks. Wolf0, I'm sure you know what I mean ;-)
Still, there is room for improvement and I have some ideas, but the question is: when the profit is gone, and the fun is gone, is it still worth it?

I expect DMD to drop into low teens difficulty after a week or so

If it does not, mining is dead LOL. I have a direct interest in this as a partner on donkypool ... 12 miners, up from 6 a few weeks ago ... I am currently mining neoscrypt for sale on westhash lol and p=4.8 selling

anything less goes to yaamp ...

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on April 08, 2015, 04:35:38 AM

Quote from: Wolf0 on April 08, 2015, 04:11:38 AM

Quote from: pallas on April 03, 2015, 06:26:39 AM

Quote from: utahjohn on April 03, 2015, 06:20:07 AM

Quote from: pallas on April 03, 2015, 05:36:47 AM

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah Tahiti 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)

I think I've beaten your ASM with pure OpenCL on 290X.

Some of your last tips (and smolen's) can be applied to this kernel as well, I think it can reach 38/40 Mh/s ;-)

Possibly - but turns out, it didn't reach the speed I thought it did; it's still slightly under. Damn.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Just wanted to say I've tried applying some of the tricks I learnt working on whirlpoolx to the groestl kernel, but it's not so simple.
This kernel is much bigger in size, so you can't just copy some good lines of code and have it run faster. Furthermore, some of the optimizations I made in the past make it more time consuming to apply some apparently simple hacks. Wolf0, I'm sure you know what I mean ;-)
Still, there is room for improvement and I have some ideas, but the question is: when the profit is gone, and the fun is gone, is it still worth it?

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: Wolf0 on April 16, 2015, 12:49:22 AM

Quote from: utahjohn on April 16, 2015, 12:33:39 AM

@wolf0 do you have anything better than the neoscrypt kernel u leaked on feathercoin thread? I am getting 278KHs on 7950 and 295 on 280x

I didn't leak that, I released it. Checking my records...

EDIT: Okay, most recent record of Neoscrypt I have is 12/23/2014 (NSFW): https://ottrbutt.com/miner/neoscryptwolf-12232014.png

Needless to say, but I will: I appreciate your work. I have no conception of wavefronts and such; I have tried, but I'm just too old to embrace new concepts. If you have something better for me, please do put it on Mega

Same goes for groestl Pallas

U are my heroes

And realhet who understands AMD GPU coding better than all of us

realhet's hetpas assembly kernel is still the best for 280x and other Tahiti cards AFAIK

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on April 03, 2015, 06:26:39 AM

Quote from: utahjohn on April 03, 2015, 06:20:07 AM

Quote from: pallas on April 03, 2015, 05:36:47 AM

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah Tahiti 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)

I think I've beaten your ASM with pure OpenCL on 290X.

utahjohn

hero member

Activity: 630

Merit: 500

@wolf0 do you have anything better than the neoscrypt kernel u leaked on feathercoin thread? I am getting 278KHs on 7950 and 295 on 280x

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: sp_ on April 13, 2015, 06:38:45 AM

Pallas, can you rewrite this groestl-256 implementation to a groestl-512 and add it to sgminer (x11, x13, x15)?

Sorry for the delay.
That would be nice, but everybody's using wolf0's binaries, so why? It would make sense if there were a plan to open-source optimized versions of most of the algos.

smolen

hero member

Activity: 524

Merit: 500

Quote from: sp_ on April 13, 2015, 08:53:50 AM

Wolf0 claims to know AES from the inside, backwards and forwards. Me too.

The answer is SEA :-)

Yes, that makes the game damn addictive.
Look, you told us about wide tables, a great idea, but to skip S-boxing with them you need to go a couple more inches deeper inside AES

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: smolen on April 08, 2015, 03:04:49 PM

Some of the competitors are awake, doing exercises with pen and paper to get all the AES-wannabes at once :D

Wolf0 claims to know AES from the inside, backwards and forwards. Me too.

The answer is SEA :-)

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Pallas, can you rewrite this groestl-256 implementation to a groestl-512 and add it to sgminer (x11, x13, x15)?

utahjohn

hero member

Activity: 630

Merit: 500

The need for an improved groestl kernel is now immediate ... please do what u can ... I am just a C/C++ coder and am not fully into multi-threaded GPU coding ...

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: pallas on March 16, 2015, 04:02:43 PM

Quote from: Wolf0 on March 16, 2015, 02:59:02 PM

Quote from: smolen on March 16, 2015, 02:54:01 PM

Quote from: Wolf0 on March 16, 2015, 10:30:25 AM

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.

Have you rotated table values left by 3 bits? ;)

Not sure it will help with register usage though...

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well done code by Pallas. I'm not really used to seeing anyone with a semblance of clue doing AMD miners. :P

Now I've put some parts of the code (ex. the list of rbtts) in pragma unrolled for loops and it looks much better ;-)

Nice - now, I haven't tried this, so the OpenCL compiler may mangle the shit out of it (unpredictable little fucker) - but you have vector types. They look a lot nicer than for loops. :P
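As an illustration of the cleanup discussed in the quotes above, here is a plain C sketch that folds a written-out list of table lookups into a loop. `toy_t` is a hypothetical stand-in for the real round tables, and the unroll hint appears only as a comment because its spelling is compiler-specific; this is not Pallas's actual code.

```c
#include <stdint.h>

/* Hypothetical stand-in for round table number i; real kernels use
 * precomputed constant tables instead. */
static uint64_t toy_t(unsigned i, uint8_t x)
{
    uint64_t v = 0x0102030405060708ull * x;
    unsigned r = 8u * i;  /* rotate left by i bytes */
    return r ? (v << r) | (v >> (64u - r)) : v;
}

/* Written-out form: one line per lookup, hard to read and maintain. */
static uint64_t column_written_out(const uint8_t b[8])
{
    return toy_t(0, b[0]) ^ toy_t(1, b[1]) ^ toy_t(2, b[2]) ^ toy_t(3, b[3])
         ^ toy_t(4, b[4]) ^ toy_t(5, b[5]) ^ toy_t(6, b[6]) ^ toy_t(7, b[7]);
}

/* Loop form: same work, and with a fixed trip count the compiler can
 * unroll it. In OpenCL C an unroll hint such as "#pragma unroll"
 * would be placed directly above the loop. */
static uint64_t column_loop(const uint8_t b[8])
{
    uint64_t acc = 0;
    for (unsigned i = 0; i < 8; ++i)
        acc ^= toy_t(i, b[i]);
    return acc;
}
```

Both forms compute the same value; whether the generated code and register usage match is up to the compiler, which is the caveat raised in the post.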

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: smolen on April 08, 2015, 04:21:19 PM

Quote from: Wolf0 on April 08, 2015, 03:46:11 PM

Quote from: smolen on April 08, 2015, 03:27:08 PM

Quote from: Wolf0 on April 08, 2015, 03:23:55 PM

That's about as far as my parse got before I went, "Is that a fucking NULL pointer dereference?"

Yes

Indexed address is calculated in bitselect. LUT0 and LUT4 indexing is just single AND operation.
EDIT: Oh, wait UINT8 is byte, not int vector. I probably went too far redefining every type

Okay... I'm guessing that you've removed bits from the tables and are regenerating them on the fly, but I can't quite figure out how. Then again, bitwise ops aren't really my best subject...

Tables are constant, just prerotated left by 3 bits (size of one uint2 when used as index). Well, this stuff needs comments if the kernel gets published. The money is in X11 and Monero, not so much value in Whirlpool code; I could just drop it somewhere, but it will give everyone a free boost in X11 :(

Maybe not: people are using wolf0's precompiled x11 binaries, just adding your trick to stock kernels will not come close to them speed-wise.
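The pre-rotation trick quoted above can be sketched in plain C. This is a reading of the description, not smolen's kernel: with 8-byte table entries, a byte index has to be multiplied by 8 to become a byte offset, and if the data word is already rotated left by 3 bits, that scaled offset falls out of a single AND. The names `offset_plain`, `offset_prerotated` and `load_entry` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { ENTRY_BYTES = 8 };  /* one 64-bit entry, like a uint2 */

/* Rotate left; r must be in 1..63. */
static uint64_t rotl64(uint64_t v, unsigned r)
{
    return (v << r) | (v >> (64u - r));
}

/* Plain form: mask out the byte, then scale it to a byte offset. */
static uint32_t offset_plain(uint64_t w)
{
    return (uint32_t)(w & 0xFFu) * ENTRY_BYTES;
}

/* Pre-rotated form: the word is kept rotated left by 3, so the low
 * byte already sits at bits 3..10 and one AND yields the offset. */
static uint32_t offset_prerotated(uint64_t w_rot3)
{
    return (uint32_t)(w_rot3 & 0x7F8u);
}

/* Fetch a table entry by byte offset rather than by index. */
static uint64_t load_entry(const uint64_t *table, uint32_t byte_off)
{
    uint64_t v;
    memcpy(&v, (const unsigned char *)table + byte_off, sizeof v);
    return v;
}
```

If the table values themselves are stored rotated left by 3, the words produced by XORing lookups are already in rotated form for the next round, so the rotation never has to be computed at run time. The other byte positions still need their own shift or select, which fits the remark that only some of the lookups reduce to a single AND.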
