Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels

Quote from: Wolf0 on April 08, 2015, 04:11:38 AM

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?

I don't think so.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: pallas on April 03, 2015, 06:26:39 AM

Quote from: utahjohn on April 03, 2015, 06:20:07 AM

Quote from: pallas on April 03, 2015, 05:36:47 AM

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah Tahiti 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)

I think I've beaten your ASM with pure OpenCL on 290X.

Some of your last tips (and smolen's) can be applied to this kernel as well, I think it can reach 38/40 Mh/s ;-)

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

pallas

Quote from: utahjohn on April 03, 2015, 06:20:07 AM

Quote from: pallas on April 03, 2015, 05:36:47 AM

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah Tahiti 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: pallas on April 03, 2015, 05:36:47 AM

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah Tahiti 2 wavefronts not possible?

pallas

Quote from: utahjohn on April 03, 2015, 12:49:17 AM

Any chance of getting your latest OCL source to try on 280x (Hawaii)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

utahjohn

Any chance of getting your latest OCL source to try on 280x (Hawaii)

Wolf0

Quote from: pallas on March 05, 2015, 03:45:32 AM

Quote from: Wolf0 on March 05, 2015, 02:16:49 AM

Quote from: HR on February 28, 2015, 04:35:46 AM

Pallas,

Are you planning on adding myriad-groestl support in the future? If not, could you explain why not? Is it because your groestl kernel is already faster than the myriad-groestl?

Also, are you planning on putting your work on github? Again, if not, could you explain why not?

It seems to me that both are important ways to further your efforts and establish your reputation.

Best regards as always.

HR

Myr-Groestl must do SHA256 as well, IIRC - of course pure Groestl is faster.

myr-groestl should be faster because it has a single pass of groestl (14 iterations) + sha; groestlcoin is groestl + groestl again, so slower.
it's just that I do not have enough free time to work on all these algos.....
Now wolf0 just did a fantastic job on whirlpoolx and I want to understand the magic ;-)

Haha, you ain't seen impressive yet! Check the thread, I'm about to post again!

pallas

Quote from: Wolf0 on March 16, 2015, 02:59:02 PM

Quote from: smolen on March 16, 2015, 02:54:01 PM

Quote from: Wolf0 on March 16, 2015, 10:30:25 AM

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.

Have you rotated table values left by 3 bits? ;)

Not sure it will help with register usage though...

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well done code by Pallas. I'm not really used to seeing anyone with a semblance of clue doing AMD miners. :P

Now I've put some parts of the code (e.g. the list of RBTT calls) into for loops marked with #pragma unroll, and it looks much better ;-)
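To illustrate the kind of cleanup described here, below is a minimal C sketch, not Pallas' actual kernel code. `toy_table` is a hypothetical stand-in for the real Groestl lookup table, and `column_expanded` / `column_looped` are made-up names. The point is only that a hand-written list of near-identical lookups and an unrolled loop compute the same thing, while the loop is much easier to read. In OpenCL C a plain `#pragma unroll` would normally do; the preprocessor guard is only there so that the sketch also builds with a host compiler.

```c
#include <stdint.h>

/* Hypothetical toy table: a stand-in for the real 256-entry Groestl table. */
static uint64_t toy_table(uint8_t x)
{
    return (0x0101010101010101ULL * x) ^ 0x9E3779B97F4A7C15ULL;
}

/* Rotate left; n == 0 is handled separately to avoid a shift by 64. */
static uint64_t rotl64(uint64_t v, unsigned n)
{
    n &= 63;
    return n ? (v << n) | (v >> (64 - n)) : v;
}

/* Hand-expanded form: one line per byte position, like a list of macro calls. */
static uint64_t column_expanded(const uint64_t s[8])
{
    uint64_t r = 0;
    r ^= rotl64(toy_table((uint8_t)(s[0] >>  0)),  0);
    r ^= rotl64(toy_table((uint8_t)(s[1] >>  8)),  8);
    r ^= rotl64(toy_table((uint8_t)(s[2] >> 16)), 16);
    r ^= rotl64(toy_table((uint8_t)(s[3] >> 24)), 24);
    r ^= rotl64(toy_table((uint8_t)(s[4] >> 32)), 32);
    r ^= rotl64(toy_table((uint8_t)(s[5] >> 40)), 40);
    r ^= rotl64(toy_table((uint8_t)(s[6] >> 48)), 48);
    r ^= rotl64(toy_table((uint8_t)(s[7] >> 56)), 56);
    return r;
}

/* Same computation as a loop that the compiler is asked to unroll fully. */
static uint64_t column_looped(const uint64_t s[8])
{
    uint64_t r = 0;
#if defined(__clang__)
#pragma unroll
#elif defined(__GNUC__)
#pragma GCC unroll 8
#endif
    for (int k = 0; k < 8; k++)
        r ^= rotl64(toy_table((uint8_t)(s[k] >> (8 * k))), 8 * k);
    return r;
}
```

Whether the unrolled loop produces the same ISA as the hand-written list depends on the compiler and driver version, so it is worth checking the register count after such a change.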

smolen

hero member

Activity: 524

Merit: 500

Quote from: Wolf0 on March 16, 2015, 10:30:25 AM

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.

Have you rotated table values left by 3 bits? ;)

Not sure it will help with register usage though...

Wolf0

Quote from: HR on February 28, 2015, 04:35:46 AM

Pallas,

Are you planning on adding myriad-groestl support in the future? If not, could you explain why not? Is it because your groestl kernel is already faster than the myriad-groestl?

Also, are you planning on putting your work on github? Again, if not, could you explain why not?

It seems to me that both are important ways to further your efforts and establish your reputation.

Best regards as always.

HR

Myr-Groestl must do SHA256 as well, IIRC - of course pure Groestl is faster.

smolen

Quote from: Wolf0 on March 07, 2015, 06:30:45 PM

Quote from: smolen on March 06, 2015, 10:55:11 PM

Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?

I don't think so.

Maybe VCC (vector condition code) will do the trick, so normal and bitsliced operations could be cheaply interleaved

Quote from: realhet on March 16, 2015, 10:33:40 AM

NV has a 3 op logic instruction with all the possible 16*16 logic operator combinations.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found byte_swizzle and no 3 operand logic instructions either.

Yes, AMD's GCN is outclassed here by VPTERNLOGD and VPTERNLOGQ from Intel AVX512 and by LOP3.LUT from NVidia :(
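For readers unfamiliar with these instructions, here is a rough C model of what a three-input LUT logic op computes. It is a sketch, not vendor code: `lop3` is a hypothetical helper name, and the truth-table indexing (bit index `(a<<2)|(b<<1)|c`, so the immediate for a function `f` is `f(0xF0, 0xCC, 0xAA)`) follows the convention documented for the PTX `lop3` instruction as far as I know.

```c
#include <stdint.h>

/* Software model of a 3-input logic op selected by an 8-bit truth table.
 * Bit i of lut gives the output for inputs a = bit 2 of i, b = bit 1 of i,
 * c = bit 0 of i, applied independently to every bit position. */
static uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((lut >> i) & 1u) {
            /* Minterm i: take each input as-is or complemented. */
            uint32_t ta = (i & 4) ? a : ~a;
            uint32_t tb = (i & 2) ? b : ~b;
            uint32_t tc = (i & 1) ? c : ~c;
            r |= ta & tb & tc;
        }
    }
    return r;
}
```

For example, `lop3(a, b, c, 0x96)` is `a ^ b ^ c` and `lop3(a, b, c, 0xCA)` is the bit select `(a & b) | (~a & c)`. With only two-operand instructions each of those needs two or three operations, which is where the gap for bitsliced code comes from.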

realhet

newbie

Activity: 32

Merit: 0

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

Hi,

Out of curiosity I really had to check that bitsliced code :D

and well... I must say that NV has better instructions to do it:
__byte_perm(x, 0, 0x1010)>>s: this could be emulated by an AND, a MAD24 and a SHR: 3 cycles instead of 2.
__byte_perm(x, 0, 0x3232)>>s: SHR, MAD24, SHR: also 3 instead of 2.
__byte_perm(x, y, 0x5410): SHL, BFE: 2 instructions instead of 1. (Even Intel SSE has had plenty of instructions for these things for ages :S)
And there are lots of bitwise logical instructions where NV is 2x faster because NV has a 3 op logic instruction with all the possible 16*16 logic operator combinations.
There is shuffling between 4 lanes: that is not a problem on GCN with ds_swizzle; otherwise it needs LDS in OpenCL.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found byte_swizzle and no 3 operand logic instructions either.

Anyway, it would be interesting to see how this totally different approach performs compared to the table-based one.
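The three byte permutations mentioned in this post can be checked on the host with a small C sketch. It assumes the selectors are meant as hexadecimal (0x1010, 0x3232, 0x5410). `byte_perm` is a plain reference model of CUDA's `__byte_perm` (low three bits of each selector nibble pick one of the eight bytes of `y:x`), and the `perm_*` helpers are hypothetical names for one possible arithmetic formulation. They are not a claim about the exact GCN instruction sequence.

```c
#include <stdint.h>

/* Reference model of __byte_perm: result byte i is byte (selector nibble i)
 * of the 8-byte pool formed by x (bytes 0..3) and y (bytes 4..7). */
static uint32_t byte_perm(uint32_t x, uint32_t y, uint32_t s)
{
    uint64_t pool = ((uint64_t)y << 32) | x;
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t sel = (s >> (4 * i)) & 0x7u;
        r |= (uint32_t)((pool >> (8 * sel)) & 0xFFu) << (8 * i);
    }
    return r;
}

/* 0x1010: low 16 bits replicated into both halves (AND, then multiply). */
static uint32_t perm_1010(uint32_t x) { return (x & 0xFFFFu) * 0x10001u; }

/* 0x3232: high 16 bits replicated into both halves (shift, then multiply). */
static uint32_t perm_3232(uint32_t x) { return (x >> 16) * 0x10001u; }

/* 0x5410: low half of x in the low half, low half of y in the high half. */
static uint32_t perm_5410(uint32_t x, uint32_t y)
{
    return (x & 0xFFFFu) | (y << 16);
}
```

The multiplier 0x10001 fits in 24 bits and the other operand is at most 16 bits wide, which is why a 24-bit multiply is enough for the first two cases.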

iju76

full member

Activity: 194

Merit: 100

win7-64 -- sgminer-5-dev-neoscrypt-windows-new2 -- dr-14.7

http://s001.radikal.ru/i194/1503/f3/09a2627a6270.png

pallas

Quote from: Wolf0 on March 16, 2015, 07:53:20 AM

Quote from: pallas on March 16, 2015, 07:46:38 AM

Quote from: Wolf0 on March 16, 2015, 07:26:56 AM

Quote from: pallas on March 09, 2015, 03:45:42 AM

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.

I do not have the card so I can't test it, but I know that on hawaii it can use two wavefronts, but only 1 on tahiti.
Does your kernel run 2 wavefronts on tahiti, as the asm version does?
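As background for the wavefront question, here is a small C sketch of how VGPR usage limits the number of wavefronts per SIMD on GCN. The figures (256 VGPRs per lane per SIMD, allocation in steps of 4, at most 10 waves) are the commonly documented GCN numbers. The function name is made up, and that VGPRs are the limiting resource for these kernels is an assumption, since the thread does not give register counts. SGPR and LDS limits are ignored.

```c
#include <stdint.h>

/* Estimate wavefronts per SIMD from the VGPR count of one work-item.
 * Assumes GCN-style limits; ignores SGPR and LDS constraints. */
static unsigned gcn_waves_per_simd(unsigned vgprs_per_thread)
{
    const unsigned vgpr_budget = 256; /* VGPRs per lane per SIMD */
    const unsigned granule     = 4;   /* allocation granularity */
    const unsigned max_waves   = 10;  /* hardware cap per SIMD */

    if (vgprs_per_thread == 0)
        return max_waves;

    /* Round the request up to the allocation granularity. */
    unsigned alloc = (vgprs_per_thread + granule - 1) / granule * granule;
    unsigned waves = vgpr_budget / alloc;
    return waves > max_waves ? max_waves : waves;
}
```

Under these assumptions a kernel at 128 VGPRs gets two waves, while one at 129 or 130 drops to a single wave, which would explain why shaving off just a couple of registers can matter so much.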

pallas

Quote from: Wolf0 on March 16, 2015, 07:26:56 AM

Quote from: pallas on March 09, 2015, 03:45:42 AM

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

pallas

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

smolen

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?
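A minimal C sketch of the transpose / bitsliced calculation / transpose back idea, under stated assumptions: 32 bytes are packed into eight 32-bit bit-planes, and the bitsliced operation shown is multiplication by 2 in GF(2^8) with the AES polynomial (which Groestl's MixBytes also uses) rather than the full S-box circuit, which is much longer. All function names are hypothetical. Each plane built by the pack step is what PMOVMSKB would return for the top bit of each byte, generalised to every bit position.

```c
#include <stdint.h>

/* Transpose: bit j of plane[k] is bit k of in[j]. */
static void bitslice_pack(const uint8_t in[32], uint32_t plane[8])
{
    for (int k = 0; k < 8; k++) {
        uint32_t p = 0;
        for (int j = 0; j < 32; j++)
            p |= (uint32_t)((in[j] >> k) & 1u) << j;
        plane[k] = p;
    }
}

/* Inverse transpose: rebuild the 32 bytes from the 8 planes. */
static void bitslice_unpack(const uint32_t plane[8], uint8_t out[32])
{
    for (int j = 0; j < 32; j++) {
        uint8_t b = 0;
        for (int k = 0; k < 8; k++)
            b |= (uint8_t)(((plane[k] >> j) & 1u) << k);
        out[j] = b;
    }
}

/* Multiply all 32 bytes by 2 in GF(2^8) mod x^8 + x^4 + x^3 + x + 1,
 * using only whole-plane moves and XORs (no per-byte branches). */
static void bitslice_xtime(uint32_t p[8])
{
    uint32_t carry = p[7];   /* bytes whose top bit was set */
    p[7] = p[6];
    p[6] = p[5];
    p[5] = p[4];
    p[4] = p[3] ^ carry;
    p[3] = p[2] ^ carry;
    p[2] = p[1];
    p[1] = p[0] ^ carry;
    p[0] = carry;
}

/* Byte-wise reference for comparison. */
static uint8_t xtime_ref(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}
```

The naive pack and unpack loops here are for clarity only. In a real kernel the transposes are the expensive part, which is why the question about a PMOVMSKB-like instruction matters.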

pallas
