
Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 7. (Read 61261 times)

member
Activity: 81
Merit: 1002
It was only the wind.
It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.
Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?

I don't think so.
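
The "transpose, do bitsliced calculation and transpose back" idea can be sketched in plain C. This is a minimal illustration, not code from any miner in this thread: the function names are hypothetical, and instead of the full S-box (which needs a long gate network) it bitslices the much shorter multiply-by-2 in GF(2^8) with the AES/Groestl polynomial x^8+x^4+x^3+x+1, just to show the data flow.

```c
#include <assert.h>
#include <stdint.h>

/* Transpose 32 bytes into 8 bit-planes: bit i of plane[k] is bit k of in[i]. */
void to_bitsliced(const uint8_t in[32], uint32_t plane[8])
{
    for (int k = 0; k < 8; k++) {
        plane[k] = 0;
        for (int i = 0; i < 32; i++)
            plane[k] |= (uint32_t)((in[i] >> k) & 1u) << i;
    }
}

/* Inverse transpose: rebuild the 32 bytes from the 8 bit-planes. */
void from_bitsliced(const uint32_t plane[8], uint8_t out[32])
{
    for (int i = 0; i < 32; i++) {
        out[i] = 0;
        for (int k = 0; k < 8; k++)
            out[i] |= (uint8_t)(((plane[k] >> i) & 1u) << k);
    }
}

/* Multiply all 32 bytes by 2 in GF(2^8) at once, using only moves and XORs.
   The reduction constant 0x1B touches bits 0, 1, 3 and 4. */
void xtime_bitsliced(const uint32_t p[8], uint32_t r[8])
{
    r[0] = p[7];
    r[1] = p[0] ^ p[7];
    r[2] = p[1];
    r[3] = p[2] ^ p[7];
    r[4] = p[3] ^ p[7];
    r[5] = p[4];
    r[6] = p[5];
    r[7] = p[6];
}

/* Byte-at-a-time reference for the same operation. */
uint8_t xtime_ref(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1B : 0));
}

/* Returns 1 if transpose -> bitsliced op -> transpose back matches the reference. */
int bitslice_demo_ok(void)
{
    uint8_t in[32], out[32];
    uint32_t p[8], r[8];
    for (int i = 0; i < 32; i++)
        in[i] = (uint8_t)(i * 37 + 11);   /* arbitrary test pattern */
    to_bitsliced(in, p);
    xtime_bitsliced(p, r);
    from_bitsliced(r, out);
    for (int i = 0; i < 32; i++)
        if (out[i] != xtime_ref(in[i]))
            return 0;
    return 1;
}
```

The naive transposes cost far more than the operation they wrap, which is why the PMOVMSKB question matters: the approach only pays off if the transposes are cheap or amortized over many bitsliced operations.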
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Any chance of getting your latest OCL source to try on 280x (Hawaii)? :-)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah, Tahiti. 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)

I think I've beaten your ASM with pure OpenCL on 290X.

Some of your last tips (and smolen's) can be applied to this kernel as well, I think it can reach 38/40 Mh/s ;-)
member
Activity: 81
Merit: 1002
It was only the wind.
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Any chance of getting your latest OCL source to try on 280x (Hawaii)? :-)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah, Tahiti. 2 wavefronts not possible?

Both me and Wolf0 tried that and (at least for me) stopped trying after a while. Funny no longer ;-)
hero member
Activity: 630
Merit: 500
Any chance of getting your latest OCL source to try on 280x (Hawaii)? :-)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.

Doh, yeah, Tahiti. 2 wavefronts not possible?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Any chance of getting your latest OCL source to try on 280x (Hawaii)? :-)

I assume you meant Tahiti.
I've acquired a 280x myself: it's not worth using v2 on it, hashrate is lower than with v1.
hero member
Activity: 630
Merit: 500
Any chance of getting your latest OCL source to try on 280x (Hawaii)? :-)
member
Activity: 81
Merit: 1002
It was only the wind.

Pallas,

Are you planning on adding myriad-groestl support in the future? If not, could you explain why not? Is it because your groestl kernel is already faster than the myriad-groestl?

Also, are you planning on putting your work on github? Again, if not, could you explain why not?

It seems to me that both are important ways to further your efforts and establish your reputation.

Best regards as always.

HR

Myr-Groestl must do SHA256 as well, IIRC - of course pure Groestl is faster.

myr-groestl should be faster because it has a single round of groestl (14 iterations) + sha; groestlcoin is groestl + groestl again, so slower.
it's just that I do not have enough free time to work on all these algos.....
Now wolf0 just did a fantastic job on whirlpoolx and I want to understand the magic ;-)

Haha, you ain't seen impressive yet! Check the thread, I'm about to post again!
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I anticipated.
Have you rotated table values left by 3 bits? ;-) Not sure it will help with register usage though...

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well done code by Pallas. I'm not really used to seeing anyone with a semblance of a clue doing AMD miners. :-P

Now I've put some parts of the code (ex. the list of rbtts) in pragma unrolled for loops and it looks much better ;-)
hero member
Activity: 524
Merit: 500
Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I anticipated.
Have you rotated table values left by 3 bits? ;-) Not sure it will help with register usage though...
member
Activity: 81
Merit: 1002
It was only the wind.

Pallas,

Are you planning on adding myriad-groestl support in the future? If not, could you explain why not? Is it because your groestl kernel is already faster than the myriad-groestl?

Also, are you planning on putting your work on github? Again, if not, could you explain why not?

It seems to me that both are important ways to further your efforts and establish your reputation.

Best regards as always.

HR

Myr-Groestl must do SHA256 as well, IIRC - of course pure Groestl is faster.
hero member
Activity: 524
Merit: 500
Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?
I don't think so.
Maybe VCC (vector condition code) will do the trick, so normal and bitsliced operations could be cheaply interleaved.

NV has a 3 op logic instruction with all the possible 16*16 logic operator combinations.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found byte_swizzle and no 3 operand logic instructions either.
Yes, AMD's GCN is outclassed here by VPTERNLOGD and VPTERNLOGQ from Intel AVX512 and by LOP3.LUT from NVidia :-(
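
For readers unfamiliar with the 3-operand logic instructions mentioned above, here is a small C model of what such an instruction computes. It is a sketch, with the hypothetical name `lop3_emulated`: an 8-bit lookup table selects one of the 256 possible three-input Boolean functions, applied bitwise. The table convention assumed here is the usual one where the inputs correspond to the constants 0xF0, 0xCC and 0xAA, so the table for a function F is simply F(0xF0, 0xCC, 0xAA).

```c
#include <assert.h>
#include <stdint.h>

/* Truth-table constants for the three inputs (assumed convention). */
enum { TA = 0xF0, TB = 0xCC, TC = 0xAA };

/* Software model of a 3-input LUT logic op: for every bit position,
   the result bit is lut[(a << 2) | (b << 1) | c]. Built as an OR of
   the minterms whose table bit is set. */
uint32_t lop3_emulated(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 8; i++) {
        if ((lut >> i) & 1u) {
            uint32_t ta = (i & 4u) ? a : ~a;
            uint32_t tb = (i & 2u) ? b : ~b;
            uint32_t tc = (i & 1u) ? c : ~c;
            r |= ta & tb & tc;
        }
    }
    return r;
}

/* Example tables: three-way XOR and majority. */
uint8_t lut_xor3(void) { return (uint8_t)(TA ^ TB ^ TC); }
uint8_t lut_maj(void)  { return (uint8_t)((TA & TB) | (TA & TC) | (TB & TC)); }
```

Hardware with such an instruction evaluates any of these functions in one operation, whereas two-input logic needs two operations for a three-way XOR and more for majority. That is the gap being described for gate-heavy bitsliced code.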
newbie
Activity: 32
Merit: 0
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code
Hi,

Out of curiosity I really had to check that bitsliced code :-D and well... I must say that NV has better instructions for it:
__byte_perm(x, 0, 1010)>>s:  this could be emulated by an AND, a MAD24 and a SHR: 3 cycles instead of 2.
__byte_perm(x, 0, 3232)>>s:  SHR, MAD24, SHR: also 3 instead of 2.
__byte_perm(x, y, 5410)      :  SHL, BFE: 2 instructions instead of 1. (Even Intel SSE has had many instructions for these things for ages :S)
And there are lots of bitwise logical instructions where NV is 2x faster because NV has a 3 op logic instruction with all the possible 16*16 logic operator combinations.
There is also shuffling between 4 lanes: that is not a problem on GCN with ds_swizzle; otherwise it needs LDS in OpenCL.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found byte_swizzle and no 3 operand logic instructions either.

Anyways, it would be interesting to see how this totally different approach performs compared to the table-based one.
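
The byte-permute emulations listed in this post can be checked with a short C model. This is a sketch under stated assumptions: the selectors are read as hexadecimal (0x1010, 0x3232, 0x5410), the trailing `>>s` is left out, the multiply stands in for MAD24 with a zero addend, and the third case is written as a plain mask-and-OR rather than any specific GCN instruction. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of a byte permute: bytes 0..3 come from x, bytes 4..7
   from y; each selector nibble (low 3 bits) picks the source byte for
   one result byte. */
uint32_t byte_perm_ref(uint32_t x, uint32_t y, uint32_t sel)
{
    uint64_t src = ((uint64_t)y << 32) | x;
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned idx = (sel >> (4 * i)) & 7u;
        r |= (uint32_t)((src >> (8 * idx)) & 0xFFu) << (8 * i);
    }
    return r;
}

/* Selector 0x1010: duplicate the low 16 bits into both halves.
   AND, then multiply by 0x10001 (both operands fit in 24 bits). */
uint32_t perm_1010(uint32_t x) { return (x & 0xFFFFu) * 0x10001u; }

/* Selector 0x3232: duplicate the high 16 bits into both halves.
   Shift right, then the same multiply. */
uint32_t perm_3232(uint32_t x) { return (x >> 16) * 0x10001u; }

/* Selector 0x5410: low half of x in the low half, low half of y on top. */
uint32_t perm_5410(uint32_t x, uint32_t y) { return (x & 0xFFFFu) | (y << 16); }
```

With x = 0x11223344 and y = 0xAABBCCDD, the three emulations give 0x33443344, 0x11221122 and 0xCCDD3344, matching the reference model.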
full member
Activity: 194
Merit: 100
win7-64 -- sgminer-5-dev-neoscrypt-windows-new2 -- dr-14.7

http://s001.radikal.ru/i194/1503/f3/09a2627a6270.png
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.

I do not have the card so I can't test it, but I know that on hawaii it can use two wavefronts, but only 1 on tahiti.
Does your kernel run 2 wavefronts on tahiti, as the asm version does?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)
hero member
Activity: 524
Merit: 500
It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.
Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?
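
For context on the PMOVMSKB question: that x86 instruction gathers the top bit of every byte in a vector register into a compact bitmask, which is a convenient building block for the transpose. Where no such instruction exists, a 4-byte analogue can be emulated with an AND, a multiply and a shift. The C sketch below is an assumption-laden illustration (hypothetical function names, scalar 32-bit words only), not a claim about what the GCN compiler would emit.

```c
#include <assert.h>
#include <stdint.h>

/* Gather the most significant bit of each of the 4 bytes of x into a
   4-bit mask (bit j of the result = bit 7 of byte j). The multiplier
   2^21 + 2^14 + 2^7 + 1 moves bits 7, 15, 23, 31 to bits 28..31
   without any two partial products colliding. */
uint32_t movmask4(uint32_t x)
{
    return ((x & 0x80808080u) * 0x00204081u) >> 28;
}

/* Straightforward reference for comparison. */
uint32_t movmask4_ref(uint32_t x)
{
    uint32_t m = 0;
    for (int j = 0; j < 4; j++)
        m |= ((x >> (8 * j + 7)) & 1u) << j;
    return m;
}

/* Bit-plane k of the 4 bytes: shift bit k of every byte up to bit 7,
   then gather. Bits that spill into the next byte are masked off. */
uint32_t bitplane4(uint32_t x, unsigned k)
{
    return movmask4(x << (7u - k));
}
```

Calling `bitplane4` for k = 0..7 yields the eight bit-planes of four packed bytes, which is one way to build the forward transpose out of ordinary ALU operations.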
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.
hero member
Activity: 524
Merit: 500
I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code