Pages:
Author

Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 6. (Read 61261 times)

member
Activity: 81
Merit: 1002
It was only the wind.
Hmm... it's one hell of a lot harder than I anticipated to lose two goddamned VGPRs than I thought it'd be.
Have you rotated table values left by 3 bits? Wink Not sure it will help with register usage through...

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well done code by Pallas. I'm not really used to seeing anyone with a semblance of clue doing AMD miners.  Tongue
hero member
Activity: 524
Merit: 500
That's about as far as my parse got before I went, "Is that a fucking NULL pointer dereference?"
Yes Smiley Indexed address is calculated in bitselect. LUT0 and LUT4 indexing is just single AND operation.
EDIT: Oh, wait UINT8 is byte, not int vector. I probably went too far redefining every type Smiley

Okay... I'm guessing that you've removed bits from the tables and are regenerating them on the fly, but I can't quite figure out how. Then again, bitwise ops aren't really my best subject...
Tables are constant, just prerotated left by 3 bit (size of one uint2 when used as index). Well, this stuff needs comments, if kernel will be published. Money are in X11 and Monero, not so much value in Whirlpool code, I could just drop it somewhere, but it will give everyone free boost in X11 Sad
hero member
Activity: 524
Merit: 500
That's about as far as my parse got before I went, "Is that a fucking NULL pointer dereference?"
Yes Smiley Indexed address is calculated in bitselect. LUT0 and LUT4 indexing is just single AND operation.
EDIT: Oh, wait UINT8 is byte, not int vector. I probably went too far redefining every type Smiley
X64/ASX64 macros keep code debugable on CPU - MSVC is too handy
Code:
#ifdef __OPENCL_VERSION__
#define X64 uint2
#define ASX64(v) (as_uint2(v))
#else
#define X64 UINT64
#define ASX64(v) (v)
#endif
hero member
Activity: 524
Merit: 500
I was wondering if us (miner developers) should unite to take the best out of it.
Cartel will take all the fun out of game and possibly destroy PoW world. On the other hand, PoS landscape could benefit from some polishing Smiley
member
Activity: 81
Merit: 1002
It was only the wind.
I have groestl code from smelter (first GPU miner for quark). May be it have some tricks for your work. It was rather fast on radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code
Hi,

Because of my curiosity I really had to check that bitsliced code Cheesy and well... I must say that NV has better instructions to do it:
__byte_perm(x, 0, 1010)>>s:  this could be emulated by an AND and a MAD24 and az SHR. 3 instead of 2 cycle.
__byte_perm(x, 0, 3232)>>s:  SHR, MAD24, SHR   also 3 instead of 2.
__byte_perm(x, y, 5410)      :  SHL, BFE      2 instead of 1 instr.  (Even the Intel SSE has many instructions for these things since ages :S)
And there are lots of bitwise logical instructions where NV is 2x faster because NV has a 3 op logic instruction with all the possible 16*16 logic operator combinations.
There are shuffling between 4 lanes: That is not a problem on GCN with ds_swizzle, otherwise it needs LDS on OpenCL.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found byte_swizzle and no 3 operand logic instructions either.

Anyways, It would be interesting that how this totally different approach can perform compared to the table based one.

Well, I've done Whirlpool-512 with no lookups at all, and it kinda sucks on GPU. It'll probably be a beast on FPGA, though!
hero member
Activity: 524
Merit: 500
Good work on the groest. Smolens quark miner does around 2 mhash on the 280x.
My gtx 980 does 20mhash. The competition is sleeping...
Some of competitors are awake, taking exercises with pen and paper to get all AES-wannabees at once Cheesy Doing it all by hand, algo by algo will be just boring.


15 years ago I worked for a company in the silicon valley. My collegues earned xxx.xxx$ a year but I was a student at san francisco state u.
I've also lived and worked in st. Petersburg Russia. My collegues are some of the best programmers in the world.
Triangulated

Some of your last tips (and smolen's) can be applied to this kernel as well, I think it can reach 38/40 Mh/s ;-)

Last but one trick in my WhirlpoolX kernel. Anyway, I'm going to abandon table approach, no much sense to keep it secret.
Code:
static const CONSTANT UINT64 arrPrecalc_post_l27[256] = ...
#define baseL27 ((UINT32)&arrPrecalc_post_l27[0])
#define TC0off8_l27(off8) (*(const CONSTANT UINT64*)&(((const CONSTANT UINT8*)0)[off8]))
#define LUT3_r3(v) ASX64(TC0off8_l27(bitselect(baseL27, (UINT32)(as_ulong(v) >> 24), 0x7F8U)))
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Nah.  I don't need to work.. My program is making money
member
Activity: 81
Merit: 1002
It was only the wind.
I have groestl code from smelter (first GPU miner for quark). May be it have some tricks for your work. It was rather fast on radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achived on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such project is a few minutes a day, it would take months. Volounters? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.

I do not have the card so I can't test it, but I know that on hawaii it can use two wavefronts, but only 1 on tahiti.
Does your kernel run 2 wavefronts on tahiti, as the asm version does?

Hmm... it's one hell of a lot harder than I anticipated to lose two goddamned VGPRs than I thought it'd be.
hero member
Activity: 610
Merit: 500
I've also lived and worked in st. Petersburg Russia. My collegues are some of the best programmers in the world.

Ok but it looks like you mistaken this thread for a job search one :-D
Grin Grin Grin
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I've also lived and worked in st. Petersburg Russia. My collegues are some of the best programmers in the world.

Ok but it looks like you mistaken this thread for a job search one :-D
member
Activity: 81
Merit: 1002
It was only the wind.
I have groestl code from smelter (first GPU miner for quark). May be it have some tricks for your work. It was rather fast on radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achived on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such project is a few minutes a day, it would take months. Volounters? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.

I do not have the card so I can't test it, but I know that on hawaii it can use two wavefronts, but only 1 on tahiti.
Does your kernel run 2 wavefronts on tahiti, as the asm version does?

Mine's got 2 waves in flight on Hawaii - I believe editing to get another wave in flight on Tahiti and Pitcairn should be simple, stand by.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I've also lived and worked in st. Petersburg Russia. My collegues are some of the best programmers in the world.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Today i earn $xxx.xxx  a year. Optimizing is just a hobby..

same for me.
still, if a fun job also remunerates, it's even better ;-)
member
Activity: 81
Merit: 1002
It was only the wind.
I have groestl code from smelter (first GPU miner for quark). May be it have some tricks for your work. It was rather fast on radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achived on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such project is a few minutes a day, it would take months. Volounters? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Today i earn $xxx.xxx  a year. Optimizing is just a hobby..
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
15 years ago I worked for a company in the silicon valley. My collegues earned xxx.xxx$ but I was a student at san francisco state u.

20 years ago I started programming professionally.
Still, a lot of my work is free or almost free :-)
I was wondering if us (miner developers) should unite to take the best out of it.
member
Activity: 81
Merit: 1002
It was only the wind.
I have groestl code from smelter (first GPU miner for quark). May be it have some tricks for your work. It was rather fast on radeon HD 5xxx series.
Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achived on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such project is a few minutes a day, it would take months. Volounters? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
15 years ago I worked for a company in the silicon valley. My collegues earned xxx.xxx$ a year but I was a student at san francisco state u.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Good work on the groest. Smolens quark miner does around 2 mhash on the 280x.
My gtx 980 does 20mhash. The competition is sleeping...

I think that just applying some well known tricks, already available on public kernels, will bring quark hashrate to around 10.
Thing is, it's not funny. Optimizing single kernel algos is much more interesting, imho.
member
Activity: 81
Merit: 1002
It was only the wind.
It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.
Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?

I don't think so.
Pages:
Jump to: