[ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 6.

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: smolen on March 16, 2015, 02:54:01 PM

Quote from: Wolf0 on March 16, 2015, 10:30:25 AM

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.

Have you rotated table values left by 3 bits? ;)

Not sure it will help with register usage, though...

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well-done code by Pallas. I'm not really used to seeing anyone with a semblance of a clue doing AMD miners. :P

smolen

hero member

Activity: 524

Merit: 500

Quote from: Wolf0 on April 08, 2015, 03:46:11 PM

Quote from: smolen on April 08, 2015, 03:27:08 PM

Quote from: Wolf0 on April 08, 2015, 03:23:55 PM

That's about as far as my parse got before I went, "Is that a fucking NULL pointer dereference?"

Yes

The indexed address is calculated in bitselect. LUT0 and LUT4 indexing is just a single AND operation.
EDIT: Oh, wait, UINT8 is a byte, not an int vector. I probably went too far redefining every type.

Okay... I'm guessing that you've removed bits from the tables and are regenerating them on the fly, but I can't quite figure out how. Then again, bitwise ops aren't really my best subject...

The tables are constant, just prerotated left by 3 bits (a shift by 3 scales an index by 8 bytes, the size of one uint2). Well, this stuff needs comments if the kernel is to be published. The money is in X11 and Monero; there's not much value in the Whirlpool code. I could just drop it somewhere, but it would give everyone a free boost in X11 :(
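To make the prerotation idea concrete, here is a minimal host-side C sketch of how I read the post above; it is not smolen's kernel. It assumes 256-entry tables of 8-byte values, so byte index b means byte offset b * 8, and it assumes the state is kept rotated left by 3 bits. All helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate helpers; n is reduced mod 64 so that n == 0 is safe. */
static uint64_t rotl64(uint64_t x, unsigned n) {
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

static uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63u;
    return n ? (x >> n) | (x << (64u - n)) : x;
}

/* Plain form: extract byte i of v, then scale it to a byte offset (* 8). */
static uint32_t offset_plain(uint64_t v, unsigned i) {
    assert(i < 8);
    return (uint32_t)((v >> (8u * i)) & 0xFFu) << 3;
}

/* Prerotated form: r is assumed to hold rotl64(v, 3), so byte i already
   sits at bits 3..10 once r is moved down by a multiple of 8.
   For i == 0 this is a single AND on the low word; for i == 4 it is a
   single AND on the high word, since no real shift is needed there. */
static uint32_t offset_prerot(uint64_t r, unsigned i) {
    assert(i < 8);
    return (uint32_t)rotr64(r, 8u * i) & 0x7F8u;
}

/* Returns 1 if both forms give the same table offset for all 8 bytes. */
int prerot_matches_plain(uint64_t v) {
    uint64_t r = rotl64(v, 3);
    for (unsigned i = 0; i < 8; ++i) {
        if (offset_plain(v, i) != offset_prerot(r, i)) {
            return 0;
        }
    }
    return 1;
}
```

Prerotating the table contents is enough to keep the state rotated because rotation distributes over XOR: rotl64(a ^ b, 3) equals rotl64(a, 3) ^ rotl64(b, 3), so XORing prerotated table outputs gives an already rotated result. Calling prerot_matches_plain on any 64-bit value should return 1.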

smolen

Quote from: Wolf0 on April 08, 2015, 03:23:55 PM

That's about as far as my parse got before I went, "Is that a fucking NULL pointer dereference?"

Yes

The indexed address is calculated in bitselect. LUT0 and LUT4 indexing is just a single AND operation.
EDIT: Oh, wait, UINT8 is a byte, not an int vector. I probably went too far redefining every type.

The X64/ASX64 macros keep the code debuggable on the CPU - MSVC is too handy.

Code:

#ifdef __OPENCL_VERSION__
#define X64 uint2
#define ASX64(v) (as_uint2(v))
#else
#define X64 UINT64
#define ASX64(v) (v)
#endif

smolen

Quote from: pallas on April 08, 2015, 09:23:06 AM

I was wondering if we (miner developers) should unite to take the best out of it.

A cartel would take all the fun out of the game and possibly destroy the PoW world. On the other hand, the PoS landscape could benefit from some polishing.

Wolf0

Quote from: realhet on March 16, 2015, 10:33:40 AM

Quote from: smolen on March 05, 2015, 11:26:47 PM

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

Hi,

Because of my curiosity I really had to check that bitsliced code :D

and well... I must say that NV has better instructions to do it:
__byte_perm(x, 0, 1010)>>s: this could be emulated by an AND, a MAD24 and a SHR: 3 cycles instead of 2.
__byte_perm(x, 0, 3232)>>s: SHR, MAD24, SHR: also 3 instead of 2.
__byte_perm(x, y, 5410): SHL, BFE: 2 instructions instead of 1. (Even Intel SSE has had many instructions for these things for ages :S)
And there are lots of bitwise logical instructions where NV is 2x faster, because NV has a 3-operand logic instruction with all the possible 16*16 logic operator combinations.
There is shuffling between 4 lanes: that is not a problem on GCN with ds_swizzle; otherwise it needs LDS on OpenCL.
I've just checked the GCN 1.3 ISA manual and (at least there) I found neither a byte swizzle nor 3-operand logic instructions.

Anyway, it would be interesting to see how this totally different approach performs compared to the table-based one.
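For anyone who wants to check the three emulations listed above, here is a rough C sketch. It assumes the selectors are hexadecimal (0x1010, 0x3232, 0x5410) and models __byte_perm the way the CUDA documentation describes it, with each selector nibble using its low 3 bits to pick one of the 8 source bytes. The trailing >>s shift is left out, and the function names are invented for this example.

```c
#include <stdint.h>

/* Reference model of __byte_perm(x, y, s): selector nibble n (low 3 bits)
   picks which of the 8 source bytes (x holds bytes 0..3, y holds bytes 4..7)
   becomes result byte n. */
static uint32_t byte_perm_ref(uint32_t x, uint32_t y, uint32_t s) {
    uint64_t src = ((uint64_t)y << 32) | x;
    uint32_t out = 0;
    for (unsigned n = 0; n < 4; ++n) {
        unsigned sel = (s >> (4u * n)) & 7u;
        out |= (uint32_t)((src >> (8u * sel)) & 0xFFu) << (8u * n);
    }
    return out;
}

/* 0x1010: duplicate the low 16 bits. AND, then multiply by 0x10001
   (the multiply-add step in the post, with a zero addend). */
static uint32_t dup_low16(uint32_t x) {
    return (x & 0xFFFFu) * 0x10001u;
}

/* 0x3232: duplicate the high 16 bits. SHR, then the same multiply. */
static uint32_t dup_high16(uint32_t x) {
    return (x >> 16) * 0x10001u;
}

/* 0x5410: low 16 bits of x in the low half, low 16 bits of y in the high half. */
static uint32_t pack_low16(uint32_t x, uint32_t y) {
    return (x & 0xFFFFu) | (y << 16);
}

/* Returns 1 if all three emulations match the reference model. */
int byte_perm_emulations_agree(uint32_t x, uint32_t y) {
    return byte_perm_ref(x, 0, 0x1010u) == dup_low16(x)
        && byte_perm_ref(x, 0, 0x3232u) == dup_high16(x)
        && byte_perm_ref(x, y, 0x5410u) == pack_low16(x, y);
}
```

This only checks that the arithmetic is equivalent; it says nothing about the cycle counts quoted in the post, which depend on the actual GPU instructions.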

Well, I've done Whirlpool-512 with no lookups at all, and it kinda sucks on GPU. It'll probably be a beast on FPGA, though!

smolen

Quote from: sp_ on April 08, 2015, 05:55:23 AM

Good work on the groestl. Smolen's quark miner does around 2 Mhash on the 280x.
My gtx 980 does 20 Mhash. The competition is sleeping...

Some of the competitors are awake, doing exercises with pen and paper to get all the AES-wannabes at once :D

Doing it all by hand, algo by algo, would just be boring.

Quote from: sp_ on April 08, 2015, 09:17:44 AM

15 years ago I worked for a company in Silicon Valley. My colleagues earned xxx.xxx$ a year, but I was a student at San Francisco State U.

Quote from: sp_ on April 08, 2015, 09:34:17 AM

I've also lived and worked in St. Petersburg, Russia. My colleagues are some of the best programmers in the world.

Triangulated

Quote from: pallas on April 08, 2015, 04:35:38 AM

Some of your last tips (and smolen's) can be applied to this kernel as well, I think it can reach 38/40 Mh/s ;-)

This is the last-but-one trick in my WhirlpoolX kernel. Anyway, I'm going to abandon the table approach, so there's not much sense in keeping it secret.

Code:

static const CONSTANT UINT64 arrPrecalc_post_l27[256] = ...
#define baseL27 ((UINT32)&arrPrecalc_post_l27[0])
#define TC0off8_l27(off8) (*(const CONSTANT UINT64*)&(((const CONSTANT UINT8*)0)[off8]))
#define LUT3_r3(v) ASX64(TC0off8_l27(bitselect(baseL27, (UINT32)(as_ulong(v) >> 24), 0x7F8U)))
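A small host-side C sketch of what the bitselect in LUT3_r3 appears to do, as far as I can tell from the macros above. OpenCL's bitselect(a, b, c) takes each result bit from b where c has a 1 and from a elsewhere, so the table's base address gets its bits 3..10 replaced by the index bits of the state, and the result is then dereferenced through the zero-based pointer cast (hence the "NULL pointer" look). The sketch assumes bits 3..10 of the base are zero, for example a 2 KiB-aligned table; the original post does not state this. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for OpenCL bitselect(a, b, c):
   result bit = bit of b where c is 1, bit of a where c is 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* Merge a table base address with the byte offset held in bits 3..10 of
   (r >> shift). Precondition (an assumption of this sketch): those bits
   are zero in base, so the merge equals base + offset. */
static uint32_t lut_address(uint32_t base, uint64_t r, unsigned shift) {
    assert((base & 0x7F8u) == 0);
    return bitselect32(base, (uint32_t)(r >> shift), 0x7F8u);
}
```

With a base of 0x20000 and a state whose bits 27..34 hold 0x558 >> 3, lut_address(base, r, 24) gives 0x20558, the same as adding the masked offset to the base but done in a single select operation.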

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Nah. I don't need to work.. My program is making money

Wolf0

Quote from: pallas on March 16, 2015, 08:12:34 AM

Quote from: Wolf0 on March 16, 2015, 07:53:20 AM

Quote from: pallas on March 16, 2015, 07:46:38 AM

Quote from: Wolf0 on March 16, 2015, 07:26:56 AM

Quote from: pallas on March 09, 2015, 03:45:42 AM

Quote from: Wolf0 on March 06, 2015, 02:25:15 PM

Quote from: pallas on March 06, 2015, 04:04:19 AM

Quote from: smolen on March 05, 2015, 11:26:47 PM

Quote from: Heavyiron on March 01, 2015, 02:06:44 PM

I have groestl code from smelter (the first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.

Today my code is obsolete, it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code

I'd really like to see that implemented on sgminer, but I'm not sure it'll be faster: nvidia is a much different architecture.
On a side note, interest in mining groestlcoin based PoW coins is fading because the only coin with enough volume is switching to 1/10 reward soon, the others are dying, dead or with very little reward anyway.

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 Mh/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.

thanks Wolf0, but I already got over 34, see op. (experimental v2, bin some posts ago)
it's 2-3% faster than asm version.
it's only for Hawaii and 14.12, though; 14.9 is damned!
next step is bitslicing, but I do not have the time to work on it ;-)

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.

I do not have the card so I can't test it, but I know that on hawaii it can use two wavefronts, but only 1 on tahiti.
Does your kernel run 2 wavefronts on tahiti, as the asm version does?

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.

qwep1

hero member

Activity: 610

Merit: 500

Quote from: pallas on April 08, 2015, 09:52:42 AM

Quote from: sp_ on April 08, 2015, 09:34:17 AM

I've also lived and worked in St. Petersburg, Russia. My colleagues are some of the best programmers in the world.

OK, but it looks like you've mistaken this thread for a job search one :-D

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer