New OpenCL Kernel for Myriad-Groestl (DGB, MYR, etc.) - page 2.

prichina

newbie

Activity: 82

Merit: 0

My 7970 is doing fine with sgminer 5.1.1 at 35 MH/s, but my R9 290X only reaches 40-42 MH/s, which is very low. Can someone help? Please send me a kernel or bin so I can make it work. I've tried ghostlander's myriad-groestl.cl, but nothing changed: still 35 MH/s on the 7970 and 40-42 MH/s on the R9 290X. If someone needs video editing work, I can make it happen; send me a private message. Kind regards, Ivo Icevski

navydude

sr. member

Activity: 283

Merit: 250

Wondering if someone would compile this for Windows. Would be much appreciated!

Tmdz

hero member

Activity: 1008

Merit: 1000

Nice work. My 7950 went from 7 MH/s to 27 MH/s, but I think mining Skein will still earn you more with DGB.

On the technical side, that kind of efficiency improvement is simply amazing.

Koltan

newbie

Activity: 57

Merit: 0

A Radeon HD7790 at 1200/1600 mining MYR got 17.5 MH/s on this kernel. That's twice as fast as the original.

proctologic

sr. member

Activity: 241

Merit: 250

Quote from: depboy on October 18, 2015, 05:56:49 AM

Quote from: djm34 on October 18, 2015, 05:02:24 AM

Quote from: depboy on October 18, 2015, 04:49:42 AM

I don't understand why you'd be using a 290 to mine Myriad-Groestl on linux? Wrong algo.

huh ?
I don't understand why you are posting that ? wrong or random answer...

AFAIK, Myriad-Groestl is only used by DGB and MYR. And if you're gonna mine either of those coins with a 290 (I have half a dozen 290s), Skein is by far the better algo for that particular gpu. See https://bitcointalksearch.org/topic/skein-pimp-profit-switching-dgbmyr-merged-mining-uis-inc-p2pool-nodes-1186670 for more details.

Trinity to http://coinspool.cu.cc/workers_trinity

https://bitcointalksearch.org/topic/ann-trinity-tty-sha256d-scrypt-myr-groest-trinamic-block-value-system-1186025

proctologic

sr. member

Activity: 241

Merit: 250

Quote from: MinersPoolEU on October 22, 2015, 03:51:44 AM

Quote from: Laketear on October 22, 2015, 02:55:54 AM

What coin do you use Myriad-Groestl to mine?

There are:
- Saffroncoin
- Digibyte
- Myriadcoin

Trinitycoin

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Pallas is pretty good.

He bought an NVIDIA card and improved Neoscrypt by 10% in a couple of weeks.

CUDA, foreign language foreign technology...

respect

drr0ss

member

Activity: 98

Merit: 10

Quote from: MaxDZ8 on October 23, 2015, 10:50:00 AM

Quote from: ghostlander on October 23, 2015, 07:55:35 AM

Aren't they async by default in SGminer?

I mean async_work_group_copy & friends. I have very mixed feelings about them: on my hardware they just don't perform well, not even in terms of bandwidth usage. Maybe they work better on GCN1.1, IDK.
BTW, I would suggest putting a barrier(CLK_LOCAL_MEM_FENCE) right after the T-tables load. Besides, just have it looped!

Code:

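for(ulong i = 0; i < 8; ++i) {
   local ulong *tdst = T_local + 256 * i; // start of table i; not even really required if you alloc your local T tables in one block
   global const ulong *tsrc = T_global; // restart at the source table for every i (assumes T_global is in global memory)
   for(int el = 0; el < 256; el += get_local_size(0)) {
      tdst[get_local_id(0)] = rotate(tsrc[get_local_id(0)], i * 8); // T entries are 64-bit, hence ulong
      tdst += get_local_size(0);
      tsrc += get_local_size(0);
   }
}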

That's more or less what pallas suggests. It is a fully coherent read. LDS has full scatter/gather capability; as long as each access ends in a different bank you're safe (here, it happens by construction).
Alternatively, the first block can be loaded by async_work_group_copy and you can derive T1_L from T0_L; LDS should provide a massive latency reduction by not having to round-trip to the (hot) L1 cache.
Loops such as this are fully unrolled in most cases.

Oh, how you are forcing an old man like me to retype the code and think.

Never mind, and cheers.

drr0ss

member

Activity: 98

Merit: 10

Quote from: pallas on October 23, 2015, 03:16:32 AM

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Hi Pallas, can you share your .cl with me? I will send you some feeds.....

MaxDZ8

hero member

Activity: 672

Merit: 500

Quote from: ghostlander on October 23, 2015, 07:55:35 AM

Aren't they async by default in SGminer?

I mean async_work_group_copy & friends. I have very mixed feelings about them: on my hardware they just don't perform well, not even in terms of bandwidth usage. Maybe they work better on GCN1.1, IDK.
BTW, I would suggest putting a barrier(CLK_LOCAL_MEM_FENCE) right after the T-tables load. Besides, just have it looped!

Code:

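for(ulong i = 0; i < 8; ++i) {
   local ulong *tdst = T_local + 256 * i; // start of table i; not even really required if you alloc your local T tables in one block
   global const ulong *tsrc = T_global; // restart at the source table for every i (assumes T_global is in global memory)
   for(int el = 0; el < 256; el += get_local_size(0)) {
      tdst[get_local_id(0)] = rotate(tsrc[get_local_id(0)], i * 8); // T entries are 64-bit, hence ulong
      tdst += get_local_size(0);
      tsrc += get_local_size(0);
   }
}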

That's more or less what pallas suggests. It is a fully coherent read. LDS has full scatter/gather capability; as long as each access ends in a different bank you're safe (here, it happens by construction).
Alternatively, the first block can be loaded by async_work_group_copy and you can derive T1_L from T0_L; LDS should provide a massive latency reduction by not having to round-trip to the (hot) L1 cache.
Loops such as this are fully unrolled in most cases.

ghostlander

legendary

Activity: 1244

Merit: 1020

No surrender, no retreat, no regret.

Quote from: pallas on October 23, 2015, 08:17:41 AM

Quote from: ghostlander on October 23, 2015, 07:55:35 AM

Thanks to those who have donated.

Quote from: pallas on October 23, 2015, 03:16:32 AM

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.

do you see the repeated instructions?
just change the "if" structure and you can remove them ;-)
i.e. if worksize <= 128 you need to do some additional rotates compared to default (256), some additional others if worksize == 64....
although the alternative for loop is a much more elegant solution and the difference in speed is negligible.

What you suggest results in fewer linear memory writes, which usually isn't good. I prefer to avoid loops if possible.

Code:

T0_L[lclid] = T0[lclid];
T1_L[lclid] = rotate(T0[lclid], 8UL);
T2_L[lclid] = rotate(T0[lclid], 16UL);
T3_L[lclid] = rotate(T0[lclid], 24UL);
T4_L[lclid] = rotate(T0[lclid], 32UL);
T5_L[lclid] = rotate(T0[lclid], 40UL);
T6_L[lclid] = rotate(T0[lclid], 48UL);
T7_L[lclid] = rotate(T0[lclid], 56UL);
#if (WORKSIZE < 256)
T0_L[lclid + 128] = T0[lclid + 128];
T1_L[lclid + 128] = rotate(T0[lclid + 128], 8UL);
T2_L[lclid + 128] = rotate(T0[lclid + 128], 16UL);
T3_L[lclid + 128] = rotate(T0[lclid + 128], 24UL);
T4_L[lclid + 128] = rotate(T0[lclid + 128], 32UL);
T5_L[lclid + 128] = rotate(T0[lclid + 128], 40UL);
T6_L[lclid + 128] = rotate(T0[lclid + 128], 48UL);
T7_L[lclid + 128] = rotate(T0[lclid + 128], 56UL);
#endif
#if (WORKSIZE < 128)
T0_L[lclid + 64] = T0[lclid + 64];
T0_L[lclid + 192] = T0[lclid + 192];
T1_L[lclid + 64] = rotate(T0[lclid + 64], 8UL);
T1_L[lclid + 192] = rotate(T0[lclid + 192], 8UL);
T2_L[lclid + 64] = rotate(T0[lclid + 64], 16UL);
T2_L[lclid + 192] = rotate(T0[lclid + 192], 16UL);
T3_L[lclid + 64] = rotate(T0[lclid + 64], 24UL);
T3_L[lclid + 192] = rotate(T0[lclid + 192], 24UL);
T4_L[lclid + 64] = rotate(T0[lclid + 64], 32UL);
T4_L[lclid + 192] = rotate(T0[lclid + 192], 32UL);
T5_L[lclid + 64] = rotate(T0[lclid + 64], 40UL);
T5_L[lclid + 192] = rotate(T0[lclid + 192], 40UL);
T6_L[lclid + 64] = rotate(T0[lclid + 64], 48UL);
T6_L[lclid + 192] = rotate(T0[lclid + 192], 48UL);
T7_L[lclid + 64] = rotate(T0[lclid + 64], 56UL);
T7_L[lclid + 192] = rotate(T0[lclid + 192], 56UL);
#endif

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: ghostlander on October 23, 2015, 07:55:35 AM

Thanks to those who have donated.

Quote from: pallas on October 23, 2015, 03:16:32 AM

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.

do you see the repeated instructions?
just change the "if" structure and you can remove them ;-)
i.e. if worksize <= 128 you need to do some additional rotates compared to default (256), some additional others if worksize == 64....
although the alternative for loop is a much more elegant solution and the difference in speed is negligible.

ghostlander

legendary

Activity: 1244

Merit: 1020

No surrender, no retreat, no regret.

Thanks to those who have donated.

Quote from: pallas on October 23, 2015, 03:16:32 AM

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.

Quote from: MaxDZ8 on October 23, 2015, 04:27:28 AM

What is your experience with async block reads?

Aren't they async by default in SGminer?

MaxDZ8

hero member

Activity: 672

Merit: 500

What is your experience with async block reads?

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

wildduck

legendary

Activity: 966

Merit: 1001

Thanks a lot for your work.

mkrypt

newbie

Activity: 33

Merit: 0

Thanks for your work, I will send you today's revenue in DGB. I set the memclock to 150 MHz whenever possible, undervolt the core and enjoy the power savings.

drr0ss

member

Activity: 98

Merit: 10

Quote from: ghostlander on October 22, 2015, 01:40:12 PM

Added support for work sizes of 64 and 128. The 1st one halves performance on Tahiti, the 2nd one adds +2% there in my case. YMMV

A single donation of 2179 DGB has been received so far. Come on miners, be generous!

On a 280X, -w 64 still generates HW errors; 128 is OK.

ghostlander

legendary

Activity: 1244

Merit: 1020

No surrender, no retreat, no regret.

Added support for work sizes of 64 and 128. The 1st one halves performance on Tahiti, the 2nd one adds +2% there in my case. YMMV

A single donation of 2179 DGB has been received so far. Come on miners, be generous!

MinersPoolEU

member

Activity: 96

Merit: 10

Quote from: Laketear on October 22, 2015, 02:55:54 AM

What coin do you use Myriad-Groestl to mine?

There are:
- Saffroncoin
- Digibyte
- Myriadcoin
