
Topic: New OpenCL Kernel for Myriad-Groestl (DGB, MYR, etc.) - page 2. (Read 37760 times)

newbie
Activity: 82
Merit: 0
The 7970 is doing fine with sgminer 5.1.1 at 35 MH/s, but my R9 290X is bonkers: only 40-42 MH/s, which is very low :/ Can someone help? Please send me a kernel or bin so I can make it work. I've tried ghostlander's myriad-groestl.cl; it's still 35 MH/s on the 7970 and 40-42 MH/s on the R9 290X :/ If someone needs video editing work, I can make it happen; private message me. Kind regards, Ivo Icevski
sr. member
Activity: 283
Merit: 250
I'm wondering if someone would compile this for Windows. It would be much appreciated!
hero member
Activity: 1008
Merit: 1000
Nice work. My 7950 went from 7 MH/s to 27 MH/s, but I think mining Skein will still earn you more with DGB.

On the technical side, that kind of efficiency improvement is simply amazing.
newbie
Activity: 57
Merit: 0
A Radeon HD7790 at 1200/1600 mining MYR got 17.5 MH/s on this kernel. That's twice as fast as the original.
sr. member
Activity: 241
Merit: 250


I don't understand why you'd be using a 290 to mine Myriad-Groestl on Linux. Wrong algo.
Huh?
I don't understand why you are posting that. It's a wrong or random answer...

AFAIK, Myriad-Groestl is only used by DGB and MYR.  And if you're gonna mine either of those coins with a 290 (I have half a dozen 290s), Skein is by far the better algo for that particular gpu.  See https://bitcointalksearch.org/topic/skein-pimp-profit-switching-dgbmyr-merged-mining-uis-inc-p2pool-nodes-1186670 for more details.




Trinity to http://coinspool.cu.cc/workers_trinity

https://bitcointalksearch.org/topic/ann-trinity-tty-sha256d-scrypt-myr-groest-trinamic-block-value-system-1186025
sr. member
Activity: 241
Merit: 250
What coin do you use Myriad-Groestl to mine?
There are:
- Saffroncoin
- Digibyte
- Myriadcoin


Trinitycoin
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Pallas is pretty good.

He bought an NVIDIA card and improved Neoscrypt by 10% in a couple of weeks.

CUDA: a foreign language, a foreign technology...

respect
member
Activity: 98
Merit: 10
Aren't they async by default in SGminer?
I mean async_work_group_copy & friends. I have very mixed feelings about them: on my hardware they just don't perform well, not even in terms of bandwidth usage. Maybe they work better on GCN1.1, IDK.
BTW, I would suggest to put a barrier(CLK_LOCAL_MEM_FENCE) right after the T-tables load. Besides, just have it looped!
Code:
for(ulong i = 0; i < 8; ++i) {
    local ulong *tdst = T_local + 256 * i; // not even really required if you alloc your local T tables in block
    // each work-item copies entries lid, lid + local_size, ... so the whole table is covered
    for(uint el = get_local_id(0); el < 256; el += get_local_size(0)) {
        tdst[el] = rotate(T_global[el], i * 8); // assumes 64-bit (ulong) T tables
    }
}
barrier(CLK_LOCAL_MEM_FENCE); // make the tables visible to the whole work-group
That's more or less what pallas suggests. It is a fully coherent read. LDS has full scatter/gather capability, as long as you end in a different bank you're safe (here, it happens by construction).
Alternatively, the first block can be loaded by async_work_group_copy and T1_L derived from T0_L; reading from LDS should give you a massive latency reduction by not having to round-trip to the (hot) L1 cache.
Loops such as this are fully unrolled in most cases.

Oh, how you're forcing an old man like me to retype the code and think :)
Never mind, and cheers ;)
member
Activity: 98
Merit: 10
instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Hi Pallas, can you share your .cl with me? I will send you some feeds.....
hero member
Activity: 672
Merit: 500
Aren't they async by default in SGminer?
I mean async_work_group_copy & friends. I have very mixed feelings about them: on my hardware they just don't perform well, not even in terms of bandwidth usage. Maybe they work better on GCN1.1, IDK.
BTW, I would suggest to put a barrier(CLK_LOCAL_MEM_FENCE) right after the T-tables load. Besides, just have it looped!
Code:
for(ulong i = 0; i < 8; ++i) {
    local ulong *tdst = T_local + 256 * i; // not even really required if you alloc your local T tables in block
    // each work-item copies entries lid, lid + local_size, ... so the whole table is covered
    for(uint el = get_local_id(0); el < 256; el += get_local_size(0)) {
        tdst[el] = rotate(T_global[el], i * 8); // assumes 64-bit (ulong) T tables
    }
}
barrier(CLK_LOCAL_MEM_FENCE); // make the tables visible to the whole work-group
That's more or less what pallas suggests. It is a fully coherent read. LDS has full scatter/gather capability, as long as you end in a different bank you're safe (here, it happens by construction).
Alternatively, the first block can be loaded by async_work_group_copy and T1_L derived from T0_L; reading from LDS should give you a massive latency reduction by not having to round-trip to the (hot) L1 cache.
Loops such as this are fully unrolled in most cases.
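The looped load can be sanity-checked on the host. What follows is only a minimal C sketch, not kernel code: it simulates a work-group of `local_size` work-items filling eight 256-entry tables from one base table, where table i is the base table rotated left by 8*i bits. The names `rotl64`, `fill_tables` and `tables_ok` are made up for illustration, and the base table contents are left to the caller (the real Groestl T0 constants are not reproduced here).

```c
#include <assert.h>
#include <stdint.h>

#define T_ENTRIES 256
#define T_COUNT   8

/* 64-bit rotate left; stands in for OpenCL rotate() on ulong. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

/* Host-side simulation of the cooperative load: each simulated work-item
 * (lid) writes entries lid, lid + local_size, ... of every table.
 * t_local must hold T_COUNT * T_ENTRIES values, t_base T_ENTRIES values. */
static void fill_tables(uint64_t *t_local, const uint64_t *t_base,
                        unsigned local_size)
{
    assert(local_size > 0 && local_size <= T_ENTRIES);
    for (unsigned lid = 0; lid < local_size; ++lid)
        for (unsigned i = 0; i < T_COUNT; ++i)
            for (unsigned el = lid; el < T_ENTRIES; el += local_size)
                t_local[T_ENTRIES * i + el] = rotl64(t_base[el], 8u * i);
}

/* Returns 1 if every entry of every table holds the expected rotation. */
static int tables_ok(const uint64_t *t_local, const uint64_t *t_base)
{
    for (unsigned i = 0; i < T_COUNT; ++i)
        for (unsigned el = 0; el < T_ENTRIES; ++el)
            if (t_local[T_ENTRIES * i + el] != rotl64(t_base[el], 8u * i))
                return 0;
    return 1;
}
```

Calling `fill_tables` with a work size of 64, 128 or 256 and then `tables_ok` should report full coverage in each case, which is the point of striding by the work size instead of hard-coding offsets.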
legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
Thanks to those who have donated.

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.

do you see the repeated instructions?
just change the "if" structure and you can remove them ;-)
i.e. if worksize <= 128 you need to do some additional rotates compared to default (256), some additional others if worksize == 64....
although the alternative for loop is a much more elegant solution and the difference in speed is negligible.

What you suggest results in fewer linear memory writes, which usually isn't good. I prefer to avoid loops if possible.

Code:
    T0_L[lclid] = T0[lclid];
    T1_L[lclid] = rotate(T0[lclid], 8UL);
    T2_L[lclid] = rotate(T0[lclid], 16UL);
    T3_L[lclid] = rotate(T0[lclid], 24UL);
    T4_L[lclid] = rotate(T0[lclid], 32UL);
    T5_L[lclid] = rotate(T0[lclid], 40UL);
    T6_L[lclid] = rotate(T0[lclid], 48UL);
    T7_L[lclid] = rotate(T0[lclid], 56UL);
#if (WORKSIZE < 256)
    T0_L[lclid + 128] = T0[lclid + 128];
    T1_L[lclid + 128] = rotate(T0[lclid + 128], 8UL);
    T2_L[lclid + 128] = rotate(T0[lclid + 128], 16UL);
    T3_L[lclid + 128] = rotate(T0[lclid + 128], 24UL);
    T4_L[lclid + 128] = rotate(T0[lclid + 128], 32UL);
    T5_L[lclid + 128] = rotate(T0[lclid + 128], 40UL);
    T6_L[lclid + 128] = rotate(T0[lclid + 128], 48UL);
    T7_L[lclid + 128] = rotate(T0[lclid + 128], 56UL);
#endif
#if (WORKSIZE < 128)
    T0_L[lclid + 64] = T0[lclid + 64];
    T0_L[lclid + 192] = T0[lclid + 192];
    T1_L[lclid + 64] = rotate(T0[lclid + 64], 8UL);
    T1_L[lclid + 192] = rotate(T0[lclid + 192], 8UL);
    T2_L[lclid + 64] = rotate(T0[lclid + 64], 16UL);
    T2_L[lclid + 192] = rotate(T0[lclid + 192], 16UL);
    T3_L[lclid + 64] = rotate(T0[lclid + 64], 24UL);
    T3_L[lclid + 192] = rotate(T0[lclid + 192], 24UL);
    T4_L[lclid + 64] = rotate(T0[lclid + 64], 32UL);
    T4_L[lclid + 192] = rotate(T0[lclid + 192], 32UL);
    T5_L[lclid + 64] = rotate(T0[lclid + 64], 40UL);
    T5_L[lclid + 192] = rotate(T0[lclid + 192], 40UL);
    T6_L[lclid + 64] = rotate(T0[lclid + 64], 48UL);
    T6_L[lclid + 192] = rotate(T0[lclid + 192], 48UL);
    T7_L[lclid + 64] = rotate(T0[lclid + 64], 56UL);
    T7_L[lclid + 192] = rotate(T0[lclid + 192], 56UL);
#endif
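To see why the cumulative `#if` blocks above fill the tables for work sizes of 64, 128 and 256, here is a small host-side C sketch. It is an illustration only: the preprocessor conditions are modelled as run-time checks, and `slots_written` is a hypothetical helper that counts how many of the 256 slots of one table get written.

```c
#define T_ENTRIES 256

/* Marks which of the 256 table slots are written for a given work size,
 * mirroring the cumulative conditions of the unrolled code:
 *   always:          lclid
 *   worksize < 256:  lclid + 128
 *   worksize < 128:  lclid + 64 and lclid + 192
 * Returns the number of distinct slots written. */
static unsigned slots_written(unsigned worksize,
                              unsigned char written[T_ENTRIES])
{
    unsigned count = 0;
    for (unsigned k = 0; k < T_ENTRIES; ++k)
        written[k] = 0;

    for (unsigned lclid = 0; lclid < worksize; ++lclid) {
        unsigned offsets[4];
        unsigned n = 0;

        offsets[n++] = 0;
        if (worksize < 256)
            offsets[n++] = 128;
        if (worksize < 128) {
            offsets[n++] = 64;
            offsets[n++] = 192;
        }

        for (unsigned j = 0; j < n; ++j) {
            unsigned slot = lclid + offsets[j];
            if (slot < T_ENTRIES && !written[slot]) {
                written[slot] = 1;
                ++count;
            }
        }
    }
    return count;
}
```

For 64, 128 and 256 the count comes out at the full 256 slots. A work size of 32 would leave half of each table unwritten, so this layout only suits the three sizes the code handles.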
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Thanks to those who have donated.

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.

do you see the repeated instructions?
just change the "if" structure and you can remove them ;-)
i.e. if worksize <= 128 you need to do some additional rotates compared to default (256), some additional others if worksize == 64....
although the alternative for loop is a much more elegant solution and the difference in speed is negligible.
legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
Thanks to those who have donated.

instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.

Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.

What is your experience with async block reads?

Aren't they async by default in SGminer?
hero member
Activity: 672
Merit: 500
What is your experience with async block reads?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions.
or, better, make an unrolled loop.
that way it's much more compact and easier to debug.
legendary
Activity: 966
Merit: 1001
Thanks a lot for your work.
newbie
Activity: 33
Merit: 0
Thanks for your work; I will send today's revenue in DGB to you. I set the memclock to 150 MHz whenever possible, undervolt the core, and enjoy the power savings.
member
Activity: 98
Merit: 10
Added support for work sizes of 64 and 128. The 1st one halves performance on Tahiti, the 2nd one adds +2% there in my case. YMMV

A single donation of 2179 DGB received by this moment. Come on miners, be generous!


On a 280X, -w 64 still generates HW errors; 128 is OK.
legendary
Activity: 1242
Merit: 1020
No surrender, no retreat, no regret.
Added support for work sizes of 64 and 128. The 1st one halves performance on Tahiti, the 2nd one adds +2% there in my case. YMMV

A single donation of 2179 DGB received by this moment. Come on miners, be generous!
member
Activity: 96
Merit: 10
What coin do you use Myriad-Groestl to mine?
There are:
- Saffroncoin
- Digibyte
- Myriadcoin