Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 192.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: zawawa on December 20, 2016, 10:00:52 PM

I just pushed the new version to GitHub:

https://github.com/zawawawa/gatelessgate

More speed enhancements are coming very soon. Enjoy!

Slow...

Eqminer with the -ca option: 225 Sol/s @ 60 W on the 1060 3GB (66% TDP)

You need to add another 100%

Here is the thread with the free and faster miner:

https://bitcointalksearch.org/topic/nicehash-eqm-zcash-nvidia-optimized-miner-maxwellpascal-cpu-mining-v104c-1677369

Subw

hero member

Activity: 672

Merit: 500

zawawa, can you post your BTC address, please?

ioglnx

sr. member

Activity: 574

Merit: 250

Fighting mob law and inquisition in this forum

Quote from: laik2 on December 21, 2016, 03:04:47 AM

Quote from: zawawa on December 20, 2016, 06:57:31 PM

Quote from: laik2 on December 19, 2016, 03:46:08 PM

Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

I just tested GG on your server. I think I already fixed the problem.
(The next version should be much more stable overall.)
I will push it to the repo in the next few hours, so you can check it yourself.

This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have the resources to support outdated drivers that are notoriously buggy. I will continue to support the AMDGPU-PRO drivers, though.

RX cards should work well with amdgpu-pro, but R9 cards are currently poorly supported...

But is it zawawa's fault, or his job, to fix an issue from AMD? He puts a lot of effort into getting you workarounds for something that is broken by design.
I am even surprised that he jumped in and put effort into this for fglrx.

Indeed, a good question would be how Optiminer or Claymore's Linux miner behaves with fglrx. Does it work?

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 20, 2016, 06:57:31 PM

Quote from: laik2 on December 19, 2016, 03:46:08 PM

Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

I just tested GG on your server. I think I already fixed the problem.
(The next version should be much more stable overall.)
I will push it to the repo in the next few hours, so you can check it yourself.

This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have the resources to support outdated drivers that are notoriously buggy. I will continue to support the AMDGPU-PRO drivers, though.

RX cards should work well with amdgpu-pro, but R9 cards are currently poorly supported...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I just pushed the new version to GitHub:

https://github.com/zawawawa/gatelessgate

More speed enhancements are coming very soon. Enjoy!

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: laik2 on December 19, 2016, 03:46:08 PM

Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

I just tested GG on your server. I think I already fixed the problem.
(The next version should be much more stable overall.)
I will push it to the repo in the next few hours, so you can check it yourself.

This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have the resources to support outdated drivers that are notoriously buggy. I will continue to support the AMDGPU-PRO drivers, though.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Alright peeps, the bug fixes for the next version are almost done.
I'm getting 164 Sol/s with the RX 480 and 128 Sol/s with the GTX 1060 3GB.
That should be good enough for now.
I will upload the new version tonight, US PST.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: krnlx on December 20, 2016, 10:28:01 AM

Quote from: zawawa on December 20, 2016, 10:06:05 AM

Quote from: krnlx on December 20, 2016, 09:56:35 AM

Quote from: zawawa on December 20, 2016, 06:40:31 AM

By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for better performance.
The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

I made a CUDA port of SA5 and saw no difference in performance versus OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig.

In OpenCL you can easily inline NVIDIA PTX asm, just like in CUDA.

Thank you so much for letting me know.
What I specifically had in mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

From the CUDA include files:

Code:

int __shfl(int var, int srcLane, int width) {
        int ret;
        int c = ((warpSize-width) << 8) | 0x1f;
        asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(srcLane), "r"(c));
        return ret;
}

Awesome!

krnlx

full member

Activity: 243

Merit: 105

Quote from: zawawa on December 20, 2016, 10:06:05 AM

Quote from: krnlx on December 20, 2016, 09:56:35 AM

Quote from: zawawa on December 20, 2016, 06:40:31 AM

By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for better performance.
The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

I made a CUDA port of SA5 and saw no difference in performance versus OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig.

In OpenCL you can easily inline NVIDIA PTX asm, just like in CUDA.

Thank you so much for letting me know.
What I specifically had in mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

From the CUDA include files:

Code:

int __shfl(int var, int srcLane, int width) {
        int ret;
        int c = ((warpSize-width) << 8) | 0x1f;
        asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(srcLane), "r"(c));
        return ret;
}
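To make the packed third operand in the wrapper above less opaque, here is a small host-side C++ sketch that emulates which lane each thread reads from. It follows my reading of the PTX description of shfl.idx (segment mask in bits 8 to 12 of c, clamp value in bits 0 to 4), so treat it as an illustration rather than a reference. The names shfl_control, shfl_source_lane, and emulate_shfl are hypothetical and exist only in this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // assumption: 32 lanes per warp, as on current NVIDIA GPUs

// Builds the packed "c" operand the same way the quoted wrapper does.
inline std::uint32_t shfl_control(int width) {
    // width is expected to be a power of two no larger than the warp size.
    assert(width > 0 && width <= kWarpSize && (width & (width - 1)) == 0);
    return (static_cast<std::uint32_t>(kWarpSize - width) << 8) | 0x1f;
}

// Returns the lane that `lane` reads from for shfl.idx with the given width.
inline int shfl_source_lane(int lane, int srcLane, int width) {
    const std::uint32_t c = shfl_control(width);
    const std::uint32_t segmask = (c >> 8) & 0x1f;  // selects the segment bits of the lane id
    const std::uint32_t clamp = c & 0x1f;           // upper bound inside a segment
    const std::uint32_t self = static_cast<std::uint32_t>(lane);
    const std::uint32_t want = static_cast<std::uint32_t>(srcLane) & 0x1f;

    const std::uint32_t minLane = self & segmask;                     // first lane of this segment
    const std::uint32_t maxLane = minLane | (clamp & ~segmask & 0x1f); // last lane of this segment
    std::uint32_t j = minLane | (want & ~segmask & 0x1f);             // srcLane wrapped into the segment
    if (j > maxLane) j = self;  // out-of-range sources fall back to the thread's own value
    return static_cast<int>(j);
}

// Emulates one __shfl call across a whole warp on the host.
inline std::array<int, kWarpSize> emulate_shfl(const std::array<int, kWarpSize>& vars,
                                               int srcLane, int width) {
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = vars[shfl_source_lane(lane, srcLane, width)];
    }
    return out;
}
```

With width=32 every lane reads from srcLane. With a smaller width the warp splits into independent segments, which is what makes the instruction attractive for exchanging data between threads without going through shared memory.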

maztheman

newbie

Activity: 9

Merit: 0

Quote from: zawawa on December 20, 2016, 10:06:05 AM

Quote from: krnlx on December 20, 2016, 09:56:35 AM

Quote from: zawawa on December 20, 2016, 06:40:31 AM

By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for better performance.
The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

I made a CUDA port of SA5 and saw no difference in performance versus OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig.

In OpenCL you can easily inline NVIDIA PTX asm, just like in CUDA.

Thank you so much for letting me know.
What I specifically had in mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

Yeah, you can, but check the compute capability needed; I think it's 3.0+.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: qwep1 on December 20, 2016, 09:12:55 AM

Where can I download the miner for Windows?

https://github.com/zawawawa/gatelessgate

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: krnlx on December 20, 2016, 09:56:35 AM

Quote from: zawawa on December 20, 2016, 06:40:31 AM

By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for better performance.
The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

I made a CUDA port of SA5 and saw no difference in performance versus OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig.

In OpenCL you can easily inline NVIDIA PTX asm, just like in CUDA.

Thank you so much for letting me know.
What I specifically had in mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

krnlx

full member

Activity: 243

Merit: 105

Quote from: zawawa on December 20, 2016, 06:40:31 AM

By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for better performance.
The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

I made a CUDA port of SA5 and saw no difference in performance versus OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig.

In OpenCL you can easily inline NVIDIA PTX asm, just like in CUDA.

qwep1

hero member

Activity: 610

Merit: 500

Where can I download the miner for Windows?

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for better performance.
The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: laik2 on December 19, 2016, 03:46:08 PM

Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

I am preparing the next point release today. PM me when your Linux servers are ready. I will do my best to make GG compatible with fglrx.

laik2

sr. member

Activity: 652

Merit: 266

Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

chown.multi

newbie

Activity: 28

Merit: 0

Quote from: zawawa on December 19, 2016, 01:06:29 PM

I just noticed three things:

(1) With NR_ROWS_LOG=14, Rounds 1 through 8 are actually much faster.
(2) However, kernel_sols() becomes a bottleneck because there are just too many slots in one row. This problem can be easily solved by using a different sorting algorithm for kernel_sols().
(3) Currently, NR_ROWS_LOG<14 is not possible because there is not enough space in shared memory. This problem can be partially solved by making NR_ROWS_LOG variable across rounds. Since less space in shared memory is required for caching slots at later rounds, NR_ROWS_LOG can be decreased accordingly.

This is it. I'm catching up with Claymore's and Eqminer.

That is good news, keep up the good work.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I just noticed three things:

(1) With NR_ROWS_LOG=14, Rounds 1 through 8 are actually much faster.
(2) However, kernel_sols() becomes a bottleneck because there are just too many slots in one row. This problem can be easily solved by using a different sorting algorithm for kernel_sols().
(3) Currently, NR_ROWS_LOG<14 is not possible because there is not enough space in shared memory. This problem can be partially solved by making NR_ROWS_LOG variable across rounds. Since less space in shared memory is required for caching slots at later rounds, NR_ROWS_LOG can be decreased accordingly.

This is it. I'm catching up with Claymore's and Eqminer.
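The trade-off in points (2) and (3) can be put into rough numbers. The sketch below is not taken from the GG source; it assumes the Zcash Equihash parameters n=200, k=9 (the thread never states them), which give 20 collision bits per round and 2^21 initial hashes, and it assumes the row index consumes NR_ROWS_LOG of those 20 bits, as in SILENTARMY-style solvers. The names RowLayout and row_layout are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed Equihash parameters (Zcash): n = 200, k = 9.
constexpr int kN = 200;
constexpr int kK = 9;
constexpr int kCollisionBits = kN / (kK + 1);          // 20 bits must collide per round
constexpr int kInitialHashesLog = kCollisionBits + 1;  // 2^21 initial hashes

struct RowLayout {
    std::uint32_t rows;               // number of rows in the hash table
    std::uint32_t avg_slots_per_row;  // average number of initial hashes landing in one row
    int bits_compared_in_row;         // collision bits still to be matched inside a row
};

// Computes the average table shape for a given NR_ROWS_LOG under the assumptions above.
inline RowLayout row_layout(int nr_rows_log) {
    assert(nr_rows_log >= 0 && nr_rows_log <= kCollisionBits);
    RowLayout layout{};
    layout.rows = 1u << nr_rows_log;
    layout.avg_slots_per_row = 1u << (kInitialHashesLog - nr_rows_log);
    layout.bits_compared_in_row = kCollisionBits - nr_rows_log;
    return layout;
}
```

Under these assumptions, NR_ROWS_LOG=14 gives 16384 rows with about 128 slots each on average, and NR_ROWS_LOG=12 gives 4096 rows with about 512 slots each. Fewer rows means more slots to cache per row, which fits the shared memory limit described in (3), and more slots to compare within a row, which fits the kernel_sols() bottleneck described in (2).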
