Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 192. (Read 214410 times)

sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
I just pushed the new version to GitHub:

https://github.com/zawawawa/gatelessgate

More speed enhancements are coming very soon. Enjoy!

Slow...

Eqminer with the -ca option does 225 Sol/s @ 60 W on the 1060 3GB (66% TDP)

You need to add another 100%

Here is the thread with the free and faster miner:

https://bitcointalksearch.org/topic/nicehash-eqm-zcash-nvidia-optimized-miner-maxwellpascal-cpu-mining-v104c-1677369
hero member
Activity: 672
Merit: 500
zawawa can you post your BTC address pls
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
Updated to latest crimson fglrx drivers and...still no go, amdgpu-pro has a bug and is not working well on R9 series...

I just tested GG on your server. I think I already fixed the problem.
(The next version should be much more stable overall.)
I will push it to the repo in the next few hours, so you can check it yourself.


This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have resources to support outdated drivers that are known to be notoriously buggy. I continue to support AMDPRO drivers, though.
RX cards should work well with amdgpu-pro but R9 are currently poorly supported...

But is it zawawa's fault, or his job, to fix an issue caused by AMD? He is putting a lot of effort into getting you workarounds for something that is broken by design.
I am even surprised he jumped in and put effort into supporting fglrx at all.

A good question would be how optiminer or Claymore's Linux miner behaves with fglrx. Does it work?
sr. member
Activity: 652
Merit: 266
Updated to latest crimson fglrx drivers and...still no go, amdgpu-pro has a bug and is not working well on R9 series...

I just tested GG on your server. I think I already fixed the problem.
(The next version should be much more stable overall.)
I will push it to the repo in the next few hours, so you can check it yourself.


This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have resources to support outdated drivers that are known to be notoriously buggy. I continue to support AMDPRO drivers, though.
RX cards should work well with amdgpu-pro but R9 are currently poorly supported...
sr. member
Activity: 728
Merit: 304
Miner Developer
I just pushed the new version to GitHub:

https://github.com/zawawawa/gatelessgate

More speed enhancements are coming very soon. Enjoy!
sr. member
Activity: 728
Merit: 304
Miner Developer
Updated to latest crimson fglrx drivers and...still no go, amdgpu-pro has a bug and is not working well on R9 series...

I just tested GG on your server. I think I already fixed the problem.
(The next version should be much more stable overall.)
I will push it to the repo in the next few hours, so you can check it yourself.


This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have resources to support outdated drivers that are known to be notoriously buggy. I continue to support AMDPRO drivers, though.
sr. member
Activity: 728
Merit: 304
Miner Developer
Alright peeps, bug fixes for the next version are almost done.
I'm getting 164 sol/s with RX 480 and 128 sol/s with GTX 1060 3GB.
That should be good enough for now.
I will upload the new version tonight, US PST.
sr. member
Activity: 728
Merit: 304
Miner Developer
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for a better performance.
The miner is already running 40% faster on GTX 1060, but I need an extra boost to catch up with Eqminer.

I made cuda port of SA5, no difference in performance vs opencl+nvidia cpu load fix. The only one thing that cannot be implemented in opencl is cudaDeviceSetCacheConfig.

In openCL you can inline nvidia ptx asm easy, like in cuda.

Thank you so much for letting me know.
What I specifically had in my mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

From cuda include files:
Code:
int __shfl(int var, int srcLane, int width) {
        int ret;
        int c = ((warpSize-width) << 8) | 0x1f;
        asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(srcLane), "r"(c));
        return ret;
}



Awesome!
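
For anyone wondering what the "c" operand in the quoted wrapper encodes: as far as I understand the PTX ISA description of shfl, bits 12..8 are a segment mask and bits 4..0 are a clamp value. Below is a minimal host-side sketch in plain C that models which lane shfl.idx would read from. It is not GPU code, and the helper names (shfl_c_operand, shfl_idx_source_lane) are made up for illustration; only the bit layout and the formula for c come from the snippet and the PTX documentation.

```c
#include <assert.h>

#define WARP_SIZE 32u

/* Pack the "c" operand the same way the quoted __shfl() wrapper does.
   width must be a power of two between 1 and 32. */
static unsigned shfl_c_operand(unsigned width)
{
    assert(width >= 1 && width <= WARP_SIZE && (width & (width - 1)) == 0);
    return ((WARP_SIZE - width) << 8) | 0x1f;
}

/* Host-side model (an illustration, not the hardware) of the lane that
   shfl.idx reads from, following the pseudocode in the PTX ISA manual. */
static unsigned shfl_idx_source_lane(unsigned lane, unsigned src_lane, unsigned c)
{
    unsigned bval     = src_lane & 0x1f;      /* requested lane, 5 bits   */
    unsigned cval     = c & 0x1f;             /* clamp value, bits 4..0   */
    unsigned segmask  = (c >> 8) & 0x1f;      /* segment mask, bits 12..8 */
    unsigned min_lane = lane & segmask;       /* first lane of my segment */
    unsigned max_lane = min_lane | (cval & ~segmask);
    unsigned j        = min_lane | (bval & ~segmask);

    /* An out-of-range source makes the lane read its own value. */
    return (j <= max_lane) ? j : lane;
}
```

For example, with width 8, lane 13 asking for srcLane 3 reads from lane 11, which is index 3 inside its own 8-lane segment. With width 32 the segment mask is zero and every lane simply reads from srcLane.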
full member
Activity: 243
Merit: 105
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for a better performance.
The miner is already running 40% faster on GTX 1060, but I need an extra boost to catch up with Eqminer.

I made cuda port of SA5, no difference in performance vs opencl+nvidia cpu load fix. The only one thing that cannot be implemented in opencl is cudaDeviceSetCacheConfig.

In openCL you can inline nvidia ptx asm easy, like in cuda.

Thank you so much for letting me know.
What I specifically had in my mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

From cuda include files:
Code:
int __shfl(int var, int srcLane, int width) {
        int ret;
        int c = ((warpSize-width) << 8) | 0x1f;
        asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(srcLane), "r"(c));
        return ret;
}

newbie
Activity: 9
Merit: 0
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for a better performance.
The miner is already running 40% faster on GTX 1060, but I need an extra boost to catch up with Eqminer.

I made cuda port of SA5, no difference in performance vs opencl+nvidia cpu load fix. The only one thing that cannot be implemented in opencl is cudaDeviceSetCacheConfig.

In openCL you can inline nvidia ptx asm easy, like in cuda.

Thank you so much for letting me know.
What I specifically had in my mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

Yeah, you can, but check the compute capability needed; I think it's compute 3.0+.
sr. member
Activity: 728
Merit: 304
Miner Developer
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for a better performance.
The miner is already running 40% faster on GTX 1060, but I need an extra boost to catch up with Eqminer.

I made cuda port of SA5, no difference in performance vs opencl+nvidia cpu load fix. The only one thing that cannot be implemented in opencl is cudaDeviceSetCacheConfig.

In openCL you can inline nvidia ptx asm easy, like in cuda.

Thank you so much for letting me know.
What I specifically had in my mind was "shfl."
If that instruction can be exposed through inline PTX, I can save a considerable amount of time.
full member
Activity: 243
Merit: 105
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for a better performance.
The miner is already running 40% faster on GTX 1060, but I need an extra boost to catch up with Eqminer.

I made cuda port of SA5, no difference in performance vs opencl+nvidia cpu load fix. The only one thing that cannot be implemented in opencl is cudaDeviceSetCacheConfig.

In openCL you can inline nvidia ptx asm easy, like in cuda.
hero member
Activity: 610
Merit: 500
Where can I download the miner for Windows?
sr. member
Activity: 728
Merit: 304
Miner Developer
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.)
I am also thinking about rewriting GG in CUDA for a better performance.
The miner is already running 40% faster on GTX 1060, but I need an extra boost to catch up with Eqminer.
sr. member
Activity: 728
Merit: 304
Miner Developer
Updated to latest crimson fglrx drivers and...still no go, amdgpu-pro has a bug and is not working well on R9 series...

I am preparing the next point release today. PM me when your Linux servers are ready. I will do my best to make GG compatible with fglrx.
sr. member
Activity: 652
Merit: 266
Updated to latest crimson fglrx drivers and...still no go, amdgpu-pro has a bug and is not working well on R9 series...
newbie
Activity: 28
Merit: 0
I just noticed three things:

(1) With NR_ROWS_LOG=14, Rounds 1 through 8 are actually much faster.
(2) However, kernel_sols() becomes a bottleneck because there are just too many slots in one row. This problem can be easily solved by using a different sorting algorithm for kernel_sols().
(3) Currently, NR_ROWS_LOG<14 is not possible because there is not enough space in shared memory. This problem can be partially solved by making NR_ROWS_LOG variable across rounds. Since less space in shared memory is required for caching slots at later rounds, NR_ROWS_LOG can be decreased accordingly.

This is it. I'm catching up with Claymore's and Eqminer.

That is good news, keep up the good work.
sr. member
Activity: 728
Merit: 304
Miner Developer
I just noticed three things:

(1) With NR_ROWS_LOG=14, Rounds 1 through 8 are actually much faster.
(2) However, kernel_sols() becomes a bottleneck because there are just too many slots in one row. This problem can be easily solved by using a different sorting algorithm for kernel_sols().
(3) Currently, NR_ROWS_LOG<14 is not possible because there is not enough space in shared memory. This problem can be partially solved by making NR_ROWS_LOG variable across rounds. Since less space in shared memory is required for caching slots at later rounds, NR_ROWS_LOG can be decreased accordingly.

This is it. I'm catching up with Claymore's and Eqminer.
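
To put rough numbers on points (2) and (3), here is a back-of-the-envelope sketch in C. It assumes the miner targets Zcash's Equihash parameters (n=200, k=9), which give 2^21 initial strings, and that those strings spread evenly over 2^NR_ROWS_LOG rows in the first round. The function names and the bytes-per-cached-slot parameter are hypothetical and are not taken from the Gateless Gate source; a real row also needs headroom above the average.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Equihash parameters (Zcash): n = 200, k = 9. */
#define EQUIHASH_N 200
#define EQUIHASH_K 9

/* Number of initial strings: 2^(n/(k+1) + 1) = 2^21. */
static uint32_t initial_strings(void)
{
    return 1u << (EQUIHASH_N / (EQUIHASH_K + 1) + 1);
}

/* Average number of slots that land in one row in the first round. */
static uint32_t avg_slots_per_row(unsigned nr_rows_log)
{
    assert(nr_rows_log <= 21);
    return initial_strings() >> nr_rows_log;
}

/* Hypothetical estimate of the shared memory needed to cache one
   average row; bytes_per_cached_slot is an assumed figure. */
static uint32_t row_cache_bytes(unsigned nr_rows_log,
                                uint32_t bytes_per_cached_slot)
{
    return avg_slots_per_row(nr_rows_log) * bytes_per_cached_slot;
}
```

Under these assumptions NR_ROWS_LOG=14 averages 128 slots per row, 13 gives 256 and 12 gives 512, so each step down doubles both the per-row work in kernel_sols() and the shared memory needed to cache a row.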