Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 193.

kilo17

legendary

Activity: 980

Merit: 1001

aka "whocares"

Not sure if you can help out with this or not. I am trying out the miner on Ubuntu 16.10 with the 4.9 kernel and the open source drivers.

I changed the OpenCL location in the Makefile but got results similar to Eliovp's:

Code:

echo 'const char *ocl_code = R"_mrb_(' >_kernel.h
cpp input.cl >>_kernel.h
echo ')_mrb_";' >>_kernel.h
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include"  -c -o main.o main.c
main.c: In function ‘examine_ht’:
main.c:534:26: warning: unused parameter ‘round’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                          ^~~~~
main.c:534:50: warning: unused parameter ‘queue’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                  ^~~~~
main.c:534:65: warning: unused parameter ‘hash_table_buffers’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                                 ^~~~~~~~~~~~~~~~~~
main.c:534:92: warning: unused parameter ‘row_counters_buffer’ [-Wunused-parameter]
 d round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                                     ^~~~~~~~~~~~~~~~~~~
main.c: In function ‘store_encoded_sol’:
main.c:640:25: warning: left shift of negative value [-Wshift-negative-value]
    uint32_t mask = ~(-1 << (8 - x_bits_used));
                         ^~
main.c: In function ‘solve_equihash’:
main.c:958:57: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 uint32_t solve_equihash(cl_device_id dev_id, cl_context ctx, cl_command_queue queue,
                                                         ^~~
main.c: In function ‘mining_mode’:
main.c:1408:18: warning: unused variable ‘status’ [-Wunused-variable]
  cl_int          status;
                  ^~~~~~
main.c:1393:50: warning: unused parameter ‘program’ [-Wunused-parameter]
 void mining_mode(cl_device_id dev_id, cl_program program, cl_context ctx, cl_command_queue queue,
                                                  ^~~~~~~
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include"  -c -o blake.o blake.c
blake.c:26:25: warning: ‘blake2b_block_len’ defined but not used [-Wunused-const-variable=]
 static const uint32_t   blake2b_block_len = 128;
                         ^~~~~~~~~~~~~~~~~
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include"  -c -o sha256.o sha256.c
gcc -o sa-solver main.o blake.o sha256.o -rdynamic -L"/usr/lib/x86_64-linux-gnu" -lOpenCL

and then:

Code:

kilo17@kilo-GT7:~/gatelessgate-master$ ./gatelessgate.py -c stratum+tcp://us1-zcash.flypool.org:3333 -u t1cVviFvgJinQ4w3C2m2CfRxgP5DnHYaoFC
Gateless Gate, a Zcash miner
Copyright 2016 zawawa @ bitcointalk.org
Connecting to us1-zcash.flypool.org:3333
Stratum server sent us the first job
Mining on 1 device
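
A note on the -Wshift-negative-value warning at main.c:640 in the build log above: left-shifting a negative signed value such as -1 is undefined behavior in C99 and later, which is why gcc flags it. A minimal sketch of the usual workaround is to shift an unsigned all-ones value instead. The helper name low_bits_mask is hypothetical and not part of the miner's source; it only illustrates the idiom for the bit widths that expression uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a mask with the low n_bits bits set,
   for 0 <= n_bits <= 8. Shifting an unsigned all-ones value gives the
   same bit pattern as ~(-1 << n_bits) without shifting a negative int. */
uint32_t low_bits_mask(unsigned n_bits)
{
    assert(n_bits <= 8);
    return ~(~UINT32_C(0) << n_bits);
}
```

Under that assumption, the flagged line could be written as `uint32_t mask = ~(~0u << (8 - x_bits_used));`, which keeps the resulting mask the same and should silence the warning.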

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

This should be pretty useful once I get down to the GCN assembly:

https://community.amd.com/thread/165710

zawawa

Quote from: jiggytom on December 18, 2016, 02:29:13 PM

Quote from: zawawa on December 18, 2016, 12:41:37 PM

I think I figured out how to coalesce global memory reads.
(Memory writes cannot be coalesced because the destination of each slot is not predictable.)
It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently.
If everything works out, there should be a massive speedup, hehehe...

Great news! Does that include CUDA also?

Definitely for NVIDIA, and potentially for AMD. I already have some positive results, but I still need to reduce the overhead and bring NR_ROWS_LOG down to 13. More work, more work...

jiggytom

legendary

Activity: 1068

Merit: 1020

Quote from: zawawa on December 18, 2016, 12:41:37 PM

I think I figured out how to coalesce global memory reads.
(Memory writes cannot be coalesced because the destination of each slot is not predictable.)
It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently.
If everything works out, there should be a massive speedup, hehehe...

Great news! Does that include CUDA also?

zawawa

I think I figured out how to coalesce global memory reads.
(Memory writes cannot be coalesced because the destination of each slot is not predictable.)
It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently.
If everything works out, there should be a massive speedup, hehehe...
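
To make the coalescing idea above concrete, here is a toy host-side model in plain C. It is not taken from Gateless Gate or SA. It assumes 32 threads issue their loads together and that memory is fetched in 128-byte segments; both numbers are illustrative assumptions, as are the function names. It counts how many segments one group of threads touches: the fewer segments, the better the reads coalesce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE     32   /* assumed number of threads issuing loads together */
#define SEGMENT_BYTES 128  /* assumed size of one memory transaction */

/* Counts the distinct SEGMENT_BYTES-sized segments touched by one warp.
   addrs[i] is the byte address read by thread i. */
unsigned count_segments(const uint64_t addrs[WARP_SIZE])
{
    uint64_t seen[WARP_SIZE];
    unsigned n_seen = 0;

    for (size_t i = 0; i < WARP_SIZE; i++) {
        uint64_t seg = addrs[i] / SEGMENT_BYTES;
        unsigned j = 0;
        while (j < n_seen && seen[j] != seg)
            j++;
        if (j == n_seen)          /* first time this segment is touched */
            seen[n_seen++] = seg;
    }
    return n_seen;
}

/* Fills addrs so that thread i reads at base + i * stride_bytes. */
void fill_strided(uint64_t addrs[WARP_SIZE], uint64_t base, uint64_t stride_bytes)
{
    assert(addrs != NULL);
    for (size_t i = 0; i < WARP_SIZE; i++)
        addrs[i] = base + i * stride_bytes;
}
```

In this model, adjacent threads reading adjacent 4-byte words (stride 4) touch a single segment, while threads that each read from a different 128-byte-aligned location (stride 128) touch 32 segments. Writes whose destinations are unpredictable behave like the second case, which matches the remark that only the reads can be coalesced.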

zawawa

Quote from: laik2 on December 18, 2016, 05:17:30 AM

Quote from: zawawa on December 18, 2016, 02:54:16 AM

I just pushed a workaround for fglrx to the repo.
Let me know if that works.

Still the same... The login and access credentials are the same as for the 14.04 machine; you can check it out whenever you want.

EDIT: There is also something that keeps bothering me. On 16.04, clinfo recognized only 14 CUs out of 40 and the miner was doing 180 S/s; with fglrx, all CUs were recognized correctly and the hash speed was the same. Do you think this could be due to not fully utilizing the CUs on the chip?

Got it, thanks! Let me get to that when I'm done with GTX 1060.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 18, 2016, 02:54:16 AM

I just pushed a workaround for fglrx to the repo.
Let me know if that works.

Still the same... The login and access credentials are the same as for the 14.04 machine; you can check it out whenever you want.

EDIT: There is also something that keeps bothering me. On 16.04, clinfo recognized only 14 CUs out of 40 and the miner was doing 180 S/s; with fglrx, all CUs were recognized correctly and the hash speed was the same. Do you think this could be due to not fully utilizing the CUs on the chip?

zawawa

I just pushed a workaround for fglrx to the repo.
Let me know if that works.

zawawa

Quote from: nerdralph on December 17, 2016, 09:19:29 AM

If that is the issue, then it could be solved by synchronizing the kernel so all CUs are reading at the same time (copying slots to the LDS), then they all write at the same time.

One more reason to use a GCN assembler, then. How exciting!

zawawa

Quote from: reb0rn21 on December 17, 2016, 08:24:25 PM

For NVIDIA you must solve the specific memory access pattern. I don't know if anyone has managed to get fully coalesced memory transactions, which should provide max performance...

At the moment the NiceHash miner is doing 215-250 sol/s on the 1060 6GB (the 3GB version does less because the GPU is crippled).

Thanks a lot for the heads up. I figured out a way to rearrange elements in the hash tables efficiently, so I can run a ton of experiments to see which access pattern would result in the best performance.

reb0rn21

legendary

Activity: 1901

Merit: 1024

For NVIDIA you must solve the specific memory access pattern. I don't know if anyone has managed to get fully coalesced memory transactions, which should provide max performance...

At the moment the NiceHash miner is doing 215-250 sol/s on the 1060 6GB (the 3GB version does less because the GPU is crippled).

Kompik

sr. member

Activity: 463

Merit: 250

Quote from: zawawa on December 17, 2016, 07:03:52 PM

I'm getting 90 sol/s with GTX 1060... So, I can catch up with Eqminer if I reach 180 sol/s. I see.

Great!!

Looking forward to 200 sols on the 1060 on your miner!!

m0niker

newbie

Activity: 39

Merit: 0

Let me know if you need anyone with Windows boxes to test. I have a few 480s and 7970s and Windows 7/10 around, and would be glad to help with testing. Thanks for doing this open source!

ioglnx

sr. member

Activity: 574

Merit: 250

Fighting mob law and inquisition in this forum

Good luck bro^^
But you should overrun them..and beat :-D

zawawa

I'm getting 128 sol/s after 10 min of tweaking.
This whole thing doesn't look hard at all...

zawawa

I'm getting 90 sol/s with GTX 1060... So, I can catch up with Eqminer if I reach 180 sol/s. I see.

zawawa

Quote from: reflexmk on December 17, 2016, 06:49:48 PM

Please do include an executable of the miner for those of us who don't know how to compile from source. Thanks.

I will do that with each point release. Both for AMD and NVIDIA now. No worries.

reflexmk

sr. member

Activity: 289

Merit: 250

Please do include an executable of the miner for those of us who don't know how to compile from source. Thanks.

zawawa

Quote from: laik2 on December 17, 2016, 05:24:12 PM

Quote from: zawawa on December 17, 2016, 05:07:10 PM

I know, I know... The newly added complexity actually bothered me quite a bit and I feel bad about making you go through it, but it was necessary to ensure the correctness of the code and maximize LDS usage and thus occupancy. I feel like I have exhausted all the means of optimization at the OpenCL level except for an automatic optimizer as far as RX 480 is concerned. Once I'm done with an on-the-fly optimizer, I will delve into the GCN assembly. I have been experimenting with global synchronization with some pretty interesting results.

As for Tonga and Hawaii, I used to own a whole bunch of them, but I sold them all... I'm thinking about getting a used Nano for testing purposes.

By the way, a new GTX 1060 finally arrived, so I can optimize the miner for NVIDIA cards as well. Good stuff.

Commits 10 and 11 are reporting 0 S/s on an R9 390 under 14.04 with fglrx, although GPU usage is 100%.

Well, it's fglrx... I don't think the kernel even successfully builds with it. I will implement a workaround.

laik2

Quote from: zawawa on December 17, 2016, 05:07:10 PM

I know, I know... The newly added complexity actually bothered me quite a bit and I feel bad about making you go through it, but it was necessary to ensure the correctness of the code and maximize LDS usage and thus occupancy. I feel like I have exhausted all the means of optimization at the OpenCL level except for an automatic optimizer as far as RX 480 is concerned. Once I'm done with an on-the-fly optimizer, I will delve into the GCN assembly. I have been experimenting with global synchronization with some pretty interesting results.

As for Tonga and Hawaii, I used to own a whole bunch of them, but I sold them all... I'm thinking about getting a used Nano for testing purposes.

By the way, a new GTX 1060 finally arrived, so I can optimize the miner for NVIDIA cards as well. Good stuff.

Commits 10 and 11 are reporting 0 S/s on an R9 390 under 14.04 with fglrx, although GPU usage is 100%.

Quote

Gateless Gate, a Zcash miner
Copyright 2016 zawawa @ bitcointalk.org
Connecting to eu1-zcash.flypool.org:3333
Solver 0.0: launching
Successfully connected to eu1-zcash.flypool.org:3333
Received target 0020c49ba5e353f7ced916872b020c49ba5e353f7ced916872b020c49ba5e353
Received job "a50e8e46b67264ee610b"
Stratum server sent us the first job
Mining on 1 device
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
