Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 193. (Read 214456 times)

legendary
Activity: 980
Merit: 1001
aka "whocares"
Not sure if you can help out on this or not. I am trying out the miner on Ubuntu 16.10 with the 4.9 kernel, open source drivers, etc.

I changed the OpenCL location in the Makefile but had similar results to Eliovp:

Code:
echo 'const char *ocl_code = R"_mrb_(' >_kernel.h
cpp input.cl >>_kernel.h
echo ')_mrb_";' >>_kernel.h
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include"  -c -o main.o main.c
main.c: In function ‘examine_ht’:
main.c:534:26: warning: unused parameter ‘round’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                          ^~~~~
main.c:534:50: warning: unused parameter ‘queue’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                  ^~~~~
main.c:534:65: warning: unused parameter ‘hash_table_buffers’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                                 ^~~~~~~~~~~~~~~~~~
main.c:534:92: warning: unused parameter ‘row_counters_buffer’ [-Wunused-parameter]
 d round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                                     ^~~~~~~~~~~~~~~~~~~
main.c: In function ‘store_encoded_sol’:
main.c:640:25: warning: left shift of negative value [-Wshift-negative-value]
    uint32_t mask = ~(-1 << (8 - x_bits_used));
                         ^~
main.c: In function ‘solve_equihash’:
main.c:958:57: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 uint32_t solve_equihash(cl_device_id dev_id, cl_context ctx, cl_command_queue queue,
                                                         ^~~
main.c: In function ‘mining_mode’:
main.c:1408:18: warning: unused variable ‘status’ [-Wunused-variable]
  cl_int          status;
                  ^~~~~~
main.c:1393:50: warning: unused parameter ‘program’ [-Wunused-parameter]
 void mining_mode(cl_device_id dev_id, cl_program program, cl_context ctx, cl_command_queue queue,
                                                  ^~~~~~~
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include"  -c -o blake.o blake.c
blake.c:26:25: warning: ‘blake2b_block_len’ defined but not used [-Wunused-const-variable=]
 static const uint32_t   blake2b_block_len = 128;
                         ^~~~~~~~~~~~~~~~~
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include"  -c -o sha256.o sha256.c
gcc -o sa-solver main.o blake.o sha256.o -rdynamic -L"/usr/lib/x86_64-linux-gnu" -lOpenCL
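
The -Wshift-negative-value warning at main.c:640 comes from left-shifting the signed constant -1. A minimal sketch of one way to build the same mask from an unsigned value, assuming the shift count (8 - x_bits_used) stays in the range 0 to 8. The helper name low_bits_mask is hypothetical and not part of the miner's source:

```c
#include <stdint.h>

/* Hypothetical helper (not in main.c): returns a mask with the low
   `nbits` bits set, for 0 <= nbits <= 8 (assumed range).
   Shifting an unsigned 1 instead of a signed -1 avoids the
   "left shift of negative value" warning. */
static inline uint32_t low_bits_mask(unsigned nbits)
{
    return (UINT32_C(1) << nbits) - 1u;
}
```

With this helper, the flagged line could read `uint32_t mask = low_bits_mask(8 - x_bits_used);`, which should produce the same bit pattern as the original expression on two's complement machines.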



and then:

Code:
kilo17@kilo-GT7:~/gatelessgate-master$ ./gatelessgate.py -c stratum+tcp://us1-zcash.flypool.org:3333 -u t1cVviFvgJinQ4w3C2m2CfRxgP5DnHYaoFC
Gateless Gate, a Zcash miner
Copyright 2016 zawawa @ bitcointalk.org
Connecting to us1-zcash.flypool.org:3333
Stratum server sent us the first job
Mining on 1 device
sr. member
Activity: 728
Merit: 304
Miner Developer
This should be pretty useful once I get down to the GCN assembly:

https://community.amd.com/thread/165710
sr. member
Activity: 728
Merit: 304
Miner Developer
I think I figured out how to coalesce global memory reads.
(Memory writes cannot be coalesced because the destination of each slot is not predictable.)
It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently.
If everything works out, there should be a massive speedup, hehehe...

Great news!  Does that include CUDA also?

Definitely for NVIDIA, and potentially for AMD. I already have some positive results, but I still need to reduce the overhead and bring NR_ROWS_LOG down to 13. More work, more work...
legendary
Activity: 1068
Merit: 1020
I think I figured out how to coalesce global memory reads.
(Memory writes cannot be coalesced because the destination of each slot is not predictable.)
It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently.
If everything works out, there should be a massive speedup, hehehe...

Great news!  Does that include CUDA also?
sr. member
Activity: 728
Merit: 304
Miner Developer
I think I figured out how to coalesce global memory reads.
(Memory writes cannot be coalesced because the destination of each slot is not predictable.)
It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently.
If everything works out, there should be a massive speedup, hehehe...
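
To illustrate what coalescing means here, a rough host-side sketch in C (not Gateless Gate code) that counts how many memory segments a group of threads touches for a given access pattern. The group size of 32 threads, the 128-byte segment, and the 4-byte word are illustrative assumptions, and segments_touched is a hypothetical name:

```c
#include <stdint.h>

#define GROUP_SIZE    32u   /* threads issuing one load together (assumed) */
#define SEGMENT_BYTES 128u  /* size of one memory transaction (assumed) */
#define WORD_BYTES    4u    /* each thread loads one aligned 4-byte word */

/* Count the distinct segments touched when thread t loads the word at
   index idx[t]. Fewer segments means the reads coalesce better. */
static unsigned segments_touched(const uint32_t idx[GROUP_SIZE])
{
    uint32_t seen[GROUP_SIZE];
    unsigned count = 0;

    for (unsigned t = 0; t < GROUP_SIZE; t++) {
        /* Aligned words never straddle a segment boundary. */
        uint32_t seg = idx[t] / (SEGMENT_BYTES / WORD_BYTES);
        int found = 0;
        for (unsigned i = 0; i < count; i++) {
            if (seen[i] == seg) {
                found = 1;
                break;
            }
        }
        if (!found)
            seen[count++] = seg;
    }
    return count;
}
```

Under these assumptions, 32 threads reading consecutive words touch a single segment, while 32 threads reading words 32 apart touch 32 separate segments. Scattered writes to unpredictable slots behave like the second case, which matches the remark above that the writes cannot be coalesced.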
sr. member
Activity: 728
Merit: 304
Miner Developer
I just pushed a workaround for fglrx to the repo.
Let me know if that works.
Still the same... Login and access credentials are the same as on the 14.04 machine; you can check it out whenever you want.

EDIT: There is also something that keeps bothering me. On 16.04, clinfo recognized 14 CUs out of 40 and the miner was doing 180 S/s; with fglrx, all CUs were recognized correctly and the hash speed was the same. Do you think that this could be due to not fully utilizing the CUs on the chip?

Got it, thanks! Let me get to that when I'm done with GTX 1060.
sr. member
Activity: 652
Merit: 266
I just pushed a workaround for fglrx to the repo.
Let me know if that works.
Still the same... Login and access credentials are the same as on the 14.04 machine; you can check it out whenever you want.

EDIT: There is also something that keeps bothering me. On 16.04, clinfo recognized 14 CUs out of 40 and the miner was doing 180 S/s; with fglrx, all CUs were recognized correctly and the hash speed was the same. Do you think that this could be due to not fully utilizing the CUs on the chip?
sr. member
Activity: 728
Merit: 304
Miner Developer
I just pushed a workaround for fglrx to the repo.
Let me know if that works.
sr. member
Activity: 728
Merit: 304
Miner Developer
If that is the issue, then it could be solved by synchronizing the kernel so all CUs are reading at the same time (copying slots to the LDS), then they all write at the same time.


One more reason to use a GCN assembler, then. How exciting!
sr. member
Activity: 728
Merit: 304
Miner Developer
For Nvidia you must solve the specific memory access pattern; dunno if anyone has managed to get fully coalesced memory transactions, which should provide max performance...

atm the nicehash miner is doing 215-250 sol/s on the 1060 6GB (the 3GB version does less because the GPU is crippled)

Thanks a lot for the heads up. I figured out a way to rearrange elements in the hash tables efficiently, so I can run a ton of experiments to see which access pattern would result in the best performance.
legendary
Activity: 1901
Merit: 1024
For Nvidia you must solve the specific memory access pattern; dunno if anyone has managed to get fully coalesced memory transactions, which should provide max performance...

atm the nicehash miner is doing 215-250 sol/s on the 1060 6GB (the 3GB version does less because the GPU is crippled)
sr. member
Activity: 463
Merit: 250
I'm getting 90 sol/s with GTX 1060... So, I can catch up with Eqminer if I reach 180 sol/s. I see.
Great!! Looking forward to 200 sols on the 1060 on your miner!!
newbie
Activity: 39
Merit: 0
Let me know if you need anyone with windows boxes to test, have a few 480s and 7970s and windows 7/10 around, and would be glad to help with testing. Thanks for doing this open source!
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
Good luck bro^^
But you should overrun them..and beat :-D
sr. member
Activity: 728
Merit: 304
Miner Developer
I'm getting 128 sol/s after 10 min of tweaking.
This whole thing doesn't look hard at all...
sr. member
Activity: 728
Merit: 304
Miner Developer
I'm getting 90 sol/s with GTX 1060... So, I can catch up with Eqminer if I reach 180 sol/s. I see.
sr. member
Activity: 728
Merit: 304
Miner Developer
Please do include an executable of the miner, for those of us who don't know how to compile from source. Thanks

I will do that with each point release. Both for AMD and NVIDIA now. No worries.
sr. member
Activity: 289
Merit: 250
Please do include an executable of the miner, for those of us who don't know how to compile from source. Thanks
sr. member
Activity: 728
Merit: 304
Miner Developer
I know, I know... The newly added complexity actually bothered me quite a bit and I feel bad about making you go through it, but it was necessary to ensure the correctness of the code and maximize LDS usage and thus occupancy. I feel like I have exhausted all the means of optimization at the OpenCL level except for an automatic optimizer as far as RX 480 is concerned. Once I'm done with an on-the-fly optimizer, I will delve into the GCN assembly. I have been experimenting with global synchronization with some pretty interesting results.

As for Tonga and Hawaii, I used to own a whole bunch of them, but I sold them all... I'm thinking about getting a used Nano for testing purposes.

By the way, a new GTX 1060 finally arrived, so I can optimize the miner for NVIDIA cards as well. Good stuff.
Commit 10 and 11 reporting 0S/s on R9 390 under 14.04 fglrx, although GPU usage is 100%


Well, it's fglrx... I don't think the kernel even successfully builds with it. I will implement a workaround.
sr. member
Activity: 652
Merit: 266
I know, I know... The newly added complexity actually bothered me quite a bit and I feel bad about making you go through it, but it was necessary to ensure the correctness of the code and maximize LDS usage and thus occupancy. I feel like I have exhausted all the means of optimization at the OpenCL level except for an automatic optimizer as far as RX 480 is concerned. Once I'm done with an on-the-fly optimizer, I will delve into the GCN assembly. I have been experimenting with global synchronization with some pretty interesting results.

As for Tonga and Hawaii, I used to own a whole bunch of them, but I sold them all... I'm thinking about getting a used Nano for testing purposes.

By the way, a new GTX 1060 finally arrived, so I can optimize the miner for NVIDIA cards as well. Good stuff.
Commit 10 and 11 reporting 0S/s on R9 390 under 14.04 fglrx, although GPU usage is 100%

Quote
Gateless Gate, a Zcash miner
Copyright 2016 zawawa @ bitcointalk.org
Connecting to eu1-zcash.flypool.org:3333
Solver 0.0: launching
Successfully connected to eu1-zcash.flypool.org:3333
Received target 0020c49ba5e353f7ced916872b020c49ba5e353f7ced916872b020c49ba5e353
Received job "a50e8e46b67264ee610b"
Stratum server sent us the first job
Mining on 1 device
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares
Total 0.0 sol/s [dev0 0.0] 0 shares