Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 173. (Read 214458 times)

sr. member
Activity: 291
Merit: 250
I'm still confused about how to build it on Ubuntu 16.04 :))

I got the source and compiled it... but when I run it I get sgminer 5.5.4 and the same hashrate I'm getting from genesis sgminer 5.5.5... R9-390x: 285 h/s and RX470: 188 h/s?!
sr. member
Activity: 728
Merit: 304
Miner Developer
1.3-pre0

Preliminary Testing
my tahiti 7870xt -I 10
GG = 139-142h/s, with very random 5-second spikes up to 160h/s+
Claymore = 140-148h/s - dev fee = 137-145h/s

So it's very very close on tahiti

Will test on 270x and rx 470 4GB Modded Later

So nerdralph is right, then. I'm almost catching up, huh.

Quote
On all AMD Radeon HD 79XX-series GPUs, there are 12 channels. A crossbar distributes the load to the appropriate memory channel. Each memory channel has a read/write global L2 cache, with 64 kB per channel. The cache line size is 64 bytes. . . . On AMD Radeon HD 78XX GPUs, the channel selection are bits 10:8 of the byte address. For the AMD Radeon  HD 77XX, the channel selection are bits 9:8 of the byte address. This means a linear burst switches channels every 256 bytes.
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/
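
As a minimal sketch of the channel selection described in the quote: the channel index is just a bit field of the byte address. The function names below are hypothetical, and only the two cases the quote spells out (HD 78XX, bits 10:8; HD 77XX, bits 9:8) are covered. The 12-channel HD 79XX mapping is not given in the quoted text, so it is left out.

```python
def memory_channel(byte_address, low_bit, num_bits):
    """Extract the channel index from bits [low_bit + num_bits - 1 : low_bit]."""
    return (byte_address >> low_bit) & ((1 << num_bits) - 1)

def channel_78xx(byte_address):
    # HD 78XX: channel selection is bits 10:8 (3 bits, channels 0..7).
    return memory_channel(byte_address, 8, 3)

def channel_77xx(byte_address):
    # HD 77XX: channel selection is bits 9:8 (2 bits, channels 0..3).
    return memory_channel(byte_address, 8, 2)

# A linear burst stays on one channel for 256 bytes, then moves to the next.
burst = [channel_78xx(addr) for addr in range(0, 9 * 256, 256)]
print(burst)
```

Addresses 0..255 map to channel 0, 256..511 to channel 1, and so on, wrapping back to channel 0 after 2048 bytes on HD 78XX.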
sr. member
Activity: 450
Merit: 255
1.3-pre0

Preliminary Testing
my tahiti 7870xt -I 10
GG = 139-142h/s, with very random 5-second spikes up to 160h/s+
Claymore = 140-148h/s - dev fee = 137-145h/s

So it's very very close on tahiti

Will test on 270x and rx 470 4GB Modded Later
sr. member
Activity: 588
Merit: 251
I urgently need to set up clang/llvm, as I feel like I'm losing hair dealing with disassembled code...

If you are using Ubuntu for your development setup, it's pretty easy.

Code:
miner@l1:~/bin$ apt-cache policy clang-3.9
clang-3.9:
  Installed: 1:3.9~svn288847-1~exp1
  Candidate: 1:3.9~svn288847-1~exp1
  Version table:
 *** 1:3.9~svn288847-1~exp1 0
        500 http://llvm.org/apt/trusty/ llvm-toolchain-trusty-3.9/main amd64 Packages

You'll also probably want libclc-dev.
sr. member
Activity: 728
Merit: 304
Miner Developer
I just uploaded a new pre-release:

https://github.com/zawawawa/gatelessgate/releases/tag/v0.1.3-pre0

The new assembly version is for GCN1 and Windows only for now.
I will work on the Linux version today.
As always, I appreciate your feedback, donations, and even stars on GitHub. Enjoy!
sr. member
Activity: 728
Merit: 304
Miner Developer
My tentative impression is that GDS is much slower than your descriptions, though.
Do you have any actual numbers from your experiments with GDS?
I'm beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

I've done very little coding in the last few weeks due to other priorities.
I did look back over the docs, and they clearly state, "The GDS is identical to the local data shares, except that it is shared by all compute units".  Access is through the export unit instead of local to the CU, but I haven't seen a single reference that suggests that adds any latency.  Since the (LDS/GDS) are independent, it increases the potential for concurrency since a LDS and a GDS instruction can be concurrently dispatched.

Although I haven't looked at Optiminer's asm code yet, I'd wager that the primary gains are from GDS.  As I've previously pointed out, the core clock is the limit when the row counters are in L2.  With 32 bytes of data per slot, it's impossible to process 8 slots in anything less than 5 core clocks (I'm talking Rx and other devices with 4 memory channels).  2 of those are for updating the row counters, so using the GDS allows you to get that down to 3.  It looks like ds_add_rtn_u32 takes 2 clocks (vs 1 for ds_read_b32), but that still allows you to update 32 row counters in 2 cycles.

It's unclear how GDS arbitration is done, so my guess is that conflicts for GDS access between CUs are exacerbating the problem.  Specifically, I suspect GDS access by one CU blocks access by all other CUs, even if the other CUs are attempting to access idle banks in the GDS.  I'd suggest using a 4-way (or even 8) xor_and_store and 64 local-work size.  With an 8-way store, 32 work-items will usually (~80% of the time) hit 4 banks with no conflicts.  With a 4-way store, 87.5% of the time you'll get a bank conflict causing the ds_add to take 4 or more cycles instead of 2 (though 12.5% of the time you'll update 8 counters in 2 cycles).



Very insightful. Thank you so much! I will start with 4-way writes and 128 work-items and see how that changes things. In the meantime, I urgently need to set up clang/llvm, as I feel like I'm losing hair dealing with disassembled code...
sr. member
Activity: 588
Merit: 251
It seems like the speedup with GDS counters would be a little over 10%, which is consistent with what I observed with the ASM and non-ASM versions of Claymore's. There are other really neat tricks with the GCN assembly, but I don't think Claymore used them. (His real strength is at the algorithmic level anyway.) I will now work on GDS counters for RX 480 on Linux, and I will prepare a next point release when I'm done.

Maybe 10% is all you can get for Tahiti, but for Ellesmere you should be able to get a 20-25% boost.  Even if Optiminer has a reduced dev fee of 5%, the gross speed for the Rx 480 would be 270/.95 = 284.
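
The fee arithmetic above can be checked with a short sketch. It assumes that 270 is the rate reported after the dev fee is taken and that the fee is a flat fraction of mining time; the function name is hypothetical.

```python
def gross_rate(net_rate, dev_fee):
    """Rate before the dev fee, given the rate observed after it."""
    return net_rate / (1.0 - dev_fee)

# 270 net with a 5% fee -> 270 / 0.95
print(round(gross_rate(270, 0.05)))  # 284
```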
sr. member
Activity: 588
Merit: 251
I am sick and tired of inserting GDS instructions into 6000 lines of disassembled GCN code every time I update the OpenCL kernel. Let's see if I can use GCN inline assembly with LLVM...

With clang/llvm 3.9, generating asm from OpenCL + inline asm was pretty easy:
Code:
${CLANG} -x cl -Xclang -finclude-default-header -Dcl_clang_storage_class_specifiers -target amdgcn -mcpu=tonga -S -o ${f}.s ${f}.cl

Linking with libclc was where I ran into problems.
sr. member
Activity: 588
Merit: 251
My tentative impression is that GDS is much slower than your descriptions, though.
Do you have any actual numbers from your experiments with GDS?
I'm beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

I've done very little coding in the last few weeks due to other priorities.
I did look back over the docs, and they clearly state, "The GDS is identical to the local data shares, except that it is shared by all compute units".  Access is through the export unit instead of local to the CU, but I haven't seen a single reference that suggests that adds any latency.  Since the (LDS/GDS) are independent, it increases the potential for concurrency since a LDS and a GDS instruction can be concurrently dispatched.

Although I haven't looked at Optiminer's asm code yet, I'd wager that the primary gains are from GDS.  As I've previously pointed out, the core clock is the limit when the row counters are in L2.  With 32 bytes of data per slot, it's impossible to process 8 slots in anything less than 5 core clocks (I'm talking Rx and other devices with 4 memory channels).  2 of those are for updating the row counters, so using the GDS allows you to get that down to 3.  It looks like ds_add_rtn_u32 takes 2 clocks (vs 1 for ds_read_b32), but that still allows you to update 32 row counters in 2 cycles.

It's unclear how GDS arbitration is done, so my guess is that conflicts for GDS access between CUs are exacerbating the problem.  Specifically, I suspect GDS access by one CU blocks access by all other CUs, even if the other CUs are attempting to access idle banks in the GDS.  I'd suggest using a 4-way (or even 8) xor_and_store and 64 local-work size.  With an 8-way store, 32 work-items will usually (~80% of the time) hit 4 banks with no conflicts.  With a 4-way store, 87.5% of the time you'll get a bank conflict causing the ds_add to take 4 or more cycles instead of 2 (though 12.5% of the time you'll update 8 counters in 2 cycles).
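
A rough way to sanity-check the ~80% figure is a birthday-style calculation. This is only a sketch under a loudly stated assumption: each counter update lands on an independent, uniformly random bank out of 32 (the bank count mentioned elsewhere in this thread), and an 8-way store means 32 work-items produce 4 counter updates. The function name is hypothetical.

```python
from fractions import Fraction

def p_no_conflict(updates, banks=32):
    """Probability that `updates` uniformly random bank picks are all distinct."""
    p = Fraction(1)
    for i in range(updates):
        p *= Fraction(banks - i, banks)
    return p

# 8-way store: 32 work-items -> 4 counter updates per instruction.
p4 = p_no_conflict(4)
# 4-way store: 32 work-items -> 8 counter updates per instruction.
p8 = p_no_conflict(8)

print(f"4 updates, no conflict: {float(p4):.3f}")
print(f"8 updates, no conflict: {float(p8):.3f}")
```

Under this simple model the 4-update case comes out at about 0.82, in line with the ~80% quoted above. The 8-update case comes out at about 0.39 conflict-free, which does not reproduce the 12.5% / 87.5% split, so that figure presumably rests on a different assumption about how row counters map to banks.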

hero member
Activity: 906
Merit: 507
I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?

I will upload a new version tomorrow, so we should wait until then.
Thanks. I will also use the bat file that was posted to mine some for you, for the great work you're doing.
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems like the speedup with GDS counters would be a little over 10%, which is consistent with what I observed with the ASM and non-ASM versions of Claymore's. There are other really neat tricks with the GCN assembly, but I don't think Claymore used them. (His real strength is at the algorithmic level anyway.) I will now work on GDS counters for RX 480 on Linux, and I will prepare a next point release when I'm done.
sr. member
Activity: 728
Merit: 304
Miner Developer
I am sick and tired of inserting GDS instructions into 6000 lines of disassembled GCN code every time I update the OpenCL kernel. Let's see if I can use GCN inline assembly with LLVM...
sr. member
Activity: 728
Merit: 304
Miner Developer
I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?

I will upload a new version tomorrow, so we should wait until then.
hero member
Activity: 906
Merit: 507
I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?
sr. member
Activity: 728
Merit: 304
Miner Developer
Hm, let me try atomic_inc, then. Thanks a lot for the pointers!
My tentative impression is that GDS is much slower than your descriptions, though.
Do you have any actual numbers from your experiments with GDS?
I'm beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.
sr. member
Activity: 588
Merit: 251
Now that I think about it, it is no wonder that bank conflicts would be a serious problem, considering that the GDS's 32 banks are shared across all the compute units, unlike LDS. My next game plan is to reduce the number of wavefronts to avoid bank conflicts in GDS. We will see.

But that's still much better than 4 or 8 memory channels when the row counters are stored in RAM/L2.  On Hawaii with 8 memory channels you should be able to do 4 GDS updates for every write to RAM.

Given that the architectural description of the GDS is a bit vague, it's possible that the atomic units only support single-cycle increment/decrement, while add might lock the GDS for two or three cycles to do read/add/write.  Even then, your bandwidth limit should still be the external memory channels.
sr. member
Activity: 588
Merit: 251
Looks like GG is not taking advantage of the available memory bandwidth of 7990. No wonder...



Tahiti is quirky with 6 memory channels, making memory stride calculations more complicated than 4 or 8 channels.
sr. member
Activity: 728
Merit: 304
Miner Developer
I think I need to reintroduce the variable NR_ROWS_LOG.
I dropped it when I switched to sgminer-gm because the code became too complex and the AMD OpenCL driver crapped out.
I need to keep the code really simple this time around.
sr. member
Activity: 728
Merit: 304
Miner Developer
Looks like GG is not taking advantage of the available memory bandwidth of 7990. No wonder...

sr. member
Activity: 728
Merit: 304
Miner Developer
This is a profile of the non-GDS version. This looks OK...
