Author

Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 194. (Read 214410 times)

sr. member
Activity: 728
Merit: 304
Miner Developer
I know, I know... The newly added complexity actually bothered me quite a bit and I feel bad about making you go through it, but it was necessary to ensure the correctness of the code and maximize LDS usage and thus occupancy. I feel like I have exhausted all the means of optimization at the OpenCL level except for an automatic optimizer as far as RX 480 is concerned. Once I'm done with an on-the-fly optimizer, I will delve into the GCN assembly. I have been experimenting with global synchronization with some pretty interesting results.

As for Tonga and Hawaii, I used to own a whole bunch of them, but I sold them all... I'm thinking about getting a used Nano for testing purposes.

By the way, a new GTX 1060 finally arrived, so I can optimize the miner for NVIDIA cards as well. Good stuff.
sr. member
Activity: 588
Merit: 251
Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

If you have access to a Tonga or Hawaii card I'd suggest testing with one of those as well.  Ellesmere's sequential copy performance is much worse than Tonga and Hawaii in testing with my cl-mem utility.
https://github.com/nerdralph/cl-mem

Some of your changes could be faster on other GPUs even if they are slower on your Rx 480.  The slow copy speed on Ellesmere suggests the memory controller is not batching reads and writes as well as the older parts, causing the performance to be impacted by the bus turn around time.  If that is the issue, then it could be solved by synchronizing the kernel so all CUs are reading at the same time (copying slots to the LDS), then they all write at the same time.

As a general comment, your code has been getting more complicated and therefore takes more work to follow.  I know sometimes you can't avoid adding complexity when you are tuning performance, but don't forget the best optimizations are the simple ones that reduce code size/complexity.
sr. member
Activity: 728
Merit: 304
Miner Developer
Not so fast  Wink Easy, easy...
sr. member
Activity: 652
Merit: 266
Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.
I've turned on the linux server if you want to test.
sr. member
Activity: 728
Merit: 304
Miner Developer
Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.
sr. member
Activity: 588
Merit: 251
sr. member
Activity: 728
Merit: 304
Miner Developer
In any case, this is the portion of the ISA in question for 2-way writes.
It seems very clean to me with a single FLAT_STORE_DWORDX4 at the end.
Maybe I need to separate reads/XOR's and writes into two sections.

Code:
label_0228:
 0x003CA0 S_ANDN2_B64 exec s[28:29] exec 4 Scalar 89FE7E1C
 0x003CA4 V_MOV_B32 v28 0 4 Vector ALU 7E380280
 0x003CA8 S_MOV_B64 exec s[28:29] 4 Scalar BEFE011C
 0x003CAC S_WAITCNT vmcnt(0) Varies Flow Control BF8C0F70
 0x003CB0 DS_READ_B64 v[33:34] v19 offset:7024 Varies LDS D8EC1B70 21000013
 0x003CB8 S_WAITCNT lgkmcnt(0) Varies Flow Control BF8C007F
 0x003CBC V_CMP_NE_I64 vcc 0 v[33:34] Varies Vector ALU 7DCA4280
 0x003CC0 S_AND_SAVEEXEC_B64 s[28:29] vcc 4 Scalar BE9C206A
 0x003CC4 S_CBRANCH_EXECZ label_0261 4/16 Branch BF88002F
 0x003CC8 S_MOV_B32 s8 0x05040c00 4 Scalar BE8800FF 05040C00
 0x003CD0 S_MOV_B32 s30 0x0c0c000c 4 Scalar BE9E00FF 0C0C000C
 0x003CD8 V_PERM_B32 v35 v13 v43 s8 4 Vector ALU D1ED0023 0022570D
 0x003CE0 V_PERM_B32 v36 v44 v44 s30 4 Vector ALU D1ED0024 007A592C
 0x003CE8 S_MOV_B32 s8 0x04030201 4 Scalar BE8800FF 04030201
 0x003CF0 V_OR_B32 v35 v35 v36 4 Vector ALU 28464923
 0x003CF4 V_PERM_B32 v8 v48 v45 s8 4 Vector ALU D1ED0008 00225B30
 0x003CFC V_PERM_B32 v10 v39 v48 s8 4 Vector ALU D1ED000A 00226127
 0x003D04 V_PERM_B32 v39 v40 v39 s8 4 Vector ALU D1ED0027 00224F28
 0x003D0C V_MOV_B32 v52 v35 4 Vector ALU 7E680323
 0x003D10 V_MOV_B32 v53 v8 4 Vector ALU 7E6A0308
 0x003D14 V_PERM_B32 v8 v29 v40 s8 4 Vector ALU D1ED0008 0022511D
 0x003D1C V_MOV_B32 v54 v10 4 Vector ALU 7E6C030A
 0x003D20 V_PERM_B32 v10 v24 v29 s8 4 Vector ALU D1ED000A 00223B18
 0x003D28 V_MOV_B32 v55 v39 4 Vector ALU 7E6E0327
 0x003D2C V_LSHRREV_B32 v24 8 v24 4 Vector ALU 20303088
 0x003D30 V_MOV_B32 v56 v8 4 Vector ALU 7E700308
 0x003D34 V_MOV_B32 v57 v10 4 Vector ALU 7E72030A
 0x003D38 V_MOV_B32 v58 v24 4 Vector ALU 7E740318
 0x003D3C V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D40 V_CNDMASK_B32 v8 v52 v56 vcc 4 Vector ALU 00107134
 0x003D44 V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D48 V_CNDMASK_B32 v10 v53 v57 vcc 4 Vector ALU 00147335
 0x003D4C V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D50 V_CNDMASK_B32 v45 v54 v58 vcc 4 Vector ALU 005A7536
 0x003D54 V_MOV_B32 v60 v55 4 Vector ALU 7E780337
 0x003D58 V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D5C V_CNDMASK_B32 v48 v55 v59 vcc 4 Vector ALU 00607737
 0x003D60 V_ADD_U32 v33 vcc v33 v22 4 Vector ALU 32422D21
 0x003D64 V_ADDC_U32 v34 vcc v34 0 vcc 4 Vector ALU D11C6A22 01A90122
 0x003D6C V_MOV_B32 v35 v8 4 Vector ALU 7E460308
 0x003D70 V_MOV_B32 v36 v10 4 Vector ALU 7E48030A
 0x003D74 V_MOV_B32 v37 v45 4 Vector ALU 7E4A032D
 0x003D78 V_MOV_B32 v38 v48 4 Vector ALU 7E4C0330
 0x003D7C FLAT_STORE_DWORDX4 v[33:34] v[35:38]
sr. member
Activity: 728
Merit: 304
Miner Developer
I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.


Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs.  What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html


I don't think the specs prohibit the syntax. See:

https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/opencl-c/
http://stackoverflow.com/questions/11978024/opencl-store-pointer-to-global-memory-in-local-memory

Code:
__global char * __local lgc[8];  // 8 pointers stored on the local memory that points to a char located on the global memory

I appreciate your detail-oriented approach, though  Wink
sr. member
Activity: 588
Merit: 251
I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.


Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs.  What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html
sr. member
Activity: 728
Merit: 304
Miner Developer
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^


That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.


Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.
sr. member
Activity: 588
Merit: 251
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^


That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.
sr. member
Activity: 728
Merit: 304
Miner Developer
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^


That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.
sr. member
Activity: 588
Merit: 251
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

p.s. even with 1 thread per write, although it builds, no solutions are found (make test fails).

p.p.s I tried going back to v0.0.1, but it seems I also need to merge back the unix compile fixes first...
sr. member
Activity: 728
Merit: 304
Miner Developer
I just ordered GTX 1060, and I haven't told my wife about it yet.
Donations are always welcome, guys! My BTC address is in my signature.
sr. member
Activity: 652
Merit: 266
Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.
Ok Smiley
I'm just giving some feedback.
sr. member
Activity: 728
Merit: 304
Miner Developer
Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.
sr. member
Activity: 652
Merit: 266
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?
1 - 168/170S/s
2 - 148/150S/s
4 - 128/130S/s
8 - 82/84S/s

Changing threads in param.h
So basicly there is no change except that multithreading doesn't seem to work under linux as supposed to.

EDIT: -t value has no effect as of "THREADS_PER_WRITE" , it has to be hardcoded in param.h and recompiled to have effect.
sr. member
Activity: 728
Merit: 304
Miner Developer
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?
sr. member
Activity: 652
Merit: 266
Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
It's slower, you can check for yourself I've updated repo on the test linux.
sr. member
Activity: 728
Merit: 304
Miner Developer
R9 280x w/ modded bios - 85 s/s with instances=1 and 90-95 s/s with instances=2(not stable), like as original SA miner v.5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending from memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

 It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1 though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles) but offered no performance advantage).

 16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WQHL seems to be the only real difference between those two for miners).
 They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

 16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
 Avoid it.



 I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

 (This will of course change when Vega hits the street and requires newer drivers for support).






Thanks a lot for the great suggestion. I will definitely consider that.
Jump to: