Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 194.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I know, I know... The newly added complexity actually bothered me quite a bit and I feel bad about making you go through it, but it was necessary to ensure the correctness of the code and maximize LDS usage and thus occupancy. I feel like I have exhausted all the means of optimization at the OpenCL level except for an automatic optimizer as far as RX 480 is concerned. Once I'm done with an on-the-fly optimizer, I will delve into the GCN assembly. I have been experimenting with global synchronization with some pretty interesting results.

As for Tonga and Hawaii, I used to own a whole bunch of them, but I sold them all... I'm thinking about getting a used Nano for testing purposes.

By the way, a new GTX 1060 finally arrived, so I can optimize the miner for NVIDIA cards as well. Good stuff.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on December 16, 2016, 04:36:01 PM

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

If you have access to a Tonga or Hawaii card I'd suggest testing with one of those as well. Ellesmere's sequential copy performance is much worse than Tonga and Hawaii in testing with my cl-mem utility.
https://github.com/nerdralph/cl-mem

Some of your changes could be faster on other GPUs even if they are slower on your Rx 480. The slow copy speed on Ellesmere suggests the memory controller is not batching reads and writes as well as the older parts, causing the performance to be impacted by the bus turn around time. If that is the issue, then it could be solved by synchronizing the kernel so all CUs are reading at the same time (copying slots to the LDS), then they all write at the same time.

As a general comment, your code has been getting more complicated and therefore takes more work to follow. I know sometimes you can't avoid adding complexity when you are tuning performance, but don't forget the best optimizations are the simple ones that reduce code size/complexity.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Not so fast Wink

Easy, easy...

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 16, 2016, 04:36:01 PM

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.

I've turned on the linux server if you want to test.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.

nerdralph

sr. member

Activity: 588

Merit: 251

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

In any case, this is the portion of the ISA in question for 2-way writes.
It seems very clean to me with a single FLAT_STORE_DWORDX4 at the end.
Maybe I need to separate reads/XOR's and writes into two sections.

Code:

label_0228:
 0x003CA0	S_ANDN2_B64	exec s[28:29] exec	 4 	 Scalar	89FE7E1C
 0x003CA4	V_MOV_B32	v28 0	 4 	 Vector ALU	7E380280
 0x003CA8	S_MOV_B64	exec s[28:29]	 4 	 Scalar	BEFE011C
 0x003CAC	S_WAITCNT	vmcnt(0)	 Varies 	 Flow Control	BF8C0F70
 0x003CB0	DS_READ_B64	v[33:34] v19 offset:7024	 Varies 	 LDS	D8EC1B70 21000013
 0x003CB8	S_WAITCNT	lgkmcnt(0)	 Varies 	 Flow Control	BF8C007F
 0x003CBC	V_CMP_NE_I64	vcc 0 v[33:34]	 Varies 	 Vector ALU	7DCA4280
 0x003CC0	S_AND_SAVEEXEC_B64	s[28:29] vcc	 4 	 Scalar	BE9C206A
 0x003CC4	S_CBRANCH_EXECZ	label_0261	 4/16 	 Branch	BF88002F
 0x003CC8	S_MOV_B32	s8 0x05040c00	 4 	 Scalar	BE8800FF 05040C00
 0x003CD0	S_MOV_B32	s30 0x0c0c000c	 4 	 Scalar	BE9E00FF 0C0C000C
 0x003CD8	V_PERM_B32	v35 v13 v43 s8	 4 	 Vector ALU	D1ED0023 0022570D
 0x003CE0	V_PERM_B32	v36 v44 v44 s30	 4 	 Vector ALU	D1ED0024 007A592C
 0x003CE8	S_MOV_B32	s8 0x04030201	 4 	 Scalar	BE8800FF 04030201
 0x003CF0	V_OR_B32	v35 v35 v36	 4 	 Vector ALU	28464923
 0x003CF4	V_PERM_B32	v8 v48 v45 s8	 4 	 Vector ALU	D1ED0008 00225B30
 0x003CFC	V_PERM_B32	v10 v39 v48 s8	 4 	 Vector ALU	D1ED000A 00226127
 0x003D04	V_PERM_B32	v39 v40 v39 s8	 4 	 Vector ALU	D1ED0027 00224F28
 0x003D0C	V_MOV_B32	v52 v35	 4 	 Vector ALU	7E680323
 0x003D10	V_MOV_B32	v53 v8	 4 	 Vector ALU	7E6A0308
 0x003D14	V_PERM_B32	v8 v29 v40 s8	 4 	 Vector ALU	D1ED0008 0022511D
 0x003D1C	V_MOV_B32	v54 v10	 4 	 Vector ALU	7E6C030A
 0x003D20	V_PERM_B32	v10 v24 v29 s8	 4 	 Vector ALU	D1ED000A 00223B18
 0x003D28	V_MOV_B32	v55 v39	 4 	 Vector ALU	7E6E0327
 0x003D2C	V_LSHRREV_B32	v24 8 v24	 4 	 Vector ALU	20303088
 0x003D30	V_MOV_B32	v56 v8	 4 	 Vector ALU	7E700308
 0x003D34	V_MOV_B32	v57 v10	 4 	 Vector ALU	7E72030A
 0x003D38	V_MOV_B32	v58 v24	 4 	 Vector ALU	7E740318
 0x003D3C	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D40	V_CNDMASK_B32	v8 v52 v56 vcc	 4 	 Vector ALU	00107134
 0x003D44	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D48	V_CNDMASK_B32	v10 v53 v57 vcc	 4 	 Vector ALU	00147335
 0x003D4C	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D50	V_CNDMASK_B32	v45 v54 v58 vcc	 4 	 Vector ALU	005A7536
 0x003D54	V_MOV_B32	v60 v55	 4 	 Vector ALU	7E780337
 0x003D58	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D5C	V_CNDMASK_B32	v48 v55 v59 vcc	 4 	 Vector ALU	00607737
 0x003D60	V_ADD_U32	v33 vcc v33 v22	 4 	 Vector ALU	32422D21
 0x003D64	V_ADDC_U32	v34 vcc v34 0 vcc	 4 	 Vector ALU	D11C6A22 01A90122
 0x003D6C	V_MOV_B32	v35 v8	 4 	 Vector ALU	7E460308
 0x003D70	V_MOV_B32	v36 v10	 4 	 Vector ALU	7E48030A
 0x003D74	V_MOV_B32	v37 v45	 4 	 Vector ALU	7E4A032D
 0x003D78	V_MOV_B32	v38 v48	 4 	 Vector ALU	7E4C0330
 0x003D7C	FLAT_STORE_DWORDX4	v[33:34] v[35:38]

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on December 16, 2016, 12:13:32 PM

Quote from: zawawa on December 16, 2016, 11:48:50 AM

Quote from: nerdralph on December 16, 2016, 11:24:47 AM

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs. What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html

I don't think the specs prohibit the syntax. See:

https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/opencl-c/
http://stackoverflow.com/questions/11978024/opencl-store-pointer-to-global-memory-in-local-memory

Code:

__global char * __local lgc[8];  // 8 pointers stored on the local memory that points to a char located on the global memory

I appreciate your detail-oriented approach, though Wink

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on December 16, 2016, 11:48:50 AM

Quote from: nerdralph on December 16, 2016, 11:24:47 AM

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs. What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on December 16, 2016, 11:24:47 AM

Quote from: zawawa on December 16, 2016, 11:06:08 AM

Quote from: nerdralph on December 16, 2016, 10:51:47 AM

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on December 16, 2016, 11:06:08 AM

Quote from: nerdralph on December 16, 2016, 10:51:47 AM

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on December 16, 2016, 10:51:47 AM

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

p.s. even with 1 thread per write, although it builds, no solutions are found (make test fails).

p.p.s I tried going back to v0.0.1, but it seems I also need to merge back the unix compile fixes first...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I just ordered GTX 1060, and I haven't told my wife about it yet.
Donations are always welcome, guys! My BTC address is in my signature.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 16, 2016, 09:15:57 AM

Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.

Ok

I'm just giving some feedback.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 16, 2016, 04:40:30 AM

Quote from: laik2 on December 16, 2016, 04:24:09 AM

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?

1 - 168/170S/s
2 - 148/150S/s
4 - 128/130S/s
8 - 82/84S/s

Changing threads in param.h
So basicly there is no change except that multithreading doesn't seem to work under linux as supposed to.

EDIT: -t value has no effect as of "THREADS_PER_WRITE" , it has to be hardcoded in param.h and recompiled to have effect.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: laik2 on December 16, 2016, 04:24:09 AM

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 16, 2016, 04:10:00 AM

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

It's slower, you can check for yourself I've updated repo on the test linux.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: QuintLeo on December 16, 2016, 04:08:07 AM

Quote from: zawawa on December 15, 2016, 04:14:51 PM

Quote from: Vetal_inside on December 15, 2016, 02:56:28 PM

R9 280x w/ modded bios - 85 s/s with instances=1 and 90-95 s/s with instances=2(not stable), like as original SA miner v.5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending from memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1 though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles) but offered no performance advantage).

16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WQHL seems to be the only real difference between those two for miners).
They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
Avoid it.

I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

(This will of course change when Vega hits the street and requires newer drivers for support).

Thanks a lot for the great suggestion. I will definitely consider that.

Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 194. (Read 214463 times)