Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 195.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on December 16, 2016, 12:08:45 AM

Did you check the ISA to make sure your 4-way ht_store has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
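To see why doubling register usage hurts occupancy, here is a rough sketch of the VGPR limit on GCN. It assumes the usual GCN figures (256 VGPRs per lane on each SIMD, at most 10 wavefronts per SIMD, VGPRs allocated in groups of 4) and ignores SGPR and LDS limits. The function name and the register counts in the note below are hypothetical; the thread does not give the kernel's actual register usage.

```c
/* Wavefronts per SIMD as limited by VGPR usage alone (GCN assumptions:
   256 VGPRs per lane, cap of 10 waves, allocation granularity of 4).
   SGPR and LDS limits are deliberately ignored in this sketch. */
static unsigned gcn_waves_per_simd(unsigned vgprs_per_thread)
{
    const unsigned vgpr_budget = 256;
    const unsigned max_waves = 10;
    const unsigned granule = 4;
    unsigned allocated;
    unsigned waves;

    if (vgprs_per_thread == 0)
        return max_waves;

    /* Round the request up to the allocation granularity. */
    allocated = (vgprs_per_thread + granule - 1) / granule * granule;
    waves = vgpr_budget / allocated;
    return waves < max_waves ? waves : max_waves;
}
```

Under these assumptions a kernel using 24 VGPRs fits 10 waves per SIMD, while one using 48 fits only 5, so a doubling in register usage can halve the number of waves available to hide memory latency.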

QuintLeo

legendary

Activity: 1498

Merit: 1030

Quote from: zawawa on December 15, 2016, 04:14:51 PM

Quote from: Vetal_inside on December 15, 2016, 02:56:28 PM

R9 280x w/ modded BIOS - 85 s/s with instances=1 and 90-95 s/s with instances=2 (not stable), same as the original SA miner v5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending on memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1, though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles but offered no performance advantage).

16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WHQL-certified, which seems to be the only real difference between those two for miners).
They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
Avoid it.

I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

(This will of course change when Vega hits the street and requires newer drivers for support).

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: zawawa on December 15, 2016, 10:52:25 PM

Quote from: bigchirv on December 15, 2016, 10:31:09 PM

Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or OpenCL, for that matter), but I'm a fan of DRY, so when I was reading input.cl I found the get_row() function, and I think we can make it a little bit DRYer by doing something like this:

Code:

uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  num = (0x40 << swp) - 1;  /* 0x3f, 0x7f, or 0xff */
  if (!(round % 2))
    row = xi0 & ((num << 8) | 0xff);
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.

I appreciate your enthusiasm and willingness to help, but I will keep the current code. With GPGPU, and especially with AMD OpenCL drivers, repeats are often better because you can keep register usage low that way, which is crucially important. My general approach toward GPGPU is that I sacrifice everything for performance, including readability.

Actually, let me clean that up. It may be nice.

EDIT:

This could be cleaned further, but the use of the ternary operator encourages the compiler to use v_cndmask_b32 instead of branching.

Code:

uint get_row(uint round, uint xi0)
{
	uint swp;
	uint num;
	
	#if NR_ROWS_LOG == 14
	swp = 0;
	#elif NR_ROWS_LOG == 15
	swp = 1;
	#elif NR_ROWS_LOG == 16
	swp = 2;
	#else
	#error "unsupported NR_ROWS_LOG"
	#endif
	num = (0x40 << swp) - 1;  /* 0x3f, 0x7f, or 0xff */

	return (round & 1) ? (((xi0 & ((num << 16) | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24)) : (xi0 & ((num << 8) | 0xff));
}
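A quick way to sanity-check the cleaned-up get_row() on the host is to mirror it in plain C. The sketch below is only an illustration: get_row_ref and its rows_log parameter are hypothetical stand-ins for the kernel function and the NR_ROWS_LOG constant, and it assumes the mask constant is meant as hexadecimal 0x40, since that is what yields a contiguous NR_ROWS_LOG-bit row index.

```c
#include <stdint.h>

/* Hypothetical host-side mirror of get_row(). rows_log replaces the
   NR_ROWS_LOG compile-time constant so 14, 15, and 16 can all be tested.
   Assumes the mask base is 0x40 (hex), not decimal 40. */
static uint32_t get_row_ref(uint32_t rows_log, uint32_t round, uint32_t xi0)
{
    uint32_t swp = rows_log - 14;        /* 0, 1, 2 for 14, 15, 16 */
    uint32_t num = (0x40u << swp) - 1;   /* 0x3f, 0x7f, 0xff */

    /* Ternary form, as in the post above, rather than if/else. */
    return (round & 1)
        ? (((xi0 & ((num << 16) | 0xf00u)) >> 8) | ((xi0 & 0xf0000000u) >> 24))
        : (xi0 & ((num << 8) | 0xffu));
}
```

With xi0 set to all ones, both the even-round and odd-round paths return 0x3fff for rows_log 14, which is the full 14-bit range. With decimal 40 the even-round mask would instead be 0x27ff, which skips rows.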

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on December 15, 2016, 10:10:03 PM

Quote from: nerdralph on December 15, 2016, 06:31:30 PM

Not bad zawawa. You still have room to improve ht_store.

Code:

p = slot.ui8

Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.

Code:

p = slot.ui4[0]

Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will be just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips. With NR_SLOTS even, the first write to a given row will always be to an even memory chip. With more slots per row this becomes less significant because the rows don't fill up equally. Using an odd number for NR_SLOTS may also reduce channel conflicts.

I tried 4-way writes with mixed results. The 4-way write version was actually slower than the single-thread-write version, but the former seems to speed up the last few rounds. It makes sense as these rounds are more memory-intensive. I will explore this approach further.

Did you check the ISA to make sure your 4-way ht_store has a single store_dwordx2? Branching code where each thread executes a different store instruction will make performance worse because the L1 cache is write-through, resulting in 4 cache lines written to the L2 with 8 dirty bytes each instead of 1 cache line written with 32 dirty bytes. If you send me your 4-way code I can take a look. It should even be possible to do it as a 2-way where the thread pairs execute a store_dwordx4.
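The n-way write idea can be modeled on the host as follows. This is only a sketch of the data layout, not the miner's actual ht_store: the function names are hypothetical, a 32-byte slot is assumed from the discussion above, and the loop stands in for n GPU threads that would each run the same store instruction in lockstep.

```c
#include <stdint.h>
#include <string.h>

enum { SLOT_BYTES = 32 };  /* slot size assumed from the discussion above */

/* What a single lane does: copy its own 32/n-byte share of the slot.
   Every lane runs the same code path; only the offset differs. */
static void ht_store_lane(uint8_t *dst, const uint8_t *src,
                          unsigned lane, unsigned n)
{
    unsigned chunk = SLOT_BYTES / n;   /* 8 bytes when n == 4 */
    memcpy(dst + lane * chunk, src + lane * chunk, chunk);
}

/* Host-side stand-in for n lanes cooperating on one slot. */
static void ht_store_nway(uint8_t *dst, const uint8_t *src, unsigned n)
{
    unsigned lane;
    for (lane = 0; lane < n; lane++)
        ht_store_lane(dst, src, lane, n);
}
```

With n = 4 each lane writes 8 bytes, matching a store_dwordx2 per thread. With n = 2 each lane writes 16 bytes, matching the store_dwordx4 pairing suggested above. The key property is that the lanes differ only in their offset, so no branching on the lane index is needed.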

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: bigchirv on December 15, 2016, 10:31:09 PM

Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or OpenCL, for that matter), but I'm a fan of DRY, so when I was reading input.cl I found the get_row() function, and I think we can make it a little bit DRYer by doing something like this:

Code:

uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  num = (0x40 << swp) - 1;  /* 0x3f, 0x7f, or 0xff */
  if (!(round % 2))
    row = xi0 & ((num << 8) | 0xff);
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.

I appreciate your enthusiasm and willingness to help, but I will keep the current code. With GPGPU, and especially with AMD OpenCL drivers, repeats are often better because you can keep register usage low that way, which is crucially important. My general approach toward GPGPU is that I sacrifice everything for performance, including readability.

bigchirv

newbie

Activity: 19

Merit: 0

Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or OpenCL, for that matter), but I'm a fan of DRY, so when I was reading input.cl I found the get_row() function, and I think we can make it a little bit DRYer by doing something like this:

Code:

uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  num = (0x40 << swp) - 1;  /* 0x3f, 0x7f, or 0xff */
  if (!(round % 2))
    row = xi0 & ((num << 8) | 0xff);
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on December 15, 2016, 06:31:30 PM

Not bad zawawa. You still have room to improve ht_store.

Code:

p = slot.ui8

Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.

Code:

p = slot.ui4[0]

Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will be just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips. With NR_SLOTS even, the first write to a given row will always be to an even memory chip. With more slots per row this becomes less significant because the rows don't fill up equally. Using an odd number for NR_SLOTS may also reduce channel conflicts.

I tried 4-way writes with mixed results. The 4-way write version was actually slower than the single-thread-write version, but the former seems to speed up the last few rounds. It makes sense as these rounds are more memory-intensive. I will explore this approach further.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on December 15, 2016, 06:31:30 PM

Not bad zawawa. You still have room to improve ht_store.

Code:

p = slot.ui8

Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.

Code:

p = slot.ui4[0]

Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will be just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips. With NR_SLOTS even, the first write to a given row will always be to an even memory chip. With more slots per row this becomes less significant because the rows don't fill up equally. Using an odd number for NR_SLOTS may also reduce channel conflicts.

Very interesting suggestions. Let me see...

nerdralph

sr. member

Activity: 588

Merit: 251

Not bad zawawa. You still have room to improve ht_store.

Code:

p = slot.ui8

Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.

Code:

p = slot.ui4[0]

Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will be just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips. With NR_SLOTS even, the first write to a given row will always be to an even memory chip. With more slots per row this becomes less significant because the rows don't fill up equally. Using an odd number for NR_SLOTS may also reduce channel conflicts.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: Vetal_inside on December 15, 2016, 04:45:45 PM

Quote from: zawawa on December 15, 2016, 04:14:51 PM

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

This is a memory timings patch. I'm not sure it can be the reason for such a low sol rate.
But in the next few days I will try installing the latest Crimson drivers and reflashing the stock BIOS. We'll see what changes.

Performance does suffer if memory timings are too tight.
In the meantime, I will test the miner with my trusty 7990's...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: krnlx on December 15, 2016, 04:53:09 PM

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG=12...

Code:

#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

What speed are you getting on which card? I'm very curious. You could lower NR_SLOTS by 10 or 20, I think. You can uncomment "#define ENABLE_DEBUG", rebuild the app, and run sa-solver.exe to see how many slots drop out at each round. Too many dropped slots would hurt performance. Adding NR_ROWS_LOG=12 itself is trivial, but there may not be enough shared memory.
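For reference, the two derived sizes in the quoted settings can be worked out by hand. The sketch below just replays the integer arithmetic of the SLOT_CACHE_SIZE and LDS_COLL_SIZE macros as plain C functions; the function names are hypothetical, and the units of the results are whatever the kernel uses for those buffers, which the thread does not spell out.

```c
/* Same integer arithmetic as the SLOT_CACHE_SIZE macro quoted above. */
static unsigned slot_cache_size(unsigned nr_slots, unsigned local_work_size,
                                unsigned threads_per_row)
{
    return nr_slots * (local_work_size / threads_per_row) * 75 / 100;
}

/* Same integer arithmetic as the LDS_COLL_SIZE macro quoted above. */
static unsigned lds_coll_size(unsigned nr_slots, unsigned local_work_size,
                              unsigned threads_per_row)
{
    return nr_slots * (local_work_size / threads_per_row) * 240 / 100;
}

/* Number of hash table rows for a given NR_ROWS_LOG. */
static unsigned nr_rows(unsigned nr_rows_log)
{
    return 1u << nr_rows_log;
}
```

With NR_SLOTS = 240 and LOCAL_WORK_SIZE = THREADS_PER_ROW = 512, this gives a slot cache size of 180 and a collision buffer size of 576. Lowering NR_SLOTS to 220 shrinks them to 165 and 528. NR_ROWS_LOG = 14 means 16384 rows, and NR_ROWS_LOG = 12 would mean 4096.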

laik2

sr. member

Activity: 652

Merit: 266

Quote from: krnlx on December 15, 2016, 04:53:09 PM

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG=12...

Code:

#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

A proper CUDA implementation is required for NV to get above 300 S/s. There are already closed-source CUDA miners from nicehash and EWBF doing ~300 S/s on a 1070. I am waiting on my 1070s to arrive so I can test some CUDA tweaks.

krnlx

full member

Activity: 243

Merit: 105

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG=12...

Code:

#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

Vetal_inside

member

Activity: 78

Merit: 10

Quote from: zawawa on December 15, 2016, 04:14:51 PM

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

This is a memory timings patch. I'm not sure it can be the reason for such a low sol rate.
But in the next few days I will try installing the latest Crimson drivers and reflashing the stock BIOS. We'll see what changes.

krnlx

full member

Activity: 243

Merit: 105

Quote

Total 1094.3 sol/s [dev0 177.9, dev1 176.8, dev2 182.7, dev3 180.9, dev4 185.4, dev5 185.1] 36 shares
Total 1093.9 sol/s [dev0 177.6, dev1 177.4, dev2 181.9, dev3 180.4, dev4 184.8, dev5 185.2] 36 shares
Total 1094.0 sol/s [dev0 177.6, dev1 177.4, dev2 182.0, dev3 180.7, dev4 185.5, dev5 185.4] 38 shares
Total 1093.3 sol/s [dev0 177.5, dev1 176.6, dev2 182.2, dev3 179.8, dev4 186.6, dev5 184.7] 38 shares
Total 1092.8 sol/s [dev0 178.5, dev1 176.9, dev2 181.7, dev3 180.7, dev4 185.8, dev5 184.8] 38 shares
Total 1093.1 sol/s [dev0 177.7, dev1 177.1, dev2 181.4, dev3 180.4, dev4 186.1, dev5 184.0] 40 shares
Total 1093.2 sol/s [dev0 177.1, dev1 177.8, dev2 182.2, dev3 179.9, dev4 186.3, dev5 182.7] 40 shares
Total 1093.5 sol/s [dev0 176.8, dev1 178.0, dev2 182.0, dev3 180.2, dev4 186.5, dev5 182.8] 40 shares

6x 1070 with a little tuning

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: Vetal_inside on December 15, 2016, 02:56:28 PM

R9 280x w/ modded BIOS - 85 s/s with instances=1 and 90-95 s/s with instances=2 (not stable), same as the original SA miner v5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending on memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

Vetal_inside

member

Activity: 78

Merit: 10

R9 280x w/ modded BIOS - 85 s/s with instances=1 and 90-95 s/s with instances=2 (not stable), same as the original SA miner v5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending on memclock

laik2

sr. member

Activity: 652

Merit: 266

Quote from: zawawa on December 15, 2016, 02:04:10 PM

Quote from: Linit on December 15, 2016, 01:59:56 PM

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.

Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Without GCN asm 390s should reach 300S/s at most.
Multialgo miner is sgminer but documentation is hell...until I find some useful values for a card my beard looks like Santa Claus's.

Linit

newbie

Activity: 13

Merit: 0

Quote from: zawawa on December 15, 2016, 02:04:10 PM

Quote from: Linit on December 15, 2016, 01:59:56 PM

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.

Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Excellent...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: Linit on December 15, 2016, 01:59:56 PM

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.

Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.
