Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 155.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Alright, I'm getting 250 sol/s with stock RX 480 on Ubuntu.
This should be good enough as a starter.
I'm getting a pretty good hang of the GCN inline assembly, so the rest of GG's development should be a pretty smooth ride.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

While I was trying to find a way to access the entire GDS, I realized I don't need that much memory for row counters.
All I have to do is to squeeze four 10-bit row counters into a 32-bit integer. Let's see...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on March 12, 2017, 11:42:14 AM

Quote from: zawawa on March 11, 2017, 11:47:58 PM

Bingo!

https://github.com/torvalds/linux/search?utf8=%E2%9C%93&q=cs_partition_size

Looks promising, however I find you really need to dig through the code and experiment to see what does and does not work. I don't consider myself a kernel module developer, so you might already know more about this than I do. With closed-source drivers like AMDGPU-Pro, it's hard to figure out which parts of the kernel drm API are implemented, and even if they are implemented whether they work. For example the 16.40 drivers implements the powerplay function force_clock_level(), but it only seems to support type PP_SCLK.

You are absolutely right about that. Thanks a lot for sticking around!

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on March 11, 2017, 11:47:58 PM

Bingo!

https://github.com/torvalds/linux/search?utf8=%E2%9C%93&q=cs_partition_size

Looks promising, however I find you really need to dig through the code and experiment to see what does and does not work. I don't consider myself a kernel module developer, so you might already know more about this than I do. With closed-source drivers like AMDGPU-Pro, it's hard to figure out which parts of the kernel drm API are implemented, and even if they are implemented whether they work. For example the 16.40 drivers implements the powerplay function force_clock_level(), but it only seems to support type PP_SCLK.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Looks good to me... I just need to set these registers, no?

Code:

#define mmGDS_VMID0_BASE                                                        0x3300
#define mmGDS_VMID1_BASE                                                        0x3302
#define mmGDS_VMID2_BASE                                                        0x3304
#define mmGDS_VMID3_BASE                                                        0x3306
#define mmGDS_VMID4_BASE                                                        0x3308
#define mmGDS_VMID5_BASE                                                        0x330a
#define mmGDS_VMID6_BASE                                                        0x330c
#define mmGDS_VMID7_BASE                                                        0x330e
#define mmGDS_VMID8_BASE                                                        0x3310
#define mmGDS_VMID9_BASE                                                        0x3312
#define mmGDS_VMID10_BASE                                                       0x3314
#define mmGDS_VMID11_BASE                                                       0x3316
#define mmGDS_VMID12_BASE                                                       0x3318
#define mmGDS_VMID13_BASE                                                       0x331a
#define mmGDS_VMID14_BASE                                                       0x331c
#define mmGDS_VMID15_BASE                                                       0x331e
#define mmGDS_VMID0_SIZE                                                        0x3301
#define mmGDS_VMID1_SIZE                                                        0x3303
#define mmGDS_VMID2_SIZE                                                        0x3305
#define mmGDS_VMID3_SIZE                                                        0x3307
#define mmGDS_VMID4_SIZE                                                        0x3309
#define mmGDS_VMID5_SIZE                                                        0x330b
#define mmGDS_VMID6_SIZE                                                        0x330d
#define mmGDS_VMID7_SIZE                                                        0x330f
#define mmGDS_VMID8_SIZE                                                        0x3311
#define mmGDS_VMID9_SIZE                                                        0x3313

https://github.com/torvalds/linux/blob/5924bbecd0267d87c24110cbe2041b5075173a25/drivers/gpu/drm/amd/include/asic_reg/gca/gfx_7_0_d.h

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Bingo!

https://github.com/torvalds/linux/search?utf8=%E2%9C%93&q=cs_partition_size

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I just found out that you can directly send commands to the GPU without the root privileges by using the DRM render node:

https://en.wikipedia.org/wiki/Direct_Rendering_Manager#Render_nodes

I think nirvana is *pretty* close...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on March 11, 2017, 11:57:10 AM

Quote from: zawawa on March 11, 2017, 11:41:00 AM

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.

Actually, GCN1 supports ds_swizzle. It is ds_permute and ds_bpermute that were newly introduced with GCN3.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on March 11, 2017, 11:41:00 AM

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.
I am trying to merge these two inline assembly sections so that I could squeeze conditionals in between ds_swizzle_b32's.

Code:

    const int swap_data = (get_local_id(0) & 0x1);
    __global uint4 *second_p = p + 1;
    uint4 second_ui4 = slot.ui4[1];
    __asm(// See: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
          "ds_swizzle_b32 %0.x, %2.x offset:0x041f\n"
          "ds_swizzle_b32 %0.y, %2.y offset:0x041f\n"
          "ds_swizzle_b32 %1.x, %3.x offset:0x041f\n"
          "ds_swizzle_b32 %1.y, %3.y offset:0x041f\n"
          "ds_swizzle_b32 %1.z, %3.z offset:0x041f\n"
          "ds_swizzle_b32 %1.w, %3.w offset:0x041f\n"
          "s_waitcnt lgkmcnt(0)\n"
          : "=v" (second_p), 
            "=v" (second_ui4)
          : "0" (second_p), 
            "1" (second_ui4)
          : "memory");
    
    __asm("flat_store_dwordx4 %0, %2\n"
          "flat_store_dwordx4 %1, %3\n"
          : 
          : "v" (swap_data ? second_p    : p), 
            "v" (swap_data ? p           : second_p), 
            "v" (swap_data ? second_ui4  : slot.ui4[0]),
            "v" (swap_data ? slot.ui4[0] : second_ui4)
          : "memory");

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

GDS counters and a new implementation of parallel writes are working on RX 480 now.
I just need to optimize them further at this point.
I wish I could do miner development full-time.
It's so much fun and engaging. Oh well.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: marvykkio on March 09, 2017, 03:45:57 PM

more or less you know when to release your miner?

I will release it when I'm satisfied with the performance.
Hopefully within the next few days.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: eXtremal on March 10, 2017, 01:49:48 PM

This patch gives +5-6% on NVidia GTX10xx cards:

Quote

--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;

__global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
{
- return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+ // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+ const uint RowFragmentLog = 4;
+ const uint SlotsInRow = 1 << RowFragmentLog;
+ const uint SlotMask = (1 << RowFragmentLog) - 1;
+ return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
}

Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.

I tried several variations of this patch on RX480, but they didn't work...
Thanks for the patch anyway. I would love to work on optimizations for NVIDIA, but first thing first, you know.

eXtremal

sr. member

Activity: 2106

Merit: 282

👉bit.ly/3QXp3oh | 🔥 Ultimate Launc

This patch gives +5-6% on NVidia GTX10xx cards:

Quote

--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;

__global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
{
- return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+ // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+ const uint RowFragmentLog = 4;
+ const uint SlotsInRow = 1 << RowFragmentLog;
+ const uint SlotMask = (1 << RowFragmentLog) - 1;
+ return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
}

Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on March 10, 2017, 08:19:28 AM

Quote from: zawawa on March 10, 2017, 01:45:25 AM

Quote from: nerdralph on March 09, 2017, 06:59:21 PM

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.

I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance? Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS. Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900. It gets 279 sols (optiminer is 268 on the same box).

Oh, I see. I thought the Linux version of his miner was still faster. My bad.

By the way, GG is already running faster even with a fairly limited amount of the GDS.
We will see...

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on March 10, 2017, 01:45:25 AM

Quote from: nerdralph on March 09, 2017, 06:59:21 PM

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.

I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance? Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS. Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900. It gets 279 sols (optiminer is 268 on the same box).

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on March 09, 2017, 06:59:21 PM

Quote from: zawawa on March 09, 2017, 04:38:31 PM

Code:

$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised. They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.

Well, it's AMD... What else can I say? I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on March 09, 2017, 04:38:31 PM

Code:

$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised. They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Code:

$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

marvykkio

hero member

Activity: 798

Merit: 1000

more or less you know when to release your miner?

Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 155. (Read 214458 times)