Author

Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 155. (Read 214431 times)

sr. member
Activity: 728
Merit: 304
Miner Developer
Alright, I'm getting 250 sol/s with stock RX 480 on Ubuntu.
This should be good enough as a starter.
I'm getting a pretty good hang of the GCN inline assembly, so the rest of GG's development should be a pretty smooth ride.
sr. member
Activity: 728
Merit: 304
Miner Developer
While I was trying to find a way to access the entire GDS, I realized I don't need that much memory for row counters.
All I have to do is to squeeze four 10-bit row counters into a 32-bit integer. Let's see...
sr. member
Activity: 728
Merit: 304
Miner Developer

Looks promising, however I find you really need to dig through the code and experiment to see what does and does not work.  I don't consider myself a kernel module developer, so you might already know more about this than I do.  With closed-source drivers like AMDGPU-Pro, it's hard to figure out which parts of the kernel drm API are implemented, and even if they are implemented whether they work.  For example the 16.40 drivers implements the powerplay function force_clock_level(), but it only seems to support type PP_SCLK.


You are absolutely right about that. Thanks a lot for sticking around!
sr. member
Activity: 588
Merit: 251

Looks promising, however I find you really need to dig through the code and experiment to see what does and does not work.  I don't consider myself a kernel module developer, so you might already know more about this than I do.  With closed-source drivers like AMDGPU-Pro, it's hard to figure out which parts of the kernel drm API are implemented, and even if they are implemented whether they work.  For example the 16.40 drivers implements the powerplay function force_clock_level(), but it only seems to support type PP_SCLK.
sr. member
Activity: 728
Merit: 304
Miner Developer
Looks good to me... I just need to set these registers, no?

Code:
#define mmGDS_VMID0_BASE                                                        0x3300
#define mmGDS_VMID1_BASE                                                        0x3302
#define mmGDS_VMID2_BASE                                                        0x3304
#define mmGDS_VMID3_BASE                                                        0x3306
#define mmGDS_VMID4_BASE                                                        0x3308
#define mmGDS_VMID5_BASE                                                        0x330a
#define mmGDS_VMID6_BASE                                                        0x330c
#define mmGDS_VMID7_BASE                                                        0x330e
#define mmGDS_VMID8_BASE                                                        0x3310
#define mmGDS_VMID9_BASE                                                        0x3312
#define mmGDS_VMID10_BASE                                                       0x3314
#define mmGDS_VMID11_BASE                                                       0x3316
#define mmGDS_VMID12_BASE                                                       0x3318
#define mmGDS_VMID13_BASE                                                       0x331a
#define mmGDS_VMID14_BASE                                                       0x331c
#define mmGDS_VMID15_BASE                                                       0x331e
#define mmGDS_VMID0_SIZE                                                        0x3301
#define mmGDS_VMID1_SIZE                                                        0x3303
#define mmGDS_VMID2_SIZE                                                        0x3305
#define mmGDS_VMID3_SIZE                                                        0x3307
#define mmGDS_VMID4_SIZE                                                        0x3309
#define mmGDS_VMID5_SIZE                                                        0x330b
#define mmGDS_VMID6_SIZE                                                        0x330d
#define mmGDS_VMID7_SIZE                                                        0x330f
#define mmGDS_VMID8_SIZE                                                        0x3311
#define mmGDS_VMID9_SIZE                                                        0x3313
https://github.com/torvalds/linux/blob/5924bbecd0267d87c24110cbe2041b5075173a25/drivers/gpu/drm/amd/include/asic_reg/gca/gfx_7_0_d.h
sr. member
Activity: 728
Merit: 304
Miner Developer
I just found out that you can directly send commands to the GPU without the root privileges by using the DRM render node:

https://en.wikipedia.org/wiki/Direct_Rendering_Manager#Render_nodes

I think nirvana is *pretty* close...
sr. member
Activity: 728
Merit: 304
Miner Developer
This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.


Actually, GCN1 supports ds_swizzle. It is ds_permute and ds_bpermute that were newly introduced with GCN3.
sr. member
Activity: 588
Merit: 251
This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.
sr. member
Activity: 728
Merit: 304
Miner Developer
This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.
I am trying to merge these two inline assembly sections so that I could squeeze conditionals in between ds_swizzle_b32's.

Code:
    const int swap_data = (get_local_id(0) & 0x1);
    __global uint4 *second_p = p + 1;
    uint4 second_ui4 = slot.ui4[1];
    __asm(// See: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
          "ds_swizzle_b32 %0.x, %2.x offset:0x041f\n"
          "ds_swizzle_b32 %0.y, %2.y offset:0x041f\n"
          "ds_swizzle_b32 %1.x, %3.x offset:0x041f\n"
          "ds_swizzle_b32 %1.y, %3.y offset:0x041f\n"
          "ds_swizzle_b32 %1.z, %3.z offset:0x041f\n"
          "ds_swizzle_b32 %1.w, %3.w offset:0x041f\n"
          "s_waitcnt lgkmcnt(0)\n"
          : "=v" (second_p),
            "=v" (second_ui4)
          : "0" (second_p),
            "1" (second_ui4)
          : "memory");
   
    __asm("flat_store_dwordx4 %0, %2\n"
          "flat_store_dwordx4 %1, %3\n"
          :
          : "v" (swap_data ? second_p    : p),
            "v" (swap_data ? p           : second_p),
            "v" (swap_data ? second_ui4  : slot.ui4[0]),
            "v" (swap_data ? slot.ui4[0] : second_ui4)
          : "memory");
sr. member
Activity: 728
Merit: 304
Miner Developer
GDS counters and a new implementation of parallel writes are working on RX 480 now.
I just need to optimize them further at this point.
I wish I could do miner development full-time.
It's so much fun and engaging. Oh well.
sr. member
Activity: 728
Merit: 304
Miner Developer
more or less you know when to release your miner?

I will release it when I'm satisfied with the performance.
Hopefully within the next few days.
sr. member
Activity: 728
Merit: 304
Miner Developer
This patch gives +5-6% on NVidia GTX10xx cards:
Quote
--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;
 
 __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
 {
-    return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+  // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+  const uint RowFragmentLog = 4;
+  const uint SlotsInRow = 1 << RowFragmentLog;
+  const uint SlotMask = (1 << RowFragmentLog) - 1;
+  return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
 }
Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.

I tried several variations of this patch on RX480, but they didn't work...
Thanks for the patch anyway. I would love to work on optimizations for NVIDIA, but first thing first, you know.
sr. member
Activity: 2106
Merit: 282
👉bit.ly/3QXp3oh | 🔥 Ultimate Launc
This patch gives +5-6% on NVidia GTX10xx cards:
Quote
--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;
 
 __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
 {
-    return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+  // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+  const uint RowFragmentLog = 4;
+  const uint SlotsInRow = 1 << RowFragmentLog;
+  const uint SlotMask = (1 << RowFragmentLog) - 1;
+  return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
 }
Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.
sr. member
Activity: 728
Merit: 304
Miner Developer
I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance?  Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS.  Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900.  It gets 279 sols (optiminer is 268 on the same box).


Oh, I see. I thought the Linux version of his miner was still faster. My bad.

By the way, GG is already running faster even with a fairly limited amount of the GDS.
We will see...
sr. member
Activity: 588
Merit: 251
I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance?  Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS.  Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900.  It gets 279 sols (optiminer is 268 on the same box).
sr. member
Activity: 728
Merit: 304
Miner Developer
Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised.  They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


Well, it's AMD... What else can I say? I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.
sr. member
Activity: 588
Merit: 251
Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised.  They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.
sr. member
Activity: 728
Merit: 304
Miner Developer
Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...
hero member
Activity: 798
Merit: 1000
more or less you know when to release your miner?
Jump to: