Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 177. (Read 214456 times)

sr. member
Activity: 588
Merit: 251
I suspect the driver initializes M0 when gds_segment_byte_size is set in the kernel configuration.

I assumed that the GDS base/size combination would be stored in one of the SGPRs, just like in the OpenCL 1.2 ABI, but you may be right. I will check it right now.

Nope, no luck. O GDS, where art thou?

I've been meaning to look into why optiminer requires GPU_FORCE_64BIT_PTR=1.  Perhaps some quirk of the driver that GDS only works in 64-bit mode?
sr. member
Activity: 728
Merit: 304
Miner Developer
Well, in the worst case, I can still continue developing the assembly version on the 7990 while trying to figure out how to access GDS on newer cards.
sr. member
Activity: 728
Merit: 304
Miner Developer
I suspect the driver initializes M0 when gds_segment_byte_size is set in the kernel configuration.

I assumed that the GDS base/size combination would be stored in one of the SGPRs, just like in the OpenCL 1.2 ABI, but you may be right. I will check it right now.

Nope, no luck. O GDS, where art thou?
sr. member
Activity: 728
Merit: 304
Miner Developer
I suspect the driver initializes M0 when gds_segment_byte_size is set in the kernel configuration.

I assumed that the GDS base/size combination would be stored in one of the SGPRs, just like in the OpenCL 1.2 ABI, but you may be right. I will check it right now.
sr. member
Activity: 588
Merit: 251
Do I need to initialize GDS before actually using it?
These instructions are documented nowhere.

Code:
DS_CONSUME
DS_APPEND
DS_ORDERED_COUNT

nerdralph, do you have any ideas?

I suspect the driver initializes M0 when gds_segment_byte_size is set in the kernel configuration.  If you look in the GCN ISA docs, it says M0 has 16 bits for offset and 16 bits for size.  M0 is also used for LDS, so when you use both in your code you'll need to save it to another register.

I hadn't looked at the DS_ instructions you refer to, and a quick look at the ISA confirms your observation about them having no documentation.  The llvm source would at least have the instruction encoding.

I'm not sure why you want to use those instructions though.  For the global row counters I'd use ds_add_u32 with the GDS bit set.

p.s. the M0 description is in s. 3.7 of the GCN ISA docs.
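
To make the M0 description above concrete, here is a minimal C sketch of packing a GDS offset and size into one 32-bit value. The helper names are hypothetical, and the bit placement (offset in the upper 16 bits, size in the lower 16 bits) is an assumption that should be checked against s. 3.7 of the GCN ISA docs; the post only says that each field is 16 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a GDS offset and size (both in bytes) into a
 * 32-bit M0 value. ASSUMPTION: offset in bits 31:16, size in bits 15:0.
 * Check s. 3.7 of the GCN ISA docs before relying on this layout. */
static uint32_t pack_m0_gds(uint32_t base_bytes, uint32_t size_bytes)
{
    assert(base_bytes <= 0xFFFFu); /* each field is only 16 bits wide */
    assert(size_bytes <= 0xFFFFu);
    return (base_bytes << 16) | size_bytes;
}

/* Hypothetical accessors for the two assumed fields. */
static uint32_t m0_gds_base(uint32_t m0) { return m0 >> 16; }
static uint32_t m0_gds_size(uint32_t m0) { return m0 & 0xFFFFu; }
```

Because M0 is also used for LDS, a kernel that touches both would have to copy a value like this into a spare register and restore it around each group of GDS accesses.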

sr. member
Activity: 728
Merit: 304
Miner Developer
There you go!

Quote
2.9 Misc/Data Transfer Packets
2.9.1 ALLOC_GDS
The packet will allocate a new segment within its corresponding GDS partition. The corresponding partition is
determined from the Ring to which the packet is submitted. The microcode will first wait until the active partition
count equals zero before continuing. This guarantees that the entire contents of the previous allocated segment have
been dumped to memory before allocating the new segment within the current partition. It will also check if the
segment size is less than partition size and interrupt if the current segment does not fit into its specified partition
sr. member
Activity: 728
Merit: 304
Miner Developer
I heard that the RX 480 has OpenCL 2.0. Would there be any benefit to using the 2.0 ABI?

The OpenCL 2.0 ABI does not make any difference. I might have to bypass the driver and send raw packets directly to the GPU to enable GDS. This is crazy.

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/10/si_programming_guide_v2.pdf
https://github.com/fail0verflow/radeon-tools/blob/master/f32/f32dis.py
sr. member
Activity: 273
Merit: 250
BD People Are Legend
I heard that the RX 480 has OpenCL 2.0. Would there be any benefit to using the 2.0 ABI?
sr. member
Activity: 728
Merit: 304
Miner Developer
Hmm... It seems that GDS is not activated for some reason.
What to do, what to do...
sr. member
Activity: 728
Merit: 304
Miner Developer
Do I need to initialize GDS before actually using it?
These instructions are documented nowhere.

Code:
DS_CONSUME
DS_APPEND
DS_ORDERED_COUNT

nerdralph, do you have any ideas?
sr. member
Activity: 728
Merit: 304
Miner Developer

The possibility of using inline asm for GDS access with the rest of the kernel in straight OpenCL looks promising to me...


That would be really nice, but I need a solution that works right now.
I had to jump through another hoop and turn on the "enable_ordered_append_gds" bit, but I finally located where the GDS base is stored. I am getting really close!
sr. member
Activity: 588
Merit: 251

I didn't need the OpenCL 1.2 ABI or HSAIL after all.
This most likely means I should be able to catch up with Optiminer.
Good stuff, good stuff.

You wouldn't have had much luck with HSAIL anyway; I'm pretty sure I already mentioned there are no GDS instructions in HSAIL.


Really? I don't recall that... The ROCm ABI does expose GDS, though. I will double-check.

I confirmed it with one of the AMD devs working on LLVM.  He said there were plans for a GCN extension, but it never got implemented in the HSAIL LLVM backend since they are now focused on the AMDGPU backend.
ROCm also now supports OpenCL kernels.
https://www.khronos.org/news/permalink/rocm-1.4-has-support-for-opencl-1.2-host-code-and-2.0-kernels

The possibility of using inline asm for GDS access with the rest of the kernel in straight OpenCL looks promising to me...
sr. member
Activity: 588
Merit: 251
I added a new pseudo-op for Global Data Share (GDS) to CLRadeonExtender:

https://github.com/CLRX/CLRX-mirror/pull/11

It will be so much fun if we can freely exploit this killer feature at last...

Nice.  With this change there should be no more need to explicitly initialize M0 (except maybe for GCN1 devices, since they only have OpenCL 1.2 driver support).
sr. member
Activity: 450
Merit: 255
Sounds interesting, I'm anxious to see what you find out.
sr. member
Activity: 728
Merit: 304
Miner Developer

I didn't need the OpenCL 1.2 ABI or HSAIL after all.
This most likely means I should be able to catch up with Optiminer.
Good stuff, good stuff.

You wouldn't have had much luck with HSAIL anyway; I'm pretty sure I already mentioned there are no GDS instructions in HSAIL.


Really? I don't recall that... The ROCm ABI does expose GDS, though. I will double-check.
sr. member
Activity: 728
Merit: 304
Miner Developer
I added a new pseudo-op for Global Data Share (GDS) to CLRadeonExtender:

https://github.com/CLRX/CLRX-mirror/pull/11

It will be so much fun if we can freely exploit this killer feature at last...
sr. member
Activity: 728
Merit: 304
Miner Developer
The miner is running stably with 2 threads with a 32KB GDS segment each. Very cool...
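
As a rough illustration of that layout, the C sketch below computes where each thread's segment would start if the segments were placed back to back from offset 0. That placement, the helper names, and the 16-bit limit on the base (taken from the M0 description earlier in the thread) are assumptions; the post only states that there are 2 threads with 32KB each.

```c
#include <assert.h>
#include <stdint.h>

#define GDS_SEGMENT_BYTES (32u * 1024u) /* from the post: 32KB per thread */

/* Hypothetical helper: byte offset of the segment owned by a miner thread,
 * ASSUMING segments are laid out back to back starting at offset 0. */
static uint32_t gds_segment_base(uint32_t thread_index)
{
    uint32_t base = thread_index * GDS_SEGMENT_BYTES;
    assert(base <= 0xFFFFu); /* assumed 16-bit offset field, as in M0 */
    return base;
}

/* Total GDS bytes needed for the given number of threads. */
static uint32_t gds_total_bytes(uint32_t thread_count)
{
    return thread_count * GDS_SEGMENT_BYTES;
}
```

Under these assumptions the two segments start at byte offsets 0 and 32768 and together occupy 65536 bytes.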
sr. member
Activity: 728
Merit: 304
Miner Developer
@zawawa Just in case, I'll ask: Are you working on GPUs other than RX4xx? I ask because that's the only GPU that anyone has even mentioned in this thread. How about R9 Fury/Nano, for instance? 290x? Etc..? In any case, thank you for all the effort you've given to this!


I am currently focusing on RX 480, but I am planning to work on other cards once I'm done with it.
sr. member
Activity: 305
Merit: 250
@zawawa Just in case, I'll ask: Are you working on GPUs other than RX4xx? I ask because that's the only GPU that anyone has even mentioned in this thread. How about R9 Fury/Nano, for instance? 290x? Etc..? In any case, thank you for all the effort you've given to this!


I use a 390X for ETH.
full member
Activity: 150
Merit: 100
@zawawa Just in case, I'll ask: Are you working on GPUs other than RX4xx? I ask because that's the only GPU that anyone has even mentioned in this thread. How about R9 Fury/Nano, for instance? 290x? Etc..? In any case, thank you for all the effort you've given to this!