
Topic: DiaKGCN kernel for CGMINER + Phoenix 2 (79XX / 78XX / 77XX / GCN) - 2012-05-25 - page 3. (Read 27827 times)

sr. member
Activity: 378
Merit: 250
Right, so not adding uint doesn't cause an offset?
I ran into something with the Phatk2 kernel, which added PreVal0 and PreVal4 as uints that were required, because otherwise the offset caused problems (as far as I can tell, anyway). So I just wanted to be sure.
hero member
Activity: 772
Merit: 500
No, (u) is not a variable; it's only used as a type-cast in front of variables, where u is replaced by uint, uint2, uint4 or uint8 (depending on the vector width in use).

Example:

#define u uint
ulong Test_ulong = 17;
uint Test_uint = (u)Test_ulong; // the preprocessor turns this into: uint Test_uint = (uint)Test_ulong;

Dia
sr. member
Activity: 378
Merit: 250
Hey, look into removing the base and uu uints by using #if GOFFSET=false, as uu and base are not used when GOFFSET is.  They're just two more instructions and register entries.

I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code Smiley. Thanks for sharing your idea!

Dia
Just remember to use #ifndef GOFFSET with the constant that sets base to uu.
I've also added GOFFSET to Phatk2!  *Happy dance*

It's tested and works.

Nice work Smiley, got your version faster on your machine?

Dia
I'm still playing around with it.  I think I'll take a nap and see how well it does.  I'm hoping that it'll work better, but I'm still moving things around.  I'm having trouble keeping Vectors8 from spilling over into memory though.
As for your GCN version, I haven't messed with it as much lately.  I'll get back to it soon.  I'm just wanting to even things out and see what works best with certain ideas.

Also, in your code here:
Code:
#else
#ifdef GOFFSET
u nonce = (uint)get_global_id(0);
#else
u nonce = ((uint)get_group_id(0) * (uint)WORKSIZExVECSIZE) + (uint)get_local_id(0) + base;
Should we add (u) to the GOFFSET nonce?
Code:
#else
#ifdef GOFFSET
u nonce = (uint)get_global_id(0) + (u);
#else
u nonce = ((uint)get_group_id(0) * (uint)WORKSIZExVECSIZE) + (uint)get_local_id(0) + base;
Like this?
hero member
Activity: 772
Merit: 500
Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.
Investigate the imported diakgcn. The only significant change is to the output code, but I get no  shares yet...
https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl

Cool, I'll take a look at it ...

Dia
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.
Investigate the imported diakgcn. The only significant change is to the output code, but I get no  shares yet...
https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl
hero member
Activity: 772
Merit: 500
So how are we going diapolo? Is your kernel ready for me to port it to cgminer Wink

Hey Con,

It's ready to be assimilated Cheesy, the only thing is I really need your help for this. There are some differences in the supplied kernel variables and compiler arguments, which we should take a look at. Another difference is the output buffer, which is currently not compatible with the CGMINER code (but that could be changed rather easily). I added another method of nonce calculation via the OpenCL 1.1 global offset, so a flag or function to detect OpenCL 1.1 would be needed in the CGMINER API.

At the end of the week I should have a bit more time than I have now, but the planning phase can start as soon as you give me a go (and take me by the hand ^^). What would you suggest as a first step?

Dia
hero member
Activity: 772
Merit: 500
If you guys did not see it, there is a new Phoenix 2 beta for which I released a DiaKGCN preview, see here:
https://bitcointalksearch.org/topic/m.734465

Dia
sr. member
Activity: 378
Merit: 250
Nope. There is no reason extremely high (what's extreme, btw? 256?) worksizes would work best. My bet would be that 64 works best. I might not be correct about this, since I lack data, and I lack data because I am too lazy to profile current bitcoin kernels Smiley

The dumps I posted are from the Tahiti architecture; you can see there are no VLIW bundles and no clauses there, and the GCN ISA is clearly different from the VLIW one Smiley Also, there is no reason why 2-component vectors would work best on Tahiti. Why do you think uint2 would work best? I don't think so. It might work well. It might not. Have you run that through SKA or sprofile?

Once again (I am kinda tired of this, so I am not going to restate it anymore): there is no 1:1 mapping between OpenCL vectors and the 79xx's vector ALU units. The 79xx cannot "use more vectors", as "using more vectors" does not mean "using fewer instructions" on GCN hardware. Though frankly, I don't see a reason why I am arguing about that. You are free to profile and benchmark. Again, do profile Smiley


>_<  That's what I'm trying to do.  I'm trying to allow it to use all 16 vectors at once instead of using smaller vectors to achieve the same thing.  In this way, the instructions aren't repeated and the overhead is removed.  But I've seen best results while using 8 vectors and a worksize of 64 or 32.
sr. member
Activity: 378
Merit: 250
You don't recognize Cayman ISA from Tahiti ISA? Smiley

Well, actually this is cross-compiled using the cl_amd_offline_devices extension. It is an AMD extension that lets you compile binary kernels for all hardware supported by the driver. The system I got the dumps and built kernels on is a 6870 one. It does not matter though as the generated binary is the same as the one you would get from clBuildProgram() on 79xx.
Actually, I was talking about Tahiti vs. Cayman, as Tahiti uses GCN and Cayman uses VLIW.  The Tahiti GPU is different from the others in the 7xxx series because it's based on the GCN architecture, which contains four full 16-wide vector units.  From what you've told me, the best settings for the 79xx series cards would be 2 vectors and extremely high worksizes, but that data is based on the lower-end VLIW-based 7xxx cards in the series.  Since it takes 8 cycles to complete a group on VLIW, vectorization seems to be a good option.
The document you posted talks about the Southern Islands cards in general, but not the specifics of the 16-wide vectors.
My argument is that VLIW can use more instructions, but GCN can use more vectors.  The idea I'm trying to convey is to keep the vectors high and the instructions required on them low.  But I can't seem to avoid the darn spillover in the registers.  >_<
sr. member
Activity: 256
Merit: 250
I'm telling you again, you've got that wrong. The vector ALU unit on GCN is not meant to map 1:1 to OpenCL's vectors. The GCN architecture is scalar in nature. The purpose of the vector ALU units is to handle ALU operations that are performed per work-item rather than per workgroup. Vector ALU operations take 4 cycles to execute, compared to 1 cycle on the scalar unit. There might be an advantage to vectorization in some cases, but that's not because the vector unit behaves as a 16-wide SIMD unit (which is wrong, btw). The vector unit "appears" to operate as a SIMD one, but that comes at the price of instruction latency.

There is now a section on GCN architecture on the official APP SDK documentation:

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Along with everything else, it is clearly stated there:

Quote
Notes –
•   Vectorization is no longer needed, nor desirable. If your code used to
    combine work-items in order to get better VLIW use, this is no longer
    required.


Anyway, this can be easily demonstrated. Here is a very simple OpenCL kernel that shifts a kernel argument and writes it into an output buffer. This is with uint16 vectors:


Code:
__kernel void test(uint16 in, __global uint16 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}

Here is the ISA dump:

Code:
  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dwordx4  s[20:23], s[12:15], 0x04           // 0000000C: C28A0D04
  s_buffer_load_dwordx4  s[24:27], s[12:15], 0x08           // 00000010: C28C0D08
  s_buffer_load_dwordx4  s[28:31], s[12:15], 0x0c           // 00000014: C28E0D0C
  s_buffer_load_dword  s2, s[12:15], 0x10                   // 00000018: C2010D10
  s_waitcnt     lgkmcnt(0)                                  // 0000001C: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000020: 93000010
  s_add_i32     s0, s0, s1                                  // 00000024: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 00000028: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 0000002C: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000030: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000034: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 00000038: 9008820B
  v_lshlrev_b32  v0, 6, v0                                  // 0000003C: 34000086
  v_add_i32     v0, vcc, s2, v0                             // 00000040: 4A000002
  v_mov_b32     v1, s0                                      // 00000044: 7E020200
  v_mov_b32     v2, s1                                      // 00000048: 7E040201
  v_mov_b32     v3, s3                                      // 0000004C: 7E060203
  v_mov_b32     v4, s8                                      // 00000050: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBF71000 80010100
  s_lshr_b32    s0, s28, 2                                  // 0000005C: 9000821C
  s_lshr_b32    s1, s29, 2                                  // 00000060: 9001821D
  s_lshr_b32    s2, s30, 2                                  // 00000064: 9002821E
  s_lshr_b32    s3, s31, 2                                  // 00000068: 9003821F
  s_waitcnt     expcnt(0)                                   // 0000006C: BF8C1F0F
  v_mov_b32     v1, s0                                      // 00000070: 7E020200
  v_mov_b32     v2, s1                                      // 00000074: 7E040201
  v_mov_b32     v3, s2                                      // 00000078: 7E060202
  v_mov_b32     v4, s3                                      // 0000007C: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:48 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000080: EBF71030 80010100
  s_lshr_b32    s0, s24, 2                                  // 00000088: 90008218
  s_lshr_b32    s1, s25, 2                                  // 0000008C: 90018219
  s_lshr_b32    s2, s26, 2                                  // 00000090: 9002821A
  s_lshr_b32    s3, s27, 2                                  // 00000094: 9003821B
  s_waitcnt     expcnt(0)                                   // 00000098: BF8C1F0F
  v_mov_b32     v1, s0                                      // 0000009C: 7E020200
  v_mov_b32     v2, s1                                      // 000000A0: 7E040201
  v_mov_b32     v3, s2                                      // 000000A4: 7E060202
  v_mov_b32     v4, s3                                      // 000000A8: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:32 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000AC: EBF71020 80010100
  s_lshr_b32    s0, s20, 2                                  // 000000B4: 90008214
  s_lshr_b32    s1, s21, 2                                  // 000000B8: 90018215
  s_lshr_b32    s2, s22, 2                                  // 000000BC: 90028216
  s_lshr_b32    s3, s23, 2                                  // 000000C0: 90038217
  s_waitcnt     expcnt(0)                                   // 000000C4: BF8C1F0F
  v_mov_b32     v1, s0                                      // 000000C8: 7E020200
  v_mov_b32     v2, s1                                      // 000000CC: 7E040201
  v_mov_b32     v3, s2                                      // 000000D0: 7E060202
  v_mov_b32     v4, s3                                      // 000000D4: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000D8: EBF71010 80010100
  s_endpgm                                                  // 000000E0: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 228;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 32;



And here is the uint4 version:


Code:
__kernel void test(uint4 in, __global uint4 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}


Code:
  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dword  s2, s[12:15], 0x04                   // 0000000C: C2010D04
  s_waitcnt     lgkmcnt(0)                                  // 00000010: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000014: 93000010
  s_add_i32     s0, s0, s1                                  // 00000018: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 0000001C: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 00000020: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000024: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000028: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 0000002C: 9008820B
  v_lshlrev_b32  v0, 4, v0                                  // 00000030: 34000084
  v_add_i32     v0, vcc, s2, v0                             // 00000034: 4A000002
  v_mov_b32     v1, s0                                      // 00000038: 7E020200
  v_mov_b32     v2, s1                                      // 0000003C: 7E040201
  v_mov_b32     v3, s3                                      // 00000040: 7E060203
  v_mov_b32     v4, s8                                      // 00000044: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000048: EBF71000 80010100
  s_endpgm                                                  // 00000050: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 84;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 18;



As you can see, the IL-to-ISA backend did not even bother to map the vector operations to the vector unit; it used the scalar unit exclusively. The first version does 16 scalar shifts and wastes 32 SGPRs, the second one does 4 scalar shifts and wastes 18 SGPRs.

Now, before you say "why is it behaving like that": there are several reasons for this. Once again, stop thinking about OpenCL as something that should map 1:1 to hardware. OpenCL is a high-level API. Even with IL, you don't have that control; you cannot directly influence how the backend is going to map your code onto the hardware.

As for your improved results, I would advise you to have a look at the Python host code and/or the share rate as reported by the pool you are using. It is likely that the progress indicator is not reporting the correct speed for some reason (wrong NDRange calculation, wrong divisor or something like that). I've done those experiments in the past with my own and others' bitcoin kernels, and in all cases the kernel performance dropped abruptly with vectorization above 4 (due to reduced occupancy).
sr. member
Activity: 378
Merit: 250
There is no "native 16-component vectors support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly on hardware. Furthermore, hardware is not SIMD (GCN's vector ALU units are more like SIMD, but they are _not_ 16-wide nevertheless). It would be rather naive and easy if vector operations were directly mapped to hardware capabilities but it's not the case. You could for example imagine the VLIW4 or VLIW5 architecture operating as 4-wide or 5-wide SIMD unit and that sounds pretty logical, but that does not happen in reality.

To emulate 16-component vectors, VLIW bundles are generated in a way that 16 ALU operations are being performed rather than say 4. Which means that if one or two VLIW bundles were generated for 4-wide vector ALU operation, 4 or more bundles would be generated for a 16-wide vector ALU operation. The only benefit of doing this is tighter ALUPacking which is not very relevant on 6xxx. In most cases though, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.

Unfortunately though, wider vectors mean more GPRs wasted and more GPRs wasted mean less wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table on the AMD APP SDK programming guide concerning the correlation of GPRs used to wavefronts/CU.


There are some cases where uint16 might in fact improve performance - like simple kernels that execute fast and time is wasted on kernel scheduling and data transfers - in that case using uint16 means more work per kernel invocation and the overall effect is better when you weight it against increased GPR usage. Bitcoin kernels though are not such a case.
Alright, but when it comes to 8 vectors, you can't argue with results.  I've posted the table of gains with the VLIW5 hardware I use.  And please read the papers on GCN again (assuming you read them once), as it's clearly stated that, "Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront.  This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN."
Now, as I was saying: since the SIMDs are 16 wide and there are 4 of them, each SIMD could be loaded with 16 vectors, which would allow the calculations to run on all of them without wasting any clock cycles.  Four 16-wide vector loads at once sounds pretty good to me.
Cayman takes 4 clock cycles because SPs are being used; GCN handles them in one.  You do the math.
Now, I don't know why Dia's been getting lower hash rates with 8 vectors when there are more ALUs to handle them.  But I have the HD5450, and I get the highest hash rate using 8 vectors and a worksize of 64.  You can find my results on the previous page.  Oddly enough, that's on VLIW5, which isn't 8 wide.  8-wide would use half the ALUs of a single SIMD on GCN.  So what gives?

BTW, I'm talking about the 79xx series.