DiaKGCN kernel for CGMINER + Phoenix 2 (79XX / 78XX / 77XX / GCN) - 2012-05-25 - page 4.

gat3way

sr. member

Activity: 256

Merit: 250

There is no "native 16-component vectors support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly onto hardware. Furthermore, the hardware is not SIMD (GCN's vector ALU units are more like SIMD, but they are _not_ 16-wide nevertheless). It would be rather naive and easy if vector operations were directly mapped to hardware capabilities, but that's not the case. You could, for example, imagine the VLIW4 or VLIW5 architecture operating as a 4-wide or 5-wide SIMD unit, and that sounds pretty logical, but that does not happen in reality.

To emulate 16-component vectors, VLIW bundles are generated in a way that 16 ALU operations are performed rather than, say, 4. This means that if one or two VLIW bundles were generated for a 4-wide vector ALU operation, 4 or more bundles would be generated for a 16-wide vector ALU operation. The only benefit of doing this is tighter ALUPacking, which is not very relevant on 6xxx. In most cases though, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.

Unfortunately though, wider vectors mean more GPRs used, and more GPRs used mean fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide showing how the number of GPRs used relates to wavefronts/CU.
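To make the GPR/occupancy trade-off concrete, here is a rough sketch of the arithmetic. The two limits are assumptions taken from commonly quoted GCN figures (256 vector registers per SIMD lane, at most 10 resident wavefronts per SIMD); they are not numbers from this thread, the function name is made up for illustration, and real hardware also applies allocation granularity and LDS limits. Check the table in the programming guide for your chip.

```python
# Rough occupancy estimate for one GCN SIMD.
# Both limits below are ASSUMED values, not taken from this thread.
VGPRS_PER_SIMD_LANE = 256  # assumed vector registers available per lane
MAX_WAVES_PER_SIMD = 10    # assumed hardware cap on resident wavefronts


def waves_per_simd(vgprs_per_work_item: int) -> int:
    """Return how many wavefronts fit on one SIMD for a given register count."""
    if vgprs_per_work_item <= 0:
        raise ValueError("register count must be positive")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // vgprs_per_work_item)


if __name__ == "__main__":
    # Wider vectors usually mean more registers per work-item.
    for vgprs in (24, 48, 96, 192):
        print(f"{vgprs:3d} VGPRs per work-item: {waves_per_simd(vgprs)} wavefronts per SIMD")
```

Under these assumptions, doubling the registers a kernel needs roughly halves the number of wavefronts available to hide latency, which is the effect described above.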

There are some cases where uint16 might in fact improve performance, such as simple kernels that execute fast, where time is wasted on kernel scheduling and data transfers. In that case using uint16 means more work per kernel invocation, and the overall effect is better when you weigh it against the increased GPR usage. Bitcoin kernels though are not such a case.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Anyhow, going back to what I was saying, Dia, I think that the best kernel design for GCN is one which will compute four 512-bit integers. Since it can compute one in 4 cycles or 4 in 4 cycles, it seems best to attempt to compute 4 sets of 16 vectors to the fullest extent of the ALUs. Alternatively, you could compute 3 sets and leave the last SIMD for computing other work required by the kernel, such as the nonce and the like. So, multi-threading is brought into play with the GCN processors.

The problem is that these aren't multi-GPUs, these are multi-SIMD GPUs which makes coding a little more tricky.
I might be a little over-zealous to think that these are capable of handling four times the amount of mining at one time, but it seems like the approach to take.
The biggest suggestion I could make, though, is to drop the worksize down to allow for the increased vectors. You should see some improvement with VECTORS8, but I can't promise it.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: Diapolo on February 05, 2012, 11:44:38 AM

Quote from: d3m0n1q_733rz on February 05, 2012, 04:17:36 AM

Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of. I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done. Higher-end cards will, of course, see better results than lower-end ones. I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.

I could implement uint16, should be pretty straightforward, but massive vectorisation is really something GCN does not like currently.

Dia

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/3
I think it might be because the full 16 vectors are loaded and unloaded to make room for anything else that needs to be computed.

In theory, 16 vectors at once is the best approach, but that only applies if we're doing math for only the 16 vectors as that's the maximum the ALUs can hold.
In other words, the moment something else needs to be loaded, it has to pull the entire 512-bit integer from the ALUs to put into the cache, load the data to be computed, unload it, then reload the 512-bit integers. But the GCN is supposedly a true 16 vector design, so I think the problem is the overhead that's created by loading and unloading. With the 8 vectors, did you try the worksize of 64 to see if it was any faster?

PS Bad news--new Phoenix 2 miner. I've suggested they make changes to the phatk2 kernel like you've made for your GCN here. Like adding the GOFFSET option and increasing vector sizes.

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: d3m0n1q_733rz on February 05, 2012, 04:17:36 AM

Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of. I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done. Higher-end cards will, of course, see better results than lower-end ones. I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.

I could implement uint16, should be pretty straightforward, but massive vectorisation is really something GCN does not like currently.

Dia

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of. I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done. Higher-end cards will, of course, see better results than lower-end ones. I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

VECTORS4 WORKSIZE=128 with GOFFSET=false 14.45 Mhash/s
VECTORS4 WORKSIZE=128 without GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 with GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 without GOFFSET=false 14.47 Mhash/s

VECTORS4 WORKSIZE=64 with GOFFSET=false 14.49 Mhash/s
VECTORS4 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=64 with GOFFSET=false 14.55 Mhash/s
VECTORS8 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s

VECTORS4 WORKSIZE=32 with GOFFSET=false 14.46 Mhash/s
VECTORS4 WORKSIZE=32 without GOFFSET=false 14.47 Mhash/s
VECTORS8 WORKSIZE=32 with GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=32 without GOFFSET=false 14.48 Mhash/s
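To put the twelve results above in perspective, here is a small sketch that picks the best setting and measures the spread between best and worst. The figures are copied from the list above; the variable names and the tuple layout (vectors, worksize, whether GOFFSET=false was passed) are just my own choices for illustration.

```python
# Hash rates in Mhash/s, copied from the list above.
# Key: (vector width, worksize, GOFFSET=false passed?)
results = {
    (4, 128, True): 14.45, (4, 128, False): 14.46,
    (8, 128, True): 14.46, (8, 128, False): 14.47,
    (4, 64, True): 14.49,  (4, 64, False): 14.50,
    (8, 64, True): 14.55,  (8, 64, False): 14.50,
    (4, 32, True): 14.46,  (4, 32, False): 14.47,
    (8, 32, True): 14.50,  (8, 32, False): 14.48,
}

best_key = max(results, key=results.get)
worst_key = min(results, key=results.get)

# Relative gap between the fastest and slowest configuration, in percent.
spread_pct = (results[best_key] - results[worst_key]) / results[worst_key] * 100

if __name__ == "__main__":
    print("best :", best_key, results[best_key])
    print("worst:", worst_key, results[worst_key])
    print(f"spread: {spread_pct:.2f}%")
```

The whole range is well under 1%, so several runs per setting would be needed before reading much into any single difference.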

*High fives* Playing around with VECTORS8 has done some good. ^_^ And hardly anyone believed me that using 256-bit integers could pay off.
I'm going to "try" to do something with the nonce code in phatk2 by copying the nonce code from your kernel and see what happens. I really wouldn't have known how to do it without you.

blissfulyoshi

newbie

Activity: 11

Merit: 0

I think that posting style looks nice, I'll copy it.

2.5 (6870 on 11.11)

Code:

VECTORS4 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128

Average: 284Mhash/s

Code:

VECTORS8 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128

Average: 284Mhash/s

2.6 (6870 on 11.12, 50MHz slower on the GPU clock than the one on 2.5)

Code:

VECTORS2 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128

Average: 262Mhash/s

Code:

VECTORS4 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128

Average: 275Mhash/s

Code:

VECTORS8 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128

Average: 268Mhash/s

TurdHurdur

full member

Activity: 216

Merit: 100

Quote from: Diapolo on February 04, 2012, 12:57:43 PM

Would you mind trying the VECTORS8 version and reporting back?

Dia

I'm using Catalyst 12.1, 875/1225 clocks, same manufacturer/model 5870s on Windows 7.

https://bitcointalksearch.org/topic/m.718648 kernel:

Code:

VECTORS4 FASTLOOP=false AGGRESSION=10 WORKSIZE=128 BFI_INT

Max: ~400Mhash/s

Code:

VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Max: ~390Mhash/s

Your new diakgcn kernel:

Code:

VECTORS8 FASTLOOP=false AGGRESSION=10 WORKSIZE=128 BFI_INT

Max: ~354Mhash/s

Code:

VECTORS8 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Max: ~352Mhash/s

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: TurdHurdur on February 04, 2012, 12:36:07 PM

Quote from: d3m0n1q_733rz on February 04, 2012, 11:52:20 AM

What did you get apples to apples? As in using VECTORS4 with phatk2?

Oh, crap, significantly more. Guess I should've tried phatk2 with VECTORS4 before...

Edit: Blah, ignore above postings. I've been editing the kernel files.

Would you mind trying the VECTORS8 version and reporting back?

Dia

Diapolo

hero member

Activity: 772

Merit: 500

download current version:
http://www.filedropper.com/diakgcn04-02-2012

This version features uint8 vector support, which is activated via the VECTORS8 switch. This was beneficial on my VLIW5 6550D, but is pretty slow on GCN. Another switch, GOFFSET, was added, which can be used to disable the automatic usage of the global offset parameter (use GOFFSET=false to disable global offset). Perhaps it's faster for some to use the old way of generating the nonces in the kernel, so play around with it.
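For anyone wondering what the two nonce schemes amount to, here is a host-side simulation in Python. It is only a sketch and not the actual kernel code: the function names, the `base // vectors` offset and the lane layout are assumptions made for illustration. The idea is that with a global offset the work-item ID already carries the starting nonce, while the old way adds a `base` argument inside the kernel.

```python
def nonces_with_global_offset(base: int, work_items: int, vectors: int) -> list[int]:
    """Simulate nonce generation when the work-item ID already includes an offset."""
    offset = base // vectors  # hypothetical value the host would pass as the global offset
    out = []
    for i in range(work_items):
        gid = offset + i  # stands in for get_global_id(0) with an offset applied
        out.extend(gid * vectors + lane for lane in range(vectors))
    return out


def nonces_in_kernel(base: int, work_items: int, vectors: int) -> list[int]:
    """Simulate the old way: IDs start at 0 and the kernel adds base itself."""
    out = []
    for gid in range(work_items):
        out.extend(base + gid * vectors + lane for lane in range(vectors))
    return out


if __name__ == "__main__":
    a = nonces_with_global_offset(1024, 4, 4)
    b = nonces_in_kernel(1024, 4, 4)
    print(a == b, a[:8])
```

In this simplified model both schemes cover the same nonces as long as `base` is a multiple of the vector width; the difference is only where the addition happens, which is why the speed impact has to be measured per card.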

Dia

TurdHurdur

full member

Activity: 216

Merit: 100

Quote from: d3m0n1q_733rz on February 04, 2012, 11:52:20 AM

What did you get apples to apples? As in using VECTORS4 with phatk2?

Oh, crap, significantly more. Guess I should've tried phatk2 with VECTORS4 before...

Edit: Blah, ignore above postings. I've been editing the kernel files.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: TurdHurdur on February 04, 2012, 11:32:12 AM

I get ~10Mhash/s more on my 5870 using:

Code:

VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

with this kernel for regular desktop usage.
Compared to my old phatk2 line:

Code:

VECTORS AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Though, this kernel doesn't seem to help my higher-aggression-set card(also a 5870) in my crossfire setup compared to your 2011-12-21 phatk_dia.

What did you get apples to apples? As in using VECTORS4 with phatk2?

TurdHurdur

full member

Activity: 216

Merit: 100

I get ~10Mhash/s more on my 5870 using:

Code:

VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

with this kernel for regular desktop usage.
Compared to my old phatk2 line:

Code:

VECTORS AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Though, this kernel doesn't seem to help my higher-aggression-set card(also a 5870) in my crossfire setup compared to your 2011-12-21 phatk_dia.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Also, if I may, it doesn't look like uu needs to be set for GOFFSET as base doesn't appear to even be used. I'm guessing that was your intention in the first place.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Just tried your newer kernel and just about crapped myself. I'm seeing some very competitive numbers with phatk2 now and I love the verbosity. I see you decided to move the (u) values from the bitselect. Did that help to speed things along? I figured that if BFI_INT didn't have them, there was a major difference in something and one of them had to be slower.
I like how you used the global offset to your advantage. (GOFFSET)
I am impressed. You've been busy and I can see why. If I was capable of hashing faster, I would totally send you some coin for your efforts. Given a few months, I should.

blissfulyoshi

newbie

Activity: 11

Merit: 0

More testing!!!!!

2.5
DiakGCN VECTORS WORKSIZE=128: 247MHps
DiakGCN VECTORS2 WORKSIZE=128: 280-281MHps
DiakGCN VECTORS4 WORKSIZE=128: 284MHps....It looks like your old Phoenix kernel is finally beaten for me. Now just need to surpass my cgminer scores of 290MHps xD

CPU at 25-30% on my C2D.

Increases all across the board. Congratz!

Diapolo

hero member

Activity: 772

Merit: 500

download current version:
http://www.filedropper.com/diakgcn03-02-2012_1

This release fixes the bugged VECTORS4 code, which works again (tested on 7970 and 6550D) and could speed things up for VLIW4 / VLIW5 GPUs with WORKSIZE=128, just try it. There are no further changes for GCN in conjunction with VECTORS2 since 03-02-2012.

Dia

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: sveetsnelda on February 02, 2012, 05:42:27 PM

I tried it, and I'm sorry that I haven't reported back. Work has been chaotic.

The only way that I can get a similar hashrate compared to DiabloMiner with this kernel is to use a very high intensity (greater than 10). By doing this though, CPU usage skyrockets and I burn up more wattage than the hashrate increase is worth. I can make a few changes to the Poclbm kernel included with CGMiner though and get 96 percent of the performance of DiabloMiner while leaving the intensity at 9. By using CGMiner, I am able to use backup pools, RPC, thermal controls, etc. This more than makes up for the ~4 percent loss in performance. I'm not at home right now to look at every change, but defining the Ch and Ma functions to use Bitselect is basically all that was needed.

I'll try to send you a PM tonight with more details.
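The bitselect trick mentioned in the quote can be checked outside of OpenCL. The sketch below models OpenCL's `bitselect(a, b, c)` (take bits from `b` where `c` is 1, otherwise from `a`) on 32-bit integers in Python and compares it with the textbook SHA-256 Ch and Maj definitions. The exact argument order used in any particular miner kernel is an assumption here; this is one common formulation, not a quote from the Poclbm or DiaKGCN source.

```python
import random

MASK = 0xFFFFFFFF  # keep everything in 32 bits, like a uint in OpenCL


def bitselect(a: int, b: int, c: int) -> int:
    """Model of OpenCL bitselect: bits of b where c is 1, bits of a elsewhere."""
    return ((a & ~c) | (b & c)) & MASK


def ch_ref(x: int, y: int, z: int) -> int:
    """SHA-256 Ch as usually written: (x AND y) XOR (NOT x AND z)."""
    return ((x & y) ^ (~x & z)) & MASK


def ma_ref(x: int, y: int, z: int) -> int:
    """SHA-256 Maj as usually written: bitwise majority of x, y, z."""
    return ((x & y) ^ (x & z) ^ (y & z)) & MASK


def ch(x: int, y: int, z: int) -> int:
    # x chooses between y (bit set) and z (bit clear).
    return bitselect(z, y, x)


def ma(x: int, y: int, z: int) -> int:
    # Where x and z agree, the majority is x; where they differ, y decides.
    return bitselect(x, y, z ^ x)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(10000):
        x, y, z = (rng.getrandbits(32) for _ in range(3))
        assert ch(x, y, z) == ch_ref(x, y, z)
        assert ma(x, y, z) == ma_ref(x, y, z)
    print("bitselect forms match the reference definitions")
```

Each function becomes a single select (plus one XOR for Ma) instead of several AND/XOR operations, which is presumably where the speed-up comes from on hardware with a matching instruction such as BFI_INT.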

I really would like to port this one into CGMiner (or help in getting it ported), but I did not have the time to do so AND I guess I need help in doing commits for CGMiner. I will send a PM to Con, perhaps he is interested ...

By the way, I use AGGRESSION=12 with this kernel and get ~75% utilization on 1 core. Not good, but could be worse!

Dia

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: blissfulyoshi on February 02, 2012, 07:41:43 PM

What I was asking about the name earlier in the previous thread is why the naming of this version changed?

current: diaggcn
thread title/previous: diakgcn

Oh well, minor thing, just changed the name of my inputs into phoenix. Keep up the good work.

ROFL ... I made a typing error, wow that is hard. Will upload a fixed one asap. Sorry for the confusion, yesterday was a bit hard.

Update: Fixed my typo ;-), download is back!

Dia

blissfulyoshi

newbie

Activity: 11

Merit: 0

What I was asking about the name earlier in the previous thread is why the naming of this version changed?

current: diaggcn
thread title/previous: diakgcn

Oh well, minor thing, just changed the name of my inputs into phoenix. Keep up the good work.
