Topic: DiaKGCN kernel for CGMINER + Phoenix 2 (79XX / 78XX / 77XX / GCN) - 2012-05-25 - page 4. (Read 27837 times)

sr. member
Activity: 256
Merit: 250
There is no "native 16-component vectors support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly onto hardware. Furthermore, the hardware is not SIMD (GCN's vector ALU units are more like SIMD, but they are _not_ 16-wide nevertheless). It would be rather naive and easy if vector operations were directly mapped to hardware capabilities, but that's not the case. You could, for example, imagine the VLIW4 or VLIW5 architecture operating as a 4-wide or 5-wide SIMD unit, and that sounds pretty logical, but that does not happen in reality.

To emulate 16-component vectors, VLIW bundles are generated in a way that 16 ALU operations are being performed rather than say 4. Which means that if one or two VLIW bundles were generated for 4-wide vector ALU operation, 4 or more bundles would be generated for a 16-wide vector ALU operation. The only benefit of doing this is tighter ALUPacking which is not very relevant on 6xxx. In most cases though, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.
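
The bundle arithmetic above can be sketched as a lower bound. This is only a rough model, assuming perfect packing (every VLIW slot filled, no dependencies); `min_bundles` is a hypothetical helper, not something from the compiler or the SDK.

```python
import math


def min_bundles(components: int, slots: int) -> int:
    """Lower bound on VLIW bundles needed to apply one ALU operation
    to every component of a vector, assuming perfect slot packing."""
    return math.ceil(components / slots)


if __name__ == "__main__":
    for name, slots in (("VLIW4", 4), ("VLIW5", 5)):
        for width in (2, 4, 8, 16):
            count = min_bundles(width, slots)
            print(f"{name} uint{width}: at least {count} bundle(s)")
```

Under that assumption a 4-wide operation fits in one bundle and a 16-wide one needs at least four, which matches the "4 or more" figure. Real kernels have dependencies, so the actual counts can only be higher.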

Unfortunately though, wider vectors mean more GPRs wasted, and more GPRs wasted mean fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide concerning the correlation of GPRs used to wavefronts/CU.
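
To make the GPR/occupancy trade-off concrete, here is a small sketch. The constants are assumptions for GCN and are not taken from this thread: a budget of 256 vector registers per work-item per SIMD, registers allocated in steps of 4, and at most 10 wavefronts per SIMD. Check the programming guide's table for your own hardware before relying on them.

```python
# Assumed GCN limits (not stated in the thread; verify against the SDK guide).
VGPR_BUDGET = 256          # vector registers available per work-item lane
VGPR_GRANULARITY = 4       # registers are assumed to be allocated in steps of 4
MAX_WAVES_PER_SIMD = 10    # assumed hardware cap on resident wavefronts


def waves_per_simd(vgprs_used: int) -> int:
    """Estimate resident wavefronts per SIMD for a given register count."""
    if vgprs_used <= 0:
        raise ValueError("vgprs_used must be positive")
    # Round the request up to the allocation granularity.
    steps = -(-vgprs_used // VGPR_GRANULARITY)
    allocated = steps * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // allocated)


if __name__ == "__main__":
    # Hypothetical register counts, only to show the trend.
    for used in (24, 32, 64, 128):
        print(f"{used:3d} VGPRs -> {waves_per_simd(used)} wavefront(s) per SIMD")
```

The point is the trend rather than the exact numbers: every doubling of register use roughly halves the wavefronts available to hide latency.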


There are some cases where uint16 might in fact improve performance, like simple kernels that execute fast, where time is wasted on kernel scheduling and data transfers. In that case using uint16 means more work per kernel invocation, and the overall effect is better when you weigh it against increased GPR usage. Bitcoin kernels, though, are not such a case.
sr. member
Activity: 378
Merit: 250
Anyhow, going back to what I was saying, Dia, I think that the best kernel design for GCN is one which will compute four 512-bit integers.  Since it can compute one in 4 cycles or 4 in 4 cycles, it seems best to attempt to compute 4 sets of 16 vectors to the fullest extent of the ALUs.  Alternatively, you could compute 3 sets and leave the last SIMD for computing other work required by the kernel, such as the nonce and the like.  So, multi-threading is brought into play with the GCN processors.
The problem is that these aren't multi-GPUs, these are multi-SIMD GPUs which makes coding a little more tricky.
I might be a little over-zealous to think that these are capable of handling four times the amount of mining at one time, but it seems like the approach to take.
The biggest suggestion I could make, though, is to drop the worksize down to allow for the increased vectors.  You should see some improvement with VECTORS8, but I can't promise it.
sr. member
Activity: 378
Merit: 250
Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of.  I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.

I could implement uint16, should be pretty straight forward, but massive vectorisation is really something GCN does not like currently.

Dia
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/3
I think it might be because the full 16 vectors are loaded and unloaded to make room for anything else that needs to be computed.
In theory, 16 vectors at once is the best approach, but that only applies if we're doing math for only the 16 vectors as that's the maximum the ALUs can hold.
In other words, the moment something else needs to be loaded, it has to pull the entire 512-bit integer from the ALUs to put into the cache, load the data to be computed, unload it, then reload the 512-bit integers.  But the GCN is supposedly a true 16 vector design so I think the problem is the overhead that's created loading and unloading.  With the 8 vectors, did you try the worksize of 64 to see if it was any faster?

PS Bad news--new Phoenix 2 miner.  I've suggested they make changes to the phatk2 kernel like you've made for your GCN here.  Like adding the GOFFSET option and increasing vector sizes.
hero member
Activity: 772
Merit: 500
Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of.  I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.

I could implement uint16, should be pretty straight forward, but massive vectorisation is really something GCN does not like currently.

Dia
sr. member
Activity: 378
Merit: 250
Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of.  I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.
sr. member
Activity: 378
Merit: 250
VECTORS4 WORKSIZE=128 with GOFFSET=false 14.45 Mhash/s
VECTORS4 WORKSIZE=128 without GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 with GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 without GOFFSET=false 14.47 Mhash/s

VECTORS4 WORKSIZE=64 with GOFFSET=false 14.49 Mhash/s
VECTORS4 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=64 with GOFFSET=false 14.55 Mhash/s
VECTORS8 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s

VECTORS4 WORKSIZE=32 with GOFFSET=false 14.46 Mhash/s
VECTORS4 WORKSIZE=32 without GOFFSET=false 14.47 Mhash/s
VECTORS8 WORKSIZE=32 with GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=32 without GOFFSET=false 14.48 Mhash/s

*High fives*  Playing around with VECTORS8 has done some good.  ^_^  And hardly anyone believed me that using 256-bit integers could pay off.
I'm going to "try" to do something with the nonce code in phatk2 by copying the nonce code from your kernel and see what happens.  I really wouldn't have known how to do it without you.
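
For anyone comparing the twelve runs listed in this post, a quick way to see how far apart they are is to compute the spread between the best and worst result. This is only a sketch: the numbers are copied from the post, while the dictionary layout and variable names are made up for illustration.

```python
# Key: (vector width, worksize, global offset disabled via GOFFSET=false)
# Value: reported rate in Mhash/s, copied from the post.
results = {
    (4, 128, True): 14.45, (4, 128, False): 14.46,
    (8, 128, True): 14.46, (8, 128, False): 14.47,
    (4, 64, True): 14.49,  (4, 64, False): 14.50,
    (8, 64, True): 14.55,  (8, 64, False): 14.50,
    (4, 32, True): 14.46,  (4, 32, False): 14.47,
    (8, 32, True): 14.50,  (8, 32, False): 14.48,
}

best = max(results, key=results.get)
worst = min(results, key=results.get)
spread_pct = (results[best] - results[worst]) / results[worst] * 100

if __name__ == "__main__":
    print(f"best:  VECTORS{best[0]} WORKSIZE={best[1]} "
          f"goffset_disabled={best[2]} -> {results[best]} Mhash/s")
    print(f"worst: VECTORS{worst[0]} WORKSIZE={worst[1]} "
          f"goffset_disabled={worst[2]} -> {results[worst]} Mhash/s")
    print(f"spread: {spread_pct:.2f}%")
```

The best and worst configurations differ by roughly 0.7 percent, so repeated runs would be needed to tell a real gain from measurement noise.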
newbie
Activity: 11
Merit: 0
I think that posting style looks nice, I'll copy.

2.5 (6870 on 11.11)
Code:
VECTORS4 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 284Mhash/s

Code:
VECTORS8 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 284Mhash/s

2.6 (6870 on 11.12, 50MHz slower on the GPU clock than the one on 2.5)
Code:
VECTORS2 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 262Mhash/s

Code:
VECTORS4 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 275Mhash/s

Code:
VECTORS8 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 268Mhash/s
full member
Activity: 216
Merit: 100
Would you mind to try the VECTORS8 version and report back?

Dia

I'm using Catalyst 12.1, 875/1225 clocks, same manufacturer/model 5870s on Windows 7.

https://bitcointalksearch.org/topic/m.718648 kernel:
Code:
VECTORS4 FASTLOOP=false AGGRESSION=10 WORKSIZE=128 BFI_INT

Max: ~400Mhash/s

Code:
VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Max: ~390Mhash/s


Your new diakgcn kernel:

Code:
VECTORS8 FASTLOOP=false AGGRESSION=10 WORKSIZE=128 BFI_INT

Max: ~354Mhash/s

Code:
VECTORS8 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Max: ~352Mhash/s
hero member
Activity: 772
Merit: 500
What did you get apples to apples?  As in using VECTORS4 with phatk2?

Oh, crap, significantly more. Guess I should've tried phatk2 with VECTORS4 before...

Edit: Blah, ignore above postings. I've been editing the kernel files.

Would you mind to try the VECTORS8 version and report back?

Dia
hero member
Activity: 772
Merit: 500
download current version:
http://www.filedropper.com/diakgcn04-02-2012

This version features uint8 vector support, which is activated via the VECTORS8 switch. This was beneficial on my VLIW5 6550D, but is pretty slow on GCN. Another switch, GOFFSET, was added, which can be used to disable the automatic usage of the global offset parameter (use GOFFSET=false to disable global offset). Perhaps it's faster for some to use the old way of generating the nonces in the kernel, so play around with it.
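
A rough model of the two nonce-generation paths mentioned above, written in Python rather than OpenCL. It assumes that with global offset enabled the host passes the nonce base as the global work offset at enqueue time (available since OpenCL 1.1), and that with GOFFSET=false the base is passed as a kernel argument and added inside the kernel. The function names are hypothetical, and vector widths (several nonces per work-item) are ignored.

```python
MASK32 = 0xFFFFFFFF  # nonces are 32-bit values


def get_global_id(index: int, global_offset: int = 0) -> int:
    """Model of OpenCL get_global_id(0): work-item index plus enqueue offset."""
    return global_offset + index


def nonce_with_goffset(index: int, base: int) -> int:
    # Assumed GOFFSET path: the runtime already folded `base` into the ID.
    return get_global_id(index, global_offset=base) & MASK32


def nonce_without_goffset(index: int, base: int) -> int:
    # Assumed GOFFSET=false path: the kernel adds `base` itself.
    return (base + get_global_id(index)) & MASK32


if __name__ == "__main__":
    base = 1_000_000
    for index in range(4):
        print(index, nonce_with_goffset(index, base),
              nonce_without_goffset(index, base))
```

Both paths produce the same nonces in this model; the difference is only whether the addition happens in the runtime or costs an instruction and an argument in the kernel.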

Dia
full member
Activity: 216
Merit: 100
What did you get apples to apples?  As in using VECTORS4 with phatk2?

Oh, crap, significantly more. Guess I should've tried phatk2 with VECTORS4 before...

Edit: Blah, ignore above postings. I've been editing the kernel files.
sr. member
Activity: 378
Merit: 250
I get ~10Mhash/s more on my 5870 using:

Code:
VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

with this kernel for regular desktop usage.
Compared to my old phatk2 line:

Code:
VECTORS AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP


Though, this kernel doesn't seem to help my higher-aggression-set card(also a 5870) in my crossfire setup compared to your 2011-12-21 phatk_dia.
What did you get apples to apples?  As in using VECTORS4 with phatk2?
full member
Activity: 216
Merit: 100
I get ~10Mhash/s more on my 5870 using:

Code:
VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

with this kernel for regular desktop usage.
Compared to my old phatk2 line:

Code:
VECTORS AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP


Though, this kernel doesn't seem to help my higher-aggression-set card(also a 5870) in my crossfire setup compared to your 2011-12-21 phatk_dia.
sr. member
Activity: 378
Merit: 250
Also, if I may, it doesn't look like uu needs to be set for GOFFSET as base doesn't appear to even be used.  I'm guessing that was your intention in the first place.
sr. member
Activity: 378
Merit: 250
Just tried your newer kernel and just about crapped myself.  I'm seeing some very competitive numbers with phatk2 now and I love the verbosity.  I see you decided to move the (u) values from the bitselect.  Did that help to speed things along?  I figured that if BFI_INT didn't have them, there was a major difference in something and one of them had to be slower.
I like how you used the global offset to your advantage.  (GOFFSET)
I am impressed.  You've been busy and I can see why.  If I was capable of hashing faster, I would totally send you some coin for your efforts.  Given a few months, I should.
newbie
Activity: 11
Merit: 0
More testing!!!!!

2.5
DiakGCN VECTORS WORKSIZE=128: 247MHps
DiakGCN VECTORS2 WORKSIZE=128: 280-281MHps
DiakGCN VECTORS4 WORKSIZE=128: 284MHps....It looks like your old Phoenix kernel is finally beaten for me. Now just need to surpass my cgminer scores of 290MHps xD

CPU at 25-30% on my C2D.

Increases all across the board. Congratz!
hero member
Activity: 772
Merit: 500
download current version:
http://www.filedropper.com/diakgcn03-02-2012_1

This release fixes the bugged VECTORS4 code, which works again (tested on 7970 and 6550D) and could speed things up for VLIW4 / VLIW5 GPUs with WORKSIZE=128, so just try it. There are no further changes for GCN in conjunction with VECTORS2 since 03-02-2012.

Dia
hero member
Activity: 772
Merit: 500
I tried it, and I'm sorry that I haven't reported back.  Work has been chaotic.

The only way that I can get a similar hashrate compared to DiabloMiner with this kernel is to use a very high intensity (greater than 10).  By doing this though, CPU usage skyrockets and I burn up more wattage than the hashrate increase is worth.  I can make a few changes to the Poclbm kernel included with CGMiner though and get 96 percent of the performance of DiabloMiner while leaving the intensity at 9.  By using CGMiner, I am able to use backup pools, RPC, thermal controls, etc.  This more than makes up for the ~4 percent loss in performance.  I'm not at home right now to look at every change, but defining the Ch and Ma functions to use Bitselect is basically all that was needed.
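
The post does not show the actual definitions, so the following is only a sketch of the usual identities for expressing SHA-256's Ch and Ma through a bitselect operation. `bitselect(a, b, c)` is modelled here on the OpenCL built-in (take the bit from `b` where `c` is 1, otherwise from `a`), and the script checks the identities against the textbook forms on random 32-bit values.

```python
import random

MASK32 = 0xFFFFFFFF


def bitselect(a: int, b: int, c: int) -> int:
    """Model of OpenCL bitselect: bit of b where c is set, else bit of a."""
    return ((a & ~c) | (b & c)) & MASK32


def ch_reference(x: int, y: int, z: int) -> int:
    """Textbook SHA-256 Ch: (x AND y) XOR (NOT x AND z)."""
    return ((x & y) ^ (~x & z)) & MASK32


def ma_reference(x: int, y: int, z: int) -> int:
    """Textbook SHA-256 Maj: (x AND y) XOR (x AND z) XOR (y AND z)."""
    return ((x & y) ^ (x & z) ^ (y & z)) & MASK32


def ch(x: int, y: int, z: int) -> int:
    # x selects between y (bit set) and z (bit clear).
    return bitselect(z, y, x)


def ma(x: int, y: int, z: int) -> int:
    # Where x and z agree the majority is x, otherwise y decides.
    return bitselect(x, y, z ^ x)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(10_000):
        x, y, z = (rng.getrandbits(32) for _ in range(3))
        assert ch(x, y, z) == ch_reference(x, y, z)
        assert ma(x, y, z) == ma_reference(x, y, z)
    print("bitselect forms match the reference definitions")
```

Each function becomes a single select (plus one XOR for Ma), which is why this rewrite helps on hardware that has a matching instruction.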

I'll try to send you a PM tonight with more details.

I really would like to port this one into CGMiner (or help in getting it ported), but I did not have the time to do so AND I guess I need help in doing commits for CGMiner. I will send a PM to Con, perhaps he is interested ...

By the way, I use AGGRESSION=12 with this kernel and get ~75% utilization on 1 core. Not good, but could be worse!

Dia
hero member
Activity: 772
Merit: 500
What I was asking about the name earlier in the previous thread is why the naming of this version changed?

current: diaggcn
thread title/previous: diakgcn

Oh well, minor thing, just changed the name of my inputs into phoenix. Keep up the good work.

ROFL ... I made a typing error, wow that is hard. Will upload a fixed one asap. Sorry for the confusion, yesterday was a bit hard.

Update: Fixed my typo ;-), download is back!

Dia
newbie
Activity: 11
Merit: 0
What I was asking about the name earlier in the previous thread is why the naming of this version changed?

current: diaggcn
thread title/previous: diakgcn

Oh well, minor thing, just changed the name of my inputs into phoenix. Keep up the good work.