
Topic: collection for cgminer 7970 [Card has been sent! THANK YOU EVERYONE]

legendary
Activity: 1876
Merit: 1000
you da man....


I am stuck on Windows until I can get linuxcoin to cooperate.

At 1125/1000, intensity 11, I'm getting 672 Mh/s. I am happy with that. I have not tried to clock it up; I think I will try that now.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
cgminer 2.3.2 is out with a new poclbm kernel. I've been bashing it with a mallet for a week to try and extract some more out of it, and I hit my target, which is 720 MHash at 1200/1050+5% clocks, intensity 11, with the 12.3 AMD drivers.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Cgminer 2.2.7 out. This version fixes the bug with 12.2 ATI drivers. Reworked to use -w 64 by default on Tahiti which is worth just under 1 more MHash. 718.5 MHash now at 1200/1050+5% clocks intensity 11.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Direct link to the kernel in the git tree:
https://github.com/ckolivas/cgminer/blob/master/poclbm120214.cl

It's using a worksize of 256, 1 vector (i.e. no vectors) and intensity 11.
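
As a rough sketch, those kernel settings could be spelled out by hand on the command line roughly like this. The pool URL, worker name and password are made-up placeholders, and the switches (-k kernel, -w worksize, -v vectors, -I intensity) are assumed to behave as described in the cgminer README for your version. The function only prints the command; it does not run it.

```shell
#!/bin/sh
# Sketch only: prints a cgminer command line instead of running it.
# Pool URL, user and password are hypothetical placeholders.
kernel_settings_cmd() {
    pool_url=$1
    pool_user=$2
    pool_pass=$3
    # -k poclbm : select the poclbm kernel
    # -w 256    : worksize of 256
    # -v 1      : 1 vector, i.e. no vectors
    # -I 11     : intensity 11
    printf 'cgminer -o %s -u %s -p %s -k poclbm -w 256 -v 1 -I 11\n' \
        "$pool_url" "$pool_user" "$pool_pass"
}

kernel_settings_cmd "http://pool.example.com:8332" "worker" "password"
```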
sr. member
Activity: 378
Merit: 250

I'm going to have to look at what methods you're using.  I'm curious as to how the programming differs between VLIW and GCN.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
cgminer 2.2.6 out. 2+ more MHash on 7970. It's getting harder and harder to extract much more.

1200/1050+5% clocks, intensity 11 - 717 Mhash.
sr. member
Activity: 378
Merit: 250
Weird. I would have thought it would do better considering the 16-vector ALUs. VLIW actually showed the best output for me as long as I used VECTORS8 and GOFFSET=false, as I'm using an HD5450. But that comes with new architectures, I suppose. I wish I could find more literature on programming for GCN, but it's so new that I can't find much. Combine that with not being able to test the code, and I might as well stick with modifying Phatk2 (which is taking a while considering life's little disruptions lately). I'll save up for a 7970 and try to figure out what I can do about GCN code. I'm thinking about having the code alternate between two variables in case it has any read/write timing conflicts. On the other hand, it probably won't do much more than add another GPR and a few more ALUs. Considering its instruction execution timing, it probably won't matter. Too many theories to be pondering at the same time. I'll rest on it and see what I can come up with.
If you have any documentation, I would really appreciate it.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Absolutely. I tried all sorts of combinations. Specifically the optimum, as always, is using the card's reported max_work_size and then dividing that by the vector size. It gave the least worst performance... but we're talking 20% of the performance of running no vectors.
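
To make that worksize rule concrete, here is a minimal sketch of the arithmetic. It assumes a reported maximum of 256 (the worksize mentioned elsewhere in this thread; another card may report something different), and the function name is made up for illustration.

```shell
#!/bin/sh
# worksize_for MAX_WORK_SIZE VECTORS
# Divides the card's reported maximum work size by the vector width.
worksize_for() {
    max_work_size=$1
    vectors=$2
    echo $(( max_work_size / vectors ))
}

# Assumed maximum of 256; vector widths mentioned in the discussion.
for v in 1 4 16; do
    echo "vectors=$v worksize=$(worksize_for 256 "$v")"
done
```

With those assumed numbers, 16 vectors leaves a worksize of only 16, against 256 when no vectors are used.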
sr. member
Activity: 378
Merit: 250
Hmm, interesting.  Did you drop the worksize in your tests as well?
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
I tried it, and a 16 vector simplest possible mining kernel had performance which was, unfortunately, appalling. GCN with SDK 2.6 (the only one it works with) really does not want any vectors at all.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
New release of cgminer 2.2.5 with a fresh kernel and no more zero binary error, coping with multiple different cards at last and working well with SDK 2.6.

7970 running at 1200/1050+5% is getting 714 Mhash with -I 11 and cgminer 2.2.5 defaults.
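
A sketch of how that configuration might be passed on the command line, assuming the clocks and powertune are set through cgminer's own --gpu-engine, --gpu-memclock and --gpu-powertune switches rather than an external tool (the post does not say which). Pool details are placeholders, and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: prints the command so nothing is launched by accident.
# Assumes clocks are set via cgminer itself; pool details are placeholders.
overclock_cmd() {
    engine_mhz=$1     # GPU engine clock, e.g. 1200
    mem_mhz=$2        # GPU memory clock, e.g. 1050
    powertune_pct=$3  # powertune percentage, e.g. 5
    intensity=$4      # intensity, e.g. 11
    printf 'cgminer -o %s -u %s -p %s --gpu-engine %s --gpu-memclock %s --gpu-powertune %s -I %s\n' \
        "http://pool.example.com:8332" "worker" "password" \
        "$engine_mhz" "$mem_mhz" "$powertune_pct" "$intensity"
}

overclock_cmd 1200 1050 5 11
```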
sr. member
Activity: 378
Merit: 250
The problem, I believe, is that the GCN cards are built to take large vector counts and perform a single instruction on them all at once. This is in contrast to the small vector count of VLIW, which can perform large instructions quickly. So, a straightforward 16-vector miner with simple instructions should work better than a 4-vector miner with multiple instructions. Granted, this is speculation, but from what I've seen of the hardware specs, it should hold true.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Working on the 7970 tuning, I have tried to port both the diapolo and diablo kernels to cgminer. Alas, neither of them is actually working yet, so instead I started modifying the existing poclbm kernel in cgminer to improve throughput. This should work on other GPUs as well as the 7970, but I have no idea if it's better or worse than phatk.

When it's released it will get a new date/version number, but I haven't changed the number right now so that people can download it now and give it a try:
https://raw.github.com/ckolivas/cgminer/kernels/poclbm120203.cl

Remember to delete any .bin files and if you're not on a 7970 with the latest cgminer, you'll have to tell it to use that kernel with -k poclbm.
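
A cautious sketch of that cleanup step, assuming the cached kernel binaries sit as *.bin files directly in the directory cgminer is run from (the directory argument is whatever you use; the function name is made up). It only touches regular files ending in .bin at the top level of that directory.

```shell
#!/bin/sh
# clear_kernel_bins DIR
# Removes cached *.bin files from DIR so the kernel gets rebuilt
# on the next start. DIR is an assumption: wherever you run cgminer from.
clear_kernel_bins() {
    dir=${1:-.}
    for f in "$dir"/*.bin; do
        # Skip the unexpanded pattern when no .bin files exist.
        [ -f "$f" ] || continue
        echo "removing $f"
        rm -f -- "$f"
    done
}

# Example: clear_kernel_bins /path/to/cgminer
# Afterwards, add -k poclbm if you are not on a 7970 with the latest cgminer.
```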

So what's the damage? Well, the 7970 at 1200/1050 clocks, which was getting 694 MHash, is now getting 711 MHash. The 7970 has this unusual behaviour where the hashrate slowly rises for the first 5-10 minutes.
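
For a sense of scale, the relative gain from those two hashrates can be worked out with a one-liner. This is just arithmetic on the figures quoted above; the function name is made up.

```shell
#!/bin/sh
# gain_pct OLD NEW
# Prints the percentage increase from OLD to NEW, to one decimal place.
gain_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

# 694 MHash before, 711 MHash after.
echo "gain: $(gain_pct 694 711)%"
```

That comes out at roughly 2.4%.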
newbie
Activity: 22
Merit: 0

I use dynamic on my main card and have been using 14 on the second as it's a Crossfire only card (I haven't progressed to the whole eyefinity 2 x 6 thang yet). I've changed to 11 and it seems to have basically the same rate. Coolness!

-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/


Excellent. Don't forget to use intensity 11 if you can.
newbie
Activity: 22
Merit: 0
Many thanks to ck for the hard work and to everyone else who donated!

I just changed over to 2.2.3 on my Windows machine with two shiny new Sapphire 7970s and am getting 513 and 537 at stock speeds (running at a leisurely 78 degrees), a huge jump over 2.2.1.

Many thanks!

-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/


See my post here:
https://bitcointalksearch.org/topic/m.738468

That has the basics.
Feel free to ask about anything else. After the effort you put into getting the funds together and card to me, you deserve all the software support you need. As does sveetsnelda for his excellent generosity and everyone else who donated.
legendary
Activity: 1876
Merit: 1000


Looks like I will have to figure out how to get this thing working on linuxcoin...

I imagine it is something like

curl somespecialfile
edit xorg.conf directly

aticonfig initialize

?
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
And just for completeness, after the intensity investigation, I bumped it back up to the 1200/1050(+5%) clockspeed and put intensity up to 11 which seems the sweet spot. The final performance after it stabilised is:
Code:
GPU 0:  72.0C 4324RPM | 695.1/695.3Mh/s | A:98 R:1 HW:0 U: 9.79/m I:11
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Ck: can you do the comparisons with the clock not set so high? At 1200 they run pretty hot with a high fan. Also, if you just go down about 50 on the clock, you can reduce the voltage to 1120 or so.

I am currently running at voltage 1100, clock 1100: 650 Mhash with diablo.

At your clock settings of 1100 (+5% powertune) I'm getting 638 Mhash. Voltage doesn't change on linux with ADL it seems.
Doing:
Code:
export GPU_USE_SYNC_OBJECTS=1
was mandatory for getting rid of the CPU usage bug. Interestingly with that feature enabled, there is an inflexion point in intensity when the CPU usage jumps up dramatically. With the cgminer default of 2 GPU threads, it is low CPU usage up to intensity 11. With 1 GPU thread, it is low CPU usage up to intensity 13. Both seem to provide only about 1MH difference.
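
Putting those observations together, a small sketch: set the environment variable, then pick the highest intensity that stayed at low CPU usage for the chosen number of GPU threads. The thresholds of 11 and 13 are only the figures reported in this post for one 7970, the function name is made up, and the -g switch for GPU threads is assumed to behave as in the cgminer README. Pool arguments are left out, and the commands are printed rather than run.

```shell
#!/bin/sh
# Needed to avoid the CPU usage bug, per the post above.
export GPU_USE_SYNC_OBJECTS=1

# max_low_cpu_intensity GPU_THREADS
# Prints the highest intensity reported to keep CPU usage low.
# Only the two cases measured in the post are known.
max_low_cpu_intensity() {
    case $1 in
        1) echo 13 ;;
        2) echo 11 ;;
        *) echo "unknown"; return 1 ;;
    esac
}

for threads in 1 2; do
    intensity=$(max_low_cpu_intensity "$threads")
    echo "cgminer -g $threads -I $intensity"
done
```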