My initial Radeon HD 7970 mining benchmarks - page 7.

project10

newbie

Activity: 8

Merit: 0

Thanks for the numbers, I've got one on the way.

1862

newbie

Activity: 65

Merit: 0

Quote from: The King on January 09, 2012, 02:17:14 PM

Quote from: 1onevvolf on January 09, 2012, 02:15:39 PM

Quote from: luo demin on January 09, 2012, 01:26:28 PM

I can't wait till the 7990 that is going to be impressive but expensive Sad

I might have missed this but what is the heat like hashing overclocked ? and what fan speed

Overclocked @ 1125/975MHz with automatic fan speed I'm getting temperatures hovering 81-83C, and the fan runs at 47-49% speed. You can see some screencaps on one of the earlier pages. But since I prefer lower temperatures and am worried about VRM and memory temps not yet being reported by GPU-Z, I usually run it at 60% fan speed and get temps around 72C. The blower fan at 60% speed is quite loud (its a reference design from Sapphire).

At 100% fan speed, the overclocked card gets below 60C while mining but you can hear it from outside of the house at this point Tongue

, so as lovely as these temps are this is not an option for me as it is also my gaming and work PC.

Yeah, so they still have not fixed that damn reference fan design. Aftermarket coolers FTW !

Damn ATI and their crap loud fan designs Sad

Could be worse, I have a GTX460 that makes me want to tear my hair out Undecided

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

EPiSKiNG

legendary

Activity: 800

Merit: 1001

anyone care to update https://en.bitcoin.it/wiki/Mining_hardware_comparison with the new findings??

-EP

DeathAndTaxes

donator

Activity: 1218

Merit: 1080

Gerald Davis

EPiSKiNG

legendary

Activity: 800

Merit: 1001

1onevvolf

newbie

Activity: 43

Merit: 0

I've measured my system and these are the results:

	Stock (925/1375MHz)	Overclocked (1125/975MHz)
Mining :	371 W @ 550MH/s	385 W @ 670MH/s
Idle :	118 W	118 W
Difference_(gfx_card_W):	253 W	267 W
MH/J_(system) :	1.48	1.74
MH/J_(gfx_card_only) :	2.17	2.51
MH/$_(gfx_card_only) :	1.00	1.22

(MH/$ estimated using lowest listed price for HD 7970 on amazon.com today)

EPiSKiNG

legendary

Activity: 800

Merit: 1001

Quote from: 1onevvolf on January 10, 2012, 08:17:31 AM

Up to 670MH/s @ 1125/975Mhz now with the driver AMD published yesterday! And I finally managed to find a wattmeter, so you can expect some measurements later today when I get back home from work

Yay!!

1onevvolf

newbie

Activity: 43

Merit: 0

Up to 670MH/s @ 1125/975Mhz now with the driver AMD published yesterday! And I finally managed to find a wattmeter, so you can expect some measurements later today when I get back home from work

chromeguy

newbie

Activity: 28

Merit: 0

666+mh/s
some hardcore guys used liquid nitrogen cooling and overclocked it by 84%

celcoid

member

Activity: 280

Merit: 10

So what do you think we can get out of this card being 100% optimistic? How much is it limited by the current best software solution.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: 1onevvolf on January 09, 2012, 05:59:29 PM

Code:

int16 selection = XG2 == (x)(0x136032ED);
if (any(selection))
{
x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);
}

That "if" might be totally unneccesary, and I still don't quite understand how the output array works, but it might give you a better idea of what I was trying to do to avoid all those branches.

Quote from: DiabloD3 on January 09, 2012, 03:30:21 PM

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

I'll be watching the repository then

It should almost definitely help with more modern CPUs and Larrabee/Intel MIC.

The output array is basically a massive hack to prevent multiple outputs from hitting each other, although the chances of getting multiple outputs is extremely low. The size of the array now is massive overkill, but it also seems to be a strangely optimum size for hardware.

Now, what would give me the most benefit is some way of sorting the outputs in a single cycle so that the pair of { nonce, H } could instantly give me the best nonce, and then only evaluate that. There seems to be no way to do this (and yes, I imply reverting that one bit of math so that H == 0 is literally done at the end again, makes it much easier to sort on shit). The nonces themselves can't be sorted because its completely random, they're meaningless values essentially.

1onevvolf

newbie

Activity: 43

Merit: 0

Quote from: DiabloD3 on January 09, 2012, 03:30:21 PM

The branching has ended up becoming the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (and theres apparently like 2 or 3 good tricks to get rid of branch waste, its just none of them work on memory writes).

That's interesting. Branches always seemed to be published as anathema to well performing kernels. I guess it all depends on how much work is being done inside. For small vector sizes there are few ifs, but with uint16 there are quite a few, so it might be worth investigating there.

Quote from: DiabloD3 on January 09, 2012, 03:30:21 PM

I should look at shuffle. Your way doesn't quite work though, vstore would output H !=0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.

Yes, I'm still getting HW alerts and haven't quite worked them out yet. I posted the snippet earlier from memory and missed a couple of steps. The latest (broken) code I'm working with looks like this:

Code:

int16 selection = XG2 == (x)(0x136032ED);
if (any(selection))
{
x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);
}

That "if" might be totally unneccesary, and I still don't quite understand how the output array works, but it might give you a better idea of what I was trying to do to avoid all those branches.

Quote from: DiabloD3 on January 09, 2012, 03:30:21 PM

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

I'll be watching the repository then

It should almost definitely help with more modern CPUs and Larrabee/Intel MIC.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: yabadaba on January 09, 2012, 03:32:05 PM

Quote from: DiabloD3

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

So does that mean that is the best for 5870 cards ? Or stick to 2.1 or 2.4 ? I am quite confused as to what the best SDK / ati driver combo is ATM.

Notice I said CPU not GPU. CPU mining still sucks altogether. 2.1 is still best for 58xx cards.

Fiyasko

legendary

Activity: 1428

Merit: 1001

Okey Dokey Lokey

Quote from: terrytibbs on January 09, 2012, 02:14:07 PM

Quote from: ?? on ??

okay, i will fly to Singapore and pick one up if it all makes you happy....

i got a girl there:P

Is it Mrs. Zhou Tong?

"Aadamm, what the hell did you DO?, The whole buildings on alert!" "A PANIC ROOM SHES GOD A GODDAMN PANIC ROOM!" "YEA WELL SO DO I ADAM!!"

yabadaba

newbie

Activity: 8

Merit: 0

Quote from: DiabloD3 on January 09, 2012, 03:30:21 PM

Quote from: 1onevvolf on January 09, 2012, 02:02:03 PM

Quote from: DiabloD3 on January 09, 2012, 01:10:19 PM

Wait wait wait. Are we sure uint16 is such a good idea? Last time I tried >4 (which was before 2.6, btw, I haven't tested with 2.6), it would crash in the compiler. Also, does anyone have a count on the number of registers per CU? There might not be enough registers to handle that.

I'm not sure if it's a good idea or not so I wanted to measure it Wink

GCN has 64KB worth of registers per CU, and like you said I'm not sure if that's enough. The reason for my curiosity was because GCN's compute units each contain 4 x SIMD units with a width of 16 elements (same size as Larrabee & Intel's MIC, coincidentally), and I recall reading somewhere that each of these SIMD units can retire one 16-way instruction every 4 cycles, so those 16element vectors kind of rang out at me. I also wanted to get familiar with the OpenCL bitcoin mining code and thought it would be a neat exercise (which it was!). Nice code by the way.

I can say for sure that 16element vectors DO compile with the drivers that came with the card.

The -ds code dump for 16 element vectors came out nice and clean, although the last few lines where the result is stored in output seem a bit branchy. It looks something like this:

Code:

if(XG2.s0 == 0x136032ED) { output[Xnonce.s0 & 0xF] = Xnonce.s0; }
if(XG2.s1 == 0x136032ED) { output[Xnonce.s1 & 0xF] = Xnonce.s1; }
if(XG2.s2 == 0x136032ED) { output[Xnonce.s2 & 0xF] = Xnonce.s2; }
...
...
if(XG2.sd == 0x136032ED) { output[Xnonce.sd & 0xF] = Xnonce.sd; }
if(XG2.se == 0x136032ED) { output[Xnonce.se & 0xF] = Xnonce.se; }
if(XG2.sf == 0x136032ED) { output[Xnonce.sf & 0xF] = Xnonce.sf; }

I tried replacing it with a branch-less expression using shuffle() and vstore16() but haven't managed to get it working. What I've come up with looks something like this:

Code:

x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);

Anyhow I'm sure that my code modifications are doing all sorts of dumb things. I'm still learning how it all works so please ignore.

Quote from: DiabloD3 on January 09, 2012, 01:10:19 PM

Also, check some of the larger -vs, -v 40 is two sets of uint4 and -v 44 does three uint4s (unlike cgminer, -v 4 does two uint2s).

I've tried all of the different -v settings available (according to the source) but haven't been able to get any higher than the 666MH/s with the default settings and 3 compute threads.

The branching has ended up becoming the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (and theres apparently like 2 or 3 good tricks to get rid of branch waste, its just none of them work on memory writes).

I should look at shuffle. Your way doesn't quite work though, vstore would output H !=0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

So does that mean that is the best for 5870 cards ? Or stick to 2.1 or 2.4 ? I am quite confused as to what the best SDK / ati driver combo is ATM.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: 1onevvolf on January 09, 2012, 02:02:03 PM

Quote from: DiabloD3 on January 09, 2012, 01:10:19 PM

Wait wait wait. Are we sure uint16 is such a good idea? Last time I tried >4 (which was before 2.6, btw, I haven't tested with 2.6), it would crash in the compiler. Also, does anyone have a count on the number of registers per CU? There might not be enough registers to handle that.

I'm not sure if it's a good idea or not so I wanted to measure it Wink

GCN has 64KB worth of registers per CU, and like you said I'm not sure if that's enough. The reason for my curiosity was because GCN's compute units each contain 4 x SIMD units with a width of 16 elements (same size as Larrabee & Intel's MIC, coincidentally), and I recall reading somewhere that each of these SIMD units can retire one 16-way instruction every 4 cycles, so those 16element vectors kind of rang out at me. I also wanted to get familiar with the OpenCL bitcoin mining code and thought it would be a neat exercise (which it was!). Nice code by the way.

I can say for sure that 16element vectors DO compile with the drivers that came with the card.

The -ds code dump for 16 element vectors came out nice and clean, although the last few lines where the result is stored in output seem a bit branchy. It looks something like this:

Code:

if(XG2.s0 == 0x136032ED) { output[Xnonce.s0 & 0xF] = Xnonce.s0; }
if(XG2.s1 == 0x136032ED) { output[Xnonce.s1 & 0xF] = Xnonce.s1; }
if(XG2.s2 == 0x136032ED) { output[Xnonce.s2 & 0xF] = Xnonce.s2; }
...
...
if(XG2.sd == 0x136032ED) { output[Xnonce.sd & 0xF] = Xnonce.sd; }
if(XG2.se == 0x136032ED) { output[Xnonce.se & 0xF] = Xnonce.se; }
if(XG2.sf == 0x136032ED) { output[Xnonce.sf & 0xF] = Xnonce.sf; }

I tried replacing it with a branch-less expression using shuffle() and vstore16() but haven't managed to get it working. What I've come up with looks something like this:

Code:

x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);

Anyhow I'm sure that my code modifications are doing all sorts of dumb things. I'm still learning how it all works so please ignore.

Quote from: DiabloD3 on January 09, 2012, 01:10:19 PM

Also, check some of the larger -vs, -v 40 is two sets of uint4 and -v 44 does three uint4s (unlike cgminer, -v 4 does two uint2s).

I've tried all of the different -v settings available (according to the source) but haven't been able to get any higher than the 666MH/s with the default settings and 3 compute threads.

The branching has ended up becoming the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (and theres apparently like 2 or 3 good tricks to get rid of branch waste, its just none of them work on memory writes).

I should look at shuffle. Your way doesn't quite work though, vstore would output H !=0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

stick_wood

newbie

Activity: 8

Merit: 0

Really nice cards and performance but the price really sucks !

5XXX is much more cost effective ATM. That may change in the future.

Maybe wait for FPGA ?

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: 1onevvolf on January 09, 2012, 12:55:18 PM

Quote from: DeathAndTaxes on January 09, 2012, 10:20:35 AM

Hey OP do you have a kill-a-watt you could purchase locally. If you are in the states Home Depot and Lowes carry them. If you can find one locally I am sure we could get together the 3 or 4 BTC to get some accurate power readings.

The kill-a-watt brand doesn't appear to be commercialized here in europe, and I've been searching for an equivalent device locally each time I've had a chance to head out to a store for the past couple of days, but no luck so far.

I also took a stab at modifying DiabloMiner and managed to get it to use 16component vectors, which is what GCN is supposed to be tuned for, but performance isn't what I expect and its really hard to profile/debug the tahiti since I could not find any development tools that specificly support it yet.

BTW, they do make 240v/50hz euro Killawatts, but you might have to order it from the US. They also make 240v/60hz (double hot, like ovens and water heaters) ones and 208v ones for DC shit. Might have to look around, I love mine, its been essential for planning stuff out.

The King

newbie

Activity: 13

Merit: 0

Quote from: 1onevvolf on January 09, 2012, 02:15:39 PM

Quote from: luo demin on January 09, 2012, 01:26:28 PM

I can't wait till the 7990 that is going to be impressive but expensive Sad

I might have missed this but what is the heat like hashing overclocked ? and what fan speed

Overclocked @ 1125/975MHz with automatic fan speed I'm getting temperatures hovering 81-83C, and the fan runs at 47-49% speed. You can see some screencaps on one of the earlier pages. But since I prefer lower temperatures and am worried about VRM and memory temps not yet being reported by GPU-Z, I usually run it at 60% fan speed and get temps around 72C. The blower fan at 60% speed is quite loud (its a reference design from Sapphire).

At 100% fan speed, the overclocked card gets below 60C while mining but you can hear it from outside of the house at this point Tongue

, so as lovely as these temps are this is not an option for me as it is also my gaming and work PC.

Yeah, so they still have not fixed that damn reference fan design. Aftermarket coolers FTW !

Damn ATI and their crap loud fan designs Sad

Topic: My initial Radeon HD 7970 mining benchmarks - page 7. (Read 46846 times)