Topic: My initial Radeon HD 7970 mining benchmarks - page 7. (Read 46819 times)

newbie
Activity: 8
Merit: 0
Thanks for the numbers, I've got one on the way.
newbie
Activity: 65
Merit: 0
I can't wait for the 7990; that is going to be impressive but expensive :( I might have missed this, but what is the heat like when hashing overclocked? And what fan speed?

Overclocked @ 1125/975MHz with automatic fan speed, I'm getting temperatures hovering around 81-83C, and the fan runs at 47-49% speed. You can see some screencaps on one of the earlier pages. But since I prefer lower temperatures and am worried about VRM and memory temps not yet being reported by GPU-Z, I usually run it at 60% fan speed and get temps around 72C. The blower fan at 60% speed is quite loud (it's a reference design from Sapphire).

At 100% fan speed, the overclocked card gets below 60C while mining, but you can hear it from outside the house at that point :P, so as lovely as these temps are, this is not an option for me as it is also my gaming and work PC.

Yeah, so they still have not fixed that damn reference fan design. Aftermarket coolers FTW!

Damn ATI and their crap loud fan designs :(

Could be worse, I have a GTX460 that makes me want to tear my hair out :-/
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
legendary
Activity: 800
Merit: 1001
Anyone care to update https://en.bitcoin.it/wiki/Mining_hardware_comparison with the new findings?

-EP
donator
Activity: 1218
Merit: 1079
Gerald Davis
legendary
Activity: 800
Merit: 1001
newbie
Activity: 43
Merit: 0
I've measured my system and these are the results:

                           Stock (925/1375MHz)   Overclocked (1125/975MHz)
Mining                     371 W @ 550 MH/s      385 W @ 670 MH/s
Idle                       118 W                 118 W
Difference (gfx card, W)   253 W                 267 W
MH/J (system)              1.48                  1.74
MH/J (gfx card only)       2.17                  2.51
MH/$ (gfx card only)       1.00                  1.22

(MH/$ estimated using lowest listed price for HD 7970 on amazon.com today)
legendary
Activity: 800
Merit: 1001
Up to 670MH/s @ 1125/975MHz now with the driver AMD published yesterday! And I finally managed to find a wattmeter, so you can expect some measurements later today when I get back home from work :)


Yay!!
newbie
Activity: 43
Merit: 0
Up to 670MH/s @ 1125/975MHz now with the driver AMD published yesterday! And I finally managed to find a wattmeter, so you can expect some measurements later today when I get back home from work :)
newbie
Activity: 28
Merit: 0
666+ MH/s
Some hardcore guys used liquid nitrogen cooling and overclocked it by 84%.
member
Activity: 280
Merit: 10
So what do you think we can get out of this card, being 100% optimistic? How much is it limited by the current best software solution?
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Code:
    int16 selection = XG2 == (x)(0x136032ED);
    if (any(selection))
    {
       x mask = Xnonce & 0xF;
       x temp = shuffle(select(Xnonce, 0, selection), mask);
       vstore16(temp, 0, output);
    }

That "if" might be totally unnecessary, and I still don't quite understand how the output array works, but it might give you a better idea of what I was trying to do to avoid all those branches.

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's CPU compiler apparently has gotten a lot better from what I've heard.

I'll be watching the repository then :) It should almost definitely help with more modern CPUs and Larrabee/Intel MIC.

The output array is basically a massive hack to prevent multiple outputs from hitting each other, although the chances of getting multiple outputs are extremely low. The size of the array now is massive overkill, but it also seems to be a strangely optimum size for hardware.

Now, what would give me the most benefit is some way of sorting the outputs in a single cycle so that the pair of { nonce, H } could instantly give me the best nonce, and then only evaluate that. There seems to be no way to do this (and yes, I imply reverting that one bit of math so that H == 0 is literally done at the end again, which makes it much easier to sort on shit). The nonces themselves can't be sorted because they're completely random; they're meaningless values essentially.
newbie
Activity: 43
Merit: 0
The branching has ended up becoming the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (and there are apparently like 2 or 3 good tricks to get rid of branch waste, it's just that none of them work on memory writes).

That's interesting. Branches always seemed to be described as anathema to well-performing kernels. I guess it all depends on how much work is being done inside. For small vector sizes there are few ifs, but with uint16 there are quite a few, so it might be worth investigating there.

I should look at shuffle. Your way doesn't quite work though: vstore would output H != 0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.

Yes, I'm still getting HW alerts and haven't quite worked them out yet. I posted the snippet earlier from memory and missed a couple of steps. The latest (broken) code I'm working with looks like this:

Code:
    int16 selection = XG2 == (x)(0x136032ED);
    if (any(selection))
    {
       x mask = Xnonce & 0xF;
       x temp = shuffle(select(Xnonce, 0, selection), mask);
       vstore16(temp, 0, output);
    }

That "if" might be totally unnecessary, and I still don't quite understand how the output array works, but it might give you a better idea of what I was trying to do to avoid all those branches.

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's CPU compiler apparently has gotten a lot better from what I've heard.

I'll be watching the repository then :) It should almost definitely help with more modern CPUs and Larrabee/Intel MIC.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Quote from: DiabloD3
I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

So does that mean that it is the best for 5870 cards? Or stick to 2.1 or 2.4? I am quite confused as to what the best SDK / ATI driver combo is ATM.

Notice I said CPU not GPU. CPU mining still sucks altogether. 2.1 is still best for 58xx cards.
legendary
Activity: 1428
Merit: 1001
Okey Dokey Lokey
okay, i will fly to Singapore and pick one up if it all makes you happy....


i got a girl there:P
Is it Mrs. Zhou Tong?

"Aadamm, what the hell did you DO?, The whole buildings on alert!" "A PANIC ROOM SHES GOD A GODDAMN PANIC ROOM!" "YEA WELL SO DO I ADAM!!"
newbie
Activity: 8
Merit: 0
I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

So does that mean that it is the best for 5870 cards? Or stick to 2.1 or 2.4? I am quite confused as to what the best SDK / ATI driver combo is ATM.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Wait wait wait. Are we sure uint16 is such a good idea? Last time I tried >4 (which was before 2.6, btw, I haven't tested with 2.6), it would crash in the compiler. Also, does anyone have a count on the number of registers per CU? There might not be enough registers to handle that.

I'm not sure if it's a good idea or not, so I wanted to measure it ;) GCN has 64KB worth of registers per CU, and like you said I'm not sure if that's enough. The reason for my curiosity was that GCN's compute units each contain 4 x SIMD units with a width of 16 elements (same size as Larrabee & Intel's MIC, coincidentally), and I recall reading somewhere that each of these SIMD units can retire one 16-way instruction every 4 cycles, so those 16-element vectors kind of stood out to me. I also wanted to get familiar with the OpenCL bitcoin mining code and thought it would be a neat exercise (which it was!). Nice code by the way.

I can say for sure that 16-element vectors DO compile with the drivers that came with the card.

The -ds code dump for 16 element vectors came out nice and clean, although the last few lines where the result is stored in output seem a bit branchy. It looks something like this:

Code:
    if(XG2.s0 == 0x136032ED) { output[Xnonce.s0 & 0xF] = Xnonce.s0; }
    if(XG2.s1 == 0x136032ED) { output[Xnonce.s1 & 0xF] = Xnonce.s1; }
    if(XG2.s2 == 0x136032ED) { output[Xnonce.s2 & 0xF] = Xnonce.s2; }
    ...
    ...
    if(XG2.sd == 0x136032ED) { output[Xnonce.sd & 0xF] = Xnonce.sd; }
    if(XG2.se == 0x136032ED) { output[Xnonce.se & 0xF] = Xnonce.se; }
    if(XG2.sf == 0x136032ED) { output[Xnonce.sf & 0xF] = Xnonce.sf; }

I tried replacing it with a branch-less expression using shuffle() and vstore16() but haven't managed to get it working. What I've come up with looks something like this:

Code:
    x mask = Xnonce & 0xF;
    x temp = shuffle(select(Xnonce, 0, selection), mask);
    vstore16(temp, 0, output);

Anyhow I'm sure that my code modifications are doing all sorts of dumb things. I'm still learning how it all works so please ignore.

Also, check some of the larger -vs, -v 40 is two sets of uint4 and -v 44 does three uint4s (unlike cgminer, -v 4 does two uint2s).

I've tried all of the different -v settings available (according to the source) but haven't been able to get any higher than the 666MH/s with the default settings and 3 compute threads.

The branching has ended up becoming the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (and there are apparently like 2 or 3 good tricks to get rid of branch waste, it's just that none of them work on memory writes).

I should look at shuffle. Your way doesn't quite work though: vstore would output H != 0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.
newbie
Activity: 8
Merit: 0
Really nice cards and performance, but the price really sucks!

5XXX is much more cost effective ATM. That may change in the future.

Maybe wait for FPGA?
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
Hey OP, is there a Kill-A-Watt you could purchase locally? If you are in the States, Home Depot and Lowes carry them. If you can find one locally, I am sure we could get together the 3 or 4 BTC to get some accurate power readings.

The Kill-A-Watt brand doesn't appear to be sold here in Europe, and I've been searching for an equivalent device locally each time I've had a chance to head out to a store for the past couple of days, but no luck so far.

I also took a stab at modifying DiabloMiner and managed to get it to use 16-component vectors, which is what GCN is supposed to be tuned for, but performance isn't what I expected and it's really hard to profile/debug the Tahiti since I could not find any development tools that specifically support it yet.

BTW, they do make 240V/50Hz euro Kill-A-Watts, but you might have to order one from the US. They also make 240V/60Hz (double hot, like ovens and water heaters) ones and 208V ones for DC shit. Might have to look around. I love mine, it's been essential for planning stuff out.
newbie
Activity: 13
Merit: 0
I can't wait for the 7990; that is going to be impressive but expensive :( I might have missed this, but what is the heat like when hashing overclocked? And what fan speed?

Overclocked @ 1125/975MHz with automatic fan speed, I'm getting temperatures hovering around 81-83C, and the fan runs at 47-49% speed. You can see some screencaps on one of the earlier pages. But since I prefer lower temperatures and am worried about VRM and memory temps not yet being reported by GPU-Z, I usually run it at 60% fan speed and get temps around 72C. The blower fan at 60% speed is quite loud (it's a reference design from Sapphire).

At 100% fan speed, the overclocked card gets below 60C while mining, but you can hear it from outside the house at that point :P, so as lovely as these temps are, this is not an option for me as it is also my gaming and work PC.

Yeah, so they still have not fixed that damn reference fan design. Aftermarket coolers FTW!

Damn ATI and their crap loud fan designs :(