Topic: GPU chip degradation

hero member
Activity: 518
Merit: 500
February 27, 2012, 03:23:51 AM
#17
Do the effects of electromigration progress over time or is it just an instant thing where the card simply gives up after running great immediately before the total cut in the interconnectors?

It can be either. Electromigration may cause a slow decline of the maximum stable OC, particularly if the logic being degraded is in the critical path, or it may cause a sudden catastrophic failure that appears to come out of the blue.
full member
Activity: 126
Merit: 100
February 27, 2012, 02:41:03 AM
#16
If you want to get a fairly quick understanding of electromigration and how the atomic (chemical) structure of a wire changes as a result of temperature and electrical current, there is a PDF document from some semiconductor researchers at Arizona State University that is informative:

http://schroder.personal.asu.edu/Electromigration.pdf

In a nutshell:
Pretty neat that it mentions how hardware failure may take a very long time under "normal operating conditions," but is temperature dependent: raising the temperature from 30 Celsius to 250 Celsius results in hardware lasting 30 seconds (example). In addition, integrated circuits today use copper, which does not electromigrate as fast as aluminum. The smaller the process (40 nm, 32 nm, 28 nm, etc.), the more prone the chip is to electromigration, because more current is running through a thinner wire, i.e. the current density is higher!
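
The temperature and wire-thickness effects described above are usually modelled with Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). Below is a minimal sketch of it as a relative-lifetime calculator. The activation energy (0.7 eV) and current exponent (n = 2) are assumed illustrative values, not figures from this thread or the linked PDF, and the function name is made up for the example. The 77 C reference is the card temperature reported elsewhere in this thread.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(current_density_ratio, temp_c, ref_temp_c,
                  activation_energy_ev=0.7, current_exponent=2.0):
    """Lifetime relative to a reference operating point, per Black's equation.

    MTTF = A * J**(-n) * exp(Ea / (k * T)); the constant A cancels in the ratio.
    current_density_ratio is J / J_ref. Ea and n are ASSUMED values.
    """
    temp_k = temp_c + 273.15
    ref_temp_k = ref_temp_c + 273.15
    thermal = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_temp_k)
    )
    return current_density_ratio ** (-current_exponent) * thermal


# Same current density, 10 C cooler than a 77 C reference.
cooler = relative_mttf(1.0, temp_c=67.0, ref_temp_c=77.0)
# Same temperature, current density doubled (e.g. same current, half the cross-section).
thinner = relative_mttf(2.0, temp_c=77.0, ref_temp_c=77.0)

print(f"10 C cooler:        {cooler:.2f}x lifetime")
print(f"2x current density: {thinner:.2f}x lifetime")
```

With these assumed parameters a 10 C drop comes out at roughly twice the lifetime, and doubling the current density cuts it to a quarter. A different activation energy would shift both numbers, so treat this as an order-of-magnitude illustration only.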
full member
Activity: 210
Merit: 100
February 26, 2012, 09:03:40 PM
#15
Do the effects of electromigration progress over time or is it just an instant thing where the card simply gives up after running great immediately before the total cut in the interconnectors?
If not, I believe it may be your own configuration issues with the drivers, OC, etc.
Does the card still function fine at stock values?
The chip slowly degrades and requires higher voltage and/or lower clocks to be stable.
Yeah, this particular card passed hour-long Furmark and OCCT tests so it's still in pretty good shape, except it's now mining at 935 MHz instead of 942.

I haven't had a catastrophic card failure yet, like some users here. Lowering GPU voltage does seem to prolong cards' lives wonderfully; this ASUS is the only card I could not undervolt.
All GPU fans are OK (running at 60% RPM max), though I wore out two case fans (they started rattling/buzzing). Those case fans were running at 100%.
sr. member
Activity: 373
Merit: 250
February 26, 2012, 08:45:02 PM
#14
Do the effects of electromigration progress over time or is it just an instant thing where the card simply gives up after running great immediately before the total cut in the interconnectors?

If not, I believe it may be your own configuration issues with the drivers, OC, etc.

Does the card still function fine at stock values?
legendary
Activity: 2058
Merit: 1452
February 26, 2012, 07:03:46 PM
#13
I take it you're running your DCIIs in a caseless rig?
Yes. Open Mining stand built from a $6 wooden Walmart Shoe rack...lol

Here is an older pic of the setup for those 2 cards.....a bit different now (settings) but the stand is the same.

GPU #0 & #1 on the 6960MINER are the Asus DCUII cards.

http://members.shaw.ca/bitlane/open_miners2.jpg
You really like to piss away money + power on quad-core CPUs.
sr. member
Activity: 462
Merit: 250
I heart thebaron
February 18, 2012, 08:55:57 PM
#12
I take it you're running your DCIIs in a caseless rig?
Yes. Open Mining stand built from a $6 wooden Walmart Shoe rack...lol

Here is an older pic of the setup for those 2 cards.....a bit different now (settings) but the stand is the same.

GPU #0 & #1 on the 6960MINER are the Asus DCUII cards.

full member
Activity: 210
Merit: 100
February 18, 2012, 07:19:58 PM
#11
It's unfortunate that you had to run the card THAT hard to get performance out of it that satisfied you.

My (2) Asus 1GB DCUII 6950s run with locked shaders and stock voltage @ 900 MHz and get 374 MH/s without breaking a sweat (40% fan speed, 67C).
I'm just not sure the extra 15 MH/s would be worth the clocks/voltage/shortened life to achieve....for me anyway..

Good luck with it all.
What you say is true.
However, since I can't undervolt that card to achieve superior Mhash/W my next best bet is to push it as hard as possible to offset the (very significant in the case of this machine) system power usage: 8 TB of storage is quite power hungry, as is a four-core Phenom II.
With the voltage locked, power usage rises roughly linearly with core clock, so I'm OK here, except for the temperature, which isn't THAT bad. 52% fan speed is relatively quiet.
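
The reasoning above can be put into numbers. Dynamic power in CMOS logic goes roughly as C * V^2 * f, so at a fixed voltage both card power and hash rate scale about linearly with clock, and Mhash/W for the card alone stays flat. What changes is how much hash rate the fixed system draw is spread over. The sketch below uses the 942 MHz / 389 MHash/s / ~150 W figures from this thread. The 120 W system overhead is a purely hypothetical number chosen for illustration, leakage power is ignored, and only one card is modelled.

```python
REF_CLOCK_MHZ = 942.0
REF_HASH_MHS = 389.0    # hash rate at the reference clock, from the thread
REF_GPU_WATTS = 150.0   # "about 150W", from the thread
SYSTEM_WATTS = 120.0    # HYPOTHETICAL fixed draw (disks, CPU); not from the thread


def scaled(value_at_ref, clock_mhz):
    """Scale a quantity linearly with core clock (fixed-voltage assumption)."""
    return value_at_ref * clock_mhz / REF_CLOCK_MHZ


def efficiency(clock_mhz):
    """Return (card-only MHash/J, whole-system MHash/J) at the given clock."""
    hash_rate = scaled(REF_HASH_MHS, clock_mhz)
    gpu_watts = scaled(REF_GPU_WATTS, clock_mhz)
    card_only = hash_rate / gpu_watts
    whole_system = hash_rate / (gpu_watts + SYSTEM_WATTS)
    return card_only, whole_system


for clock in (900.0, 942.0):
    card_only, whole_system = efficiency(clock)
    print(f"{clock:.0f} MHz: card {card_only:.3f} MHash/J, "
          f"system {whole_system:.3f} MHash/J")
```

Under these assumptions the card-only figure is identical at both clocks, while the whole-system figure improves slightly at the higher clock, which is the point of pushing a card that cannot be undervolted.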

The idea of upgrading the server with mining GPUs was that the machine would pay for its energy footprint.
What I should really do is change the case to something better ventilated, like the rv-02 I'm using for my dedicated rigs.

I take it you're running your DCIIs in a caseless rig?
sr. member
Activity: 462
Merit: 250
I heart thebaron
February 18, 2012, 05:28:22 PM
#10
My ASUS 6950 DCII 1GB seems to be affected at last(1).

The card has been hashing since early September. Since late October the core speed was 942 MHz resulting in 389 MHash/s.

It's unfortunate that you had to run the card THAT hard to get performance out of it that satisfied you.

My (2) Asus 1GB DCUII 6950s run with locked shaders and stock voltage @ 900 MHz and get 374 MH/s without breaking a sweat (40% fan speed, 67C).
I'm just not sure the extra 15 MH/s would be worth the clocks/voltage/shortened life to achieve....for me anyway..

Good luck with it all.
full member
Activity: 210
Merit: 100
February 18, 2012, 07:36:05 AM
#9
...
77C is way above my comfort zone. Getting it down 10C will double your card's life expectancy.
If you have room for a triple slot cooler, I would consider buying one.
...
I know, P4, thanks for your concern.
No action needs to be taken, it's a highly overclocked 6950 with locked memory and voltage - of course it will get hot.

Did you know that the 69xx series was supposed to be manufactured in 32nm due to heat dissipation issues?
Obviously, the plan failed and AMD had to re-engineer those chips back into the 40nm process.

Actually, shame on me for purchasing a non-ref 6950 in the first place, but I wanted a quiet card since it's installed in my general purpose home server upgraded with mining capabilities.
It's right here in my study, I don't want to hear it howl when I'm in the bathroom :)

The other card in this server is a 6750 (206 MHash/s) - an oddball combination of cards but case space limitations didn't allow for another large card and I got it very cheap.
Getting about 600 MHash/s with this machine, which is not bad for a non-dedicated rig.

As I said, once the card can't pull 900 MHz I'll sell it provided it's still adequate for gaming.
hero member
Activity: 518
Merit: 500
February 18, 2012, 04:34:56 AM
#8
My ASUS 6950 DCII 1GB seems to be affected at last(1).

The card has been hashing since early September. Since late October the core speed was 942 MHz resulting in 389 MHash/s.
945 MHz would bring the card to its knees in a matter of hours. 942 MHz was stable for months... until this week.
The card hung a few days ago and twice yesterday.
935 MHz has been stable for a day now...

I was fully aware that setting the "cruise speed" so close to unstable clocks left me with no buffer zone - I did that more as an experiment into degradation progression than anything else.

I never liked this particular card much for its forced 125 MHz memdiff and locked core voltage resulting in power use of about 150W and temps in the high 70s.
Still, the degradation rate has been slower than I expected it to be.

Once the card is incapable of pulling 900 MHz I'll give it a second life in some gamer's rig provided no other issues emerge.

Just a data point for anyone interested in the subject.

In a word: electromigration.
You can slow the aging process by trying to reduce your temps. 77C is way above my comfort zone. Getting it down 10C will double your card's life expectancy.
If you have room for a triple slot cooler, I would consider buying one. Most of them are pretty universal, so if the card dies or becomes obsolete, just reuse it for another card.

FWIW, I just installed a Deepcool V6000 on an XFX 5870 @1 GHz with a dead fan.
I bought it for ~$30. Temps dropped from ~70C at deafening noise to 54C, and you cannot hear the thing with the supplied fans (fixed low speed; they don't connect to the video card fan header and are therefore not variable, which I do find a problem, but it seems to work well enough).
full member
Activity: 155
Merit: 100
February 18, 2012, 04:15:29 AM
#7
I had the same problem before with an HD4670. I also thought the card was dying, but I was wrong. For a few weeks I ran the card at lower and lower frequencies, until I replaced the PSU with a new one and the problem disappeared.

If you can, try with another PSU.
legendary
Activity: 2058
Merit: 1452
February 16, 2012, 08:25:32 PM
#6
So quick to judge...

WD-40 make a range of products. The traditional product is primarily a water displacer, true, but it *does* have lubricant properties. For the sort of quality fan that XFX put on their cards, it's probably adequate.

However the WD-40 silicone lubricant that I'm spraying into my fans seems to work.

As to 'destroying them quickly' - every unit that has received a rebuild from me, including the dreaded WD-40, that worked on re-connection is still working. And the fan speeds are not seizing up.

So whilst I respect your opinions, I'll go with my experience for the time being. If you're right, then hopefully I'd have switched over to FPGAs by the time the cards all blow up... ;)
WD-40 also likes to attract dust, which could worsen your fan's performance over the long run.
full member
Activity: 131
Merit: 100
February 16, 2012, 04:39:18 PM
#5
any card with a freely-turning fan gets a big spray of WD-40 into the fan internals (most of the GPU fans aren't serviceable, unfortunately, so can't get to the bearings and lubricate *properly*).

So much fail... WD-40 is not a lubricant. It is a water displacer and has no lubrication properties. Spraying WD-40 into a fan bearing will destroy it extremely fast.
sr. member
Activity: 362
Merit: 250
February 16, 2012, 02:50:39 PM
#4
WD-40 into the fan internals

That is a horrible thing to do. No wonder you have so many problems.
full member
Activity: 210
Merit: 100
February 16, 2012, 10:56:37 AM
#3
The temps are as steady as ever, lately even a couple of degrees lower due to the winter conditions.
There are no clogged filters or malfunctioning fans. Fan speed is at expected 2800 RPM.

I might launch some older cgminer version using the old phatk kernel but honestly I don't think I care to.

Excerpt from my thermal log:
Code:
2012-02-16  16:02  local
( other stats edited out )
[c0c]    940           820
[c0t]    77.50 C
[c0u]    99%
[c0f]    52%

Let's compare that with an entry from two months ago:
Code:
2011-12-16  21:02  local
(...)
[c0c]    942           820
[c0t]    78.50 C
[c0u]    99%
[c0f]    52%

c0 as in Cayman0; c for clocks, t for temp, u for usage, f for fan. The exact RPM had to be checked in cgminer.
This arrangement allows me to just grep the desired parameters from the weekly-rotated thermal log.
GPU parameters are logged every 6 minutes; every hour, additional info such as CPU temperature, date and time is gathered.
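
For anyone who wants more than grep, here is a minimal sketch of parsing entries in the format shown above and comparing two of them. The field names and the function name are made up for the example, and treating the two numbers on the clocks line as one tuple is an assumption, since the post does not say what the second column is.

```python
import re

# Matches lines such as "[c0t]    77.50 C": device tag, one-letter field, value.
LINE_RE = re.compile(r"^\[(?P<dev>[a-z]\d+)(?P<field>[ctuf])\]\s+(?P<value>.+)$")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

# Hypothetical readable names for the one-letter field codes.
FIELD_NAMES = {"c": "clocks_mhz", "t": "temp_c", "u": "usage_pct", "f": "fan_pct"}


def parse_entry(text):
    """Parse one log entry into {device: {field_name: value}}."""
    devices = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # timestamp line or elided stats
        numbers = [float(n) for n in NUMBER_RE.findall(match["value"])]
        if not numbers:
            continue
        # The clocks line carries two numbers; keep both as a tuple.
        value = tuple(numbers) if match["field"] == "c" else numbers[0]
        devices.setdefault(match["dev"], {})[FIELD_NAMES[match["field"]]] = value
    return devices


december = parse_entry("""2011-12-16  21:02  local
(...)
[c0c]    942           820
[c0t]    78.50 C
[c0u]    99%
[c0f]    52%""")

february = parse_entry("""2012-02-16  16:02  local
( other stats edited out )
[c0c]    940           820
[c0t]    77.50 C
[c0u]    99%
[c0f]    52%""")

delta = february["c0"]["temp_c"] - december["c0"]["temp_c"]
print(f"c0 temperature change: {delta:+.2f} C")
```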
member
Activity: 121
Merit: 10
February 16, 2012, 10:45:38 AM
#2
Are the temps the same now as always? That might also explain the sudden instability.

Can't say I've personally run into this degradation; I thought it only really happens with large-ish overvolting in new hardware. Not that I would be too surprised, chips are individuals :)
full member
Activity: 210
Merit: 100
February 16, 2012, 07:16:38 AM
#1
My ASUS 6950 DCII 1GB seems to be affected at last(1).

The card has been hashing since early September. Since late October the core speed was 942 MHz resulting in 389 MHash/s.
945 MHz would bring the card to its knees in a matter of hours. 942 MHz was stable for months... until this week.
The card hung a few days ago and twice yesterday.
935 MHz has been stable for a day now...

I was fully aware that setting the "cruise speed" so close to unstable clocks left me with no buffer zone - I did that more as an experiment into degradation progression than anything else.

I never liked this particular card much for its forced 125 MHz memdiff and locked core voltage resulting in power use of about 150W and temps in the high 70s.
Still, the degradation rate has been slower than I expected it to be.

Once the card is incapable of pulling 900 MHz I'll give it a second life in some gamer's rig provided no other issues emerge.

Just a data point for anyone interested in the subject.


Notes:
(1) Unless it's just the latest cgminer pushing it harder than previous versions. The hash speed hasn't changed so I don't think the miner is responsible for the loss of stability.
     I don't feel like digging deeper into the matter or trying different kernels, I'll just drop clocks until stability is regained.
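
The "drop clocks until stability is regained" approach can be sketched as a simple step-down loop. Everything here is illustrative: `is_stable` is a hypothetical caller-supplied check (in practice something like "no hang over N hours of mining"), the 7 MHz step mirrors the 942 to 935 MHz move, the 900 MHz floor is the retirement threshold mentioned in the post, and the 937 MHz limit in the stand-in check is invented so the example can run.

```python
def find_stable_clock(start_mhz, floor_mhz, step_mhz, is_stable):
    """Lower the clock in fixed steps until is_stable(clock) returns True.

    Returns the first stable clock found, or None if nothing at or above
    floor_mhz passes. is_stable is a hypothetical, caller-supplied check.
    """
    clock = start_mhz
    while clock >= floor_mhz:
        if is_stable(clock):
            return clock
        clock -= step_mhz
    return None


def simulated_check(clock_mhz):
    """Stand-in for a real burn-in test: pretend anything above 937 MHz hangs."""
    return clock_mhz <= 937


result = find_stable_clock(942, 900, 7, simulated_check)
print(f"first stable clock: {result} MHz")
```

A `None` result would correspond to the point where the card can no longer hold 900 MHz.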