
Topic: Im convinced mining or some settings while mining destroys 5970's VRAM... =*(

legendary
Activity: 1624
Merit: 1001
All cryptos are FIAT digital currency. Do not use.
I forgot to mention, I pulled the 5870 and replaced it with a 6970.  The 5870 ran at 90C+; the 6970, which should run warmer, sits at a happy 74C at 85% fan.  The 5870 was at 92C with 100% fan.  Both units use the same single-fan Powercolor heat sink.

The 6000s run cooler than the 5000s because they are normally more efficient with their power draw. I say "normally" because cheap cooling, VRMs, and power phases can and will make a stock 7870 run hotter than an overclocked 5850. lol
 
hero member
Activity: 1246
Merit: 501
I forgot to mention, I pulled the 5870 and replaced it with a 6970.  The 5870 ran at 90C+; the 6970, which should run warmer, sits at a happy 74C at 85% fan.  The 5870 was at 92C with 100% fan.  Both units use the same single-fan Powercolor heat sink.
legendary
Activity: 2450
Merit: 1002
As I said in another thread, my 5870 finally gave up yesterday.  It was an old, used Powercolor I got cheap off eBay.  I had to replace the TIM on it when I got it, as the person who owned it before had used some crap "silver" compound that basically ran off when it was heated.

After about 4 months of solid mining, I noticed the temp going up to 92C.  Checked the video output and it was suffering massive corruption.  Recleaned the TIM, made sure everything was 100%, but it still went to 92C almost instantly.  When cgminer stopped mining, temps dropped quickly, proving the heat sink was working OK.



Wow! Now, I've never seen that type of behaviour when a GPU / video card "dies"... that suxxors!
As of a few months ago, I managed to get that severely damaged GPU on the 5970 working somewhat reliably again. I lowered the thread count in cgminer to 1, and that GPU now only gets HW errors 10% of the time on average rather than 50% of the time (I think because it's using slightly less RAM on one thread).
Also, my other 5970, which I thought still had good RAM, is getting faulty now. After 2 years of mining I had to lower its RAM clock from 300MHz down to 200MHz to prevent HW errors on one of its GPUs =(

So, the official failure rate for my 5970s is 100%, and that is at temps that are well within acceptable limits (VRMs consistently below 90C, all other temps below 70C). NEVER overvolted, always undervolted GPU.
As for my 5800s: in the time I've owned them they have not deteriorated at all.
hero member
Activity: 1246
Merit: 501
As I said in another thread, my 5870 finally gave up yesterday.  It was an old, used Powercolor I got cheap off eBay.  I had to replace the TIM on it when I got it, as the person who owned it before had used some crap "silver" compound that basically ran off when it was heated.

After about 4 months of solid mining, I noticed the temp going up to 92C.  Checked the video output and it was suffering massive corruption.  Recleaned the TIM, made sure everything was 100%, but it still went to 92C almost instantly.  When cgminer stopped mining, temps dropped quickly, proving the heat sink was working OK.

hero member
Activity: 896
Merit: 1000
Undervolting can "potentially" cause issues, because...

Lower voltage = higher amps @ "any wattage"
Regulators direct voltage to "dump" it as heat. The higher the voltage, the less it is "dumping", thus, cooler regulators.


The wattage is not fixed; it is determined by the voltage. The lower the voltage, the lower the wattage, so undervolting does not push the amps up. I agree with the other parts of your argument.
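To put rough numbers on the disagreement above: for the GPU core itself, the usual first-order CMOS model says switching power scales with voltage squared times clock, so wattage is not a constant when you undervolt. This is only a minimal sketch assuming that textbook approximation. The capacitance constant and the voltage/clock values are made up for illustration, and leakage and VRM efficiency are ignored.

```python
def dynamic_power(c_eff, volts, freq_hz):
    """First-order CMOS switching power: P ~ C * V^2 * f (leakage ignored)."""
    return c_eff * volts ** 2 * freq_hz


def current_draw(power_w, volts):
    """Current pulled from the core rail: I = P / V."""
    return power_w / volts


# Hypothetical values, chosen only to show the trend.
C_EFF = 1.0e-7        # effective switched capacitance in farads (assumption)
CLOCK_HZ = 725e6      # same core clock in both cases
STOCK_V = 1.05        # illustrative "stock" core voltage
UNDER_V = 0.95        # illustrative undervolt

stock_w = dynamic_power(C_EFF, STOCK_V, CLOCK_HZ)
under_w = dynamic_power(C_EFF, UNDER_V, CLOCK_HZ)

print(f"stock:     {stock_w:6.1f} W, {current_draw(stock_w, STOCK_V):6.1f} A")
print(f"undervolt: {under_w:6.1f} W, {current_draw(under_w, UNDER_V):6.1f} A")
```

Under this model both the watts and the amps drop when the voltage drops at a fixed clock, which is the opposite of the "higher amps at any wattage" argument. That argument only holds for a load that draws constant power regardless of voltage.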
420
hero member
Activity: 756
Merit: 500
I think the lesson learned here is: a hot GPU is a dying GPU. Keep your GPU as cool as freaking possible by all means, even if it doesn't seem to be running too hot.

How hot is too hot?
sr. member
Activity: 406
Merit: 250
I think the lesson learned here is: a hot GPU is a dying GPU. Keep your GPU as cool as freaking possible by all means, even if it doesn't seem to be running too hot.
hero member
Activity: 504
Merit: 500
Undervolting can "potentially" cause issues, because...

Lower voltage = higher amps @ "any wattage"
Regulators direct voltage to "dump" it as heat. The higher the voltage, the less it is "dumping", thus, cooler regulators.

However... using any card for "x-hours" is what ultimately kills it. When you use it in a game-system, you are using it intermittently. Thus, it may take three years to get one year's worth of "circuit-wear" on the board. As opposed to using it "constant duty cycle", where a year of wear happens in 9-12 months.

Also, these are SMT chips. They use small amounts of "low lead" solder-paste to mount the chips to the board. This type of solder is prone to "shearing", or "pulling-off", from cold-shock cycling. Thus, try applying an SMT solder-temp heat source to the chip to "rebond" the "cold-welds" that have pulled off the contacts. (That is the "x-box" oven trick. However, do NOT put these in your oven. Not all the components are "oven safe". The boards are later populated with non-SMT components that will be destroyed if you attempt to oven-bake them.)

Also, as stated above... the heat from the GPU is stupidly spread to the VRMs by the heat-sink, and the hot air is also blown directly onto all the components of the board. That will degrade capacitors, resistors, and any other non-sunk transistor that normally would not be exposed to these higher and constant heat levels.

Also, there is the issue of "BIOS corruption". BIOS flashes are not that great. They still use cheap programmable memory, which decays in heat. (Thus the use of many "removable" and "reflashable" BIOS ROMs.) If the BIOS is corrupt, it can make it seem as if portions of the card are not functioning correctly, when they are perfectly fine.

However, this is the first time I have ever heard (with serious conviction), of "one card" having this issue. So, it may be realistically related to a design fault, or physical stress fault. As it has been confirmed over and over, that "mining", does not "cause damage". You don't have that much actual control over the cards components. With the exception of physical failures, which would kill an unmodified card, without any discrimination to user settings.

If there was a 100% "yes, this killed my card", then I would worry. However, I have barely seen 5% with this issue, which is actually indicative of "manufacturing error" and "physical hardware failure". Completely unrelated to "mining", other than the issue about "constant duty" and "cold shock", which happens when you play a game for 4-20 hours in a row also.

Not to mention, you have no idea if the cards had a previous bios-mod, or "driver glitches", that could have contributed to the shortness of life. (On anything other than the cards you purchased from the store directly.)

The number one killer of video-cards, is running them trapped inside of a heat-trap box, called a CPU tower. (That and water-coolers, which do not adequately cool the "rest of the components", and contribute to heat-stress-shearing, which fans reduce by keeping everything a constant temperature.)
hero member
Activity: 756
Merit: 501
There is more to Bitcoin than bitcoins.
Did you follow anti-static and power procedures when installing/removing them?
This may very well explain the syndrome. Periodic removal for cleaning without careful ESD management will eventually lead to damage.
hero member
Activity: 784
Merit: 504
Dream become broken often
I've bought 2 5970 cards off ebay with the knowledge that they were already artifacting...he said he mined on them and they wouldn't hold up anymore...so i got them n put them in a bamt box...1 card mined for about a month before it was throwing off bamt so much i had to remove it...other card just kept chugging away...for giggles i put it in a machine and sure enough...couldn't make out anything on the screen trying to play a game...bought them around the new years i think...

well finally couple days ago i noticed my bamt box was acting funny...restarts...hangs...took out the other 5970 and it works like a charm again :( but least i got some life outta them before they bit the dust...I don't think any video card is meant to be run 24/7 and with the 5970 thats 2 vid cards crammed into a tiny space...just like ppl that laptop mine kill their laptops...just ain't meant to be abused like that...oh well...time to put them up on ebay n hope they sell for a decent price :D
hero member
Activity: 896
Merit: 1000
Anyone mine LTC and get the same result?

My 5970 works on BTC, but not on LTC.
full member
Activity: 196
Merit: 100
Like I said, the damage is not seen in mining btc. It shows up when trying to do gaming =/

note to self: may be limited to bitcointalk.org marketplace when re-selling 5970's when mining with them becomes unprofitable

Run them at lower speeds and voltages and they will be fine even for gaming, so long as they don't run in a hot environment and you keep the fans working.

I'm thinking the lower speeds + the constant heat may be what's killing the RAM... I don't know, it's just weird and really sucks.
I've tested my other 2 5970s and their RAM is still good. So, that's 2 that have gone bad and 2 that are still good, for me.

I've never, NEVER, overvolted them either. Ran them at stock or lower voltage.

Did you follow anti-static and power procedures when installing/removing them?
420
hero member
Activity: 756
Merit: 500
Anyone mine LTC and get the same result?
legendary
Activity: 1624
Merit: 1001
All cryptos are FIAT digital currency. Do not use.
This type of hardware failure is to be expected when you consider how hot the VRMs run on the HD 5900/5800 series.

Anything over 65C will severely shorten your gear's lifespan. It is a complete farce that the parts makers want us to believe that 70C+ is acceptable.
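For a rough sense of how temperature maps to component life, one widely quoted rule of thumb for aluminum electrolytic capacitors is that expected life roughly doubles for every 10C drop below the rated temperature. The sketch below assumes that rule. The rated hours and rated temperature are hypothetical datasheet-style numbers, not values for any card in this thread, and the rule says nothing about GPU dies or VRAM.

```python
def estimated_life_hours(rated_hours, rated_temp_c, actual_temp_c):
    """Rule-of-thumb estimate: life doubles per 10C below the rated temperature."""
    return rated_hours * 2 ** ((rated_temp_c - actual_temp_c) / 10)


# Hypothetical capacitor rating (assumption, for illustration only).
RATED_HOURS = 5000
RATED_TEMP_C = 105

for temp_c in (65, 75, 85, 95):
    hours = estimated_life_hours(RATED_HOURS, RATED_TEMP_C, temp_c)
    years = hours / (24 * 365)
    print(f"{temp_c}C: about {hours:,.0f} h ({years:.1f} years of 24/7 use)")
```

The absolute numbers depend entirely on the assumed rating. The useful part is the ratio: under this rule a part held 10C cooler is expected to last about twice as long.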
legendary
Activity: 1386
Merit: 1004
do we have to worry about nvidia cards too :O :'( :o

Well so far Ive only seen this behaviour in 5970's .. my 5850's & 5830 seem to be fine still, as well as my 6950.
Who knows bout nv cards =P

I run 14 5850 cards, and not a single one of them has had video/display issues. 2 of those I use for active gaming while mining. (too lazy to turn off mining when I want to play games, so they've got really low priorities set) I'm selling off a few of the cards to more gamers, so I guess they'll let me know if they've found issues through prolonged gaming sessions.

I can now after a long period agree with the OP.  I run all kinds of cards and only the 5970's have done this.  I have pretty conservative settings, underclocked on most and undervolted.  Cards now fail on stock settings, fans are good.  I am going to attempt to re-pad and replace thermal compound. 
full member
Activity: 155
Merit: 100
do we have to worry about nvidia cards too :O :'( :o

Well so far Ive only seen this behaviour in 5970's .. my 5850's & 5830 seem to be fine still, as well as my 6950.
Who knows bout nv cards =P

I run 14 5850 cards, and not a single one of them has had video/display issues. 2 of those I use for active gaming while mining. (too lazy to turn off mining when I want to play games, so they've got really low priorities set) I'm selling off a few of the cards to more gamers, so I guess they'll let me know if they've found issues through prolonged gaming sessions.
full member
Activity: 126
Merit: 100
Quote
had to replace a few fans, but no other issue

Are your fans blowing down into the cards? No air-flow over the board components would cause this.
Other than that, you can get mini RAM heatsinks off ebay.

Also happens to cards with water blocks...

Video card water block and copper RAM heatsinks
http://www.youtube.com/watch?v=VXJ0u1J9mRU
legendary
Activity: 952
Merit: 1000
I always used CGMiner to underclock the RAM, as it works great on 5xxx cards. I always kept it at the stock 300MHz tho, and I never had any problems.

The idle speeds for those cards were 150MHz core and 300MHz mem. I know it doesn't make any sense, but I never liked going lower than the stock minimum, so I kept it there. Worked well for me, I guess, as I never had any cards go bad.
legendary
Activity: 1624
Merit: 1001
All cryptos are FIAT digital currency. Do not use.
What method are you guys using to downclock the GPU's RAM?

With voltage control/monitoring enabled, I've noticed that MSI Afterburner does not set the volts to the number I dialed in.

Use HWMonitor, GPU-Z and/or SpeedFan to monitor the GPU's voltage and see if it is where it should be under load.
hero member
Activity: 756
Merit: 501
There is more to Bitcoin than bitcoins.
I used to mine on 5830s, and all of them ended up doing this during normal (non-mining) use:

[image not preserved]

Temperature when mining was typically 62-68C. Sometimes they would be ok for days or weeks, then go crazy with checkerboard artifacts. Hardware acceleration in Firefox or 3D applications typically makes things worse: ATI driver crashes, etc.

Not sure how reliable it is, but MemtestCL reports:

Code:
Test summary:
-----------------------------------------
50 iterations over 128 MiB of memory on device Cypress
      Moving inversions (ones and zeros): 0 failed iterations
                                         (0 total incorrect bits)
                 Memtest86 walking 8-bit: 0 failed iterations
                                         (0 total incorrect bits)
              True walking zeros (8-bit): 0 failed iterations
                                         (0 total incorrect bits)
               True walking ones (8-bit): 0 failed iterations
                                         (0 total incorrect bits)
              Moving inversions (random): 0 failed iterations
                                         (0 total incorrect bits)
             True walking zeros (32-bit): 0 failed iterations
                                         (0 total incorrect bits)
              True walking ones (32-bit): 0 failed iterations
                                         (0 total incorrect bits)
                           Random blocks: 3 failed iterations
                                         (2961 total incorrect bits)
                     Memtest86 Modulo-20: 0 failed iterations
                                         (0 total incorrect bits)
                           Integer logic: 0 failed iterations
                                         (0 total incorrect bits)
                 Integer logic (4 loops): 0 failed iterations
                                         (0 total incorrect bits)
            Integer logic (local memory): 0 failed iterations
                                         (0 total incorrect bits)
   Integer logic (4 loops, local memory): 0 failed iterations
                                         (0 total incorrect bits)
Final error count: 3 test iterations with at least one error; 2961 errors total

and stuff like

Code:
Error at [3864886C]: must be 00000004, but found 04000004 (bits: 00000100000000000000000000000000)
Error at [0009EB60]: must be 00000100, but found 44060100 (bits: 01000100000001100000000000000000)
Error at [0009EB64]: must be 00000100, but found 12100100 (bits: 00010010000100000000000000000000)
Error at [0009EB68]: must be 00000100, but found 22060100 (bits: 00100010000001100000000000000000)
Error at [0009EB6C]: must be 00000100, but found 74020100 (bits: 01110100000000100000000000000000)
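The error lines above are easier to compare once the expected and found words are XORed, which is what the "bits:" column already shows. Here is a minimal sketch that parses lines in that format and counts the flipped bits per word. The regular expression is an assumption based only on the five lines quoted here, not on MemtestCL's documented output format.

```python
import re

# Pattern inferred from the sample lines above (assumption, not an official format).
ERROR_LINE = re.compile(
    r"Error at \[([0-9A-Fa-f]+)\]: must be ([0-9A-Fa-f]{8}), but found ([0-9A-Fa-f]{8})"
)


def decode(line):
    """Return (address, flipped-bit mask, flipped-bit count), or None if no match."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    addr, expected, found = (int(group, 16) for group in match.groups())
    mask = expected ^ found  # bits that differ between expected and found
    return addr, mask, bin(mask).count("1")


SAMPLE = """\
Error at [3864886C]: must be 00000004, but found 04000004
Error at [0009EB60]: must be 00000100, but found 44060100
Error at [0009EB64]: must be 00000100, but found 12100100
Error at [0009EB68]: must be 00000100, but found 22060100
Error at [0009EB6C]: must be 00000100, but found 74020100
"""

results = [r for r in map(decode, SAMPLE.splitlines()) if r is not None]

for addr, mask, nbits in results:
    print(f"{addr:08X}: flipped mask {mask:08X} ({nbits} bit(s))")
```

On these five lines every flipped bit lands in the upper 16 bits of the 32-bit word, and four of the five addresses are consecutive words. Whether that points at a particular memory chip or data lane is not something this small sample can establish on its own.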

I've sold all but one card locally and never heard back from the buyers, even though the deal was that I would hold onto their money for a week until they checked that the cards worked in their systems. Is it VRAM? Is it something about my motherboard or power supply? My system RAM tests ok.  No idea. I'll RMA this card and see what they find out.
