Topic: Interesting 5970 HW Problem (Read 4527 times)

hero member
Activity: 896
Merit: 532
Former curator of The Bitcoin Museum
October 31, 2012, 07:31:12 AM
#16
Sitting in a 35-40 degree Celsius environment, they run at about 75-80 degrees (some go as high as 85).

When they were sitting on my balcony they were sitting at 70-75.

This is with the latest (2011) BIOS updates and the fan set to automatic. The fans are usually zooming away at 80% (4200 RPM).

legendary
Activity: 1274
Merit: 1004
October 29, 2012, 01:23:14 PM
#15
The most stable settings I've seen so far with a 5970 are:
1.075v
800-820 (core clock)
300 (mem)

Wow, what kind of temps do you get with that overclock and overvolt?
hero member
Activity: 896
Merit: 532
Former curator of The Bitcoin Museum
October 29, 2012, 10:35:17 AM
#14
The most stable settings I've seen so far with a 5970 are:
1.075v
800-820 (core clock)
300 (mem)
legendary
Activity: 1988
Merit: 1012
Beyond Imagination
October 28, 2012, 03:16:50 PM
#13
Shouldn't they be set to at least 300 MHz?

I tried many different RAM clock combinations without luck. I've been running at 150 MHz for months without a problem. I think it is the GPU that is starting to degrade, but that is difficult to prove, since raising the voltage does not really help.
hero member
Activity: 896
Merit: 532
Former curator of The Bitcoin Museum
October 28, 2012, 09:50:04 AM
#12
Shouldn't they be set to at least 300 MHz?
legendary
Activity: 1988
Merit: 1012
Beyond Imagination
October 28, 2012, 04:09:52 AM
#11
Same here: one of my 5970 cores now runs with 50% HW errors. I've tried everything: changing the voltage and frequency, replacing the thermal compound. Nothing works. I'm running version 2.3.3 of cgminer and never had a problem before.

After cgminer starts, it works without problems for the first several minutes, but once the temperature goes above 40 degrees C the errors start to appear. This began weeks after I changed the heat sink to an Accelero Xtreme and the thermal compound to Coollaboratory Liquid Ultra, so the GPU cooling had actually gotten much better.

Maybe, as someone said, it is a RAM problem, since I did not put any heat pads on the RAM with the new cooler, but in any case the RAM has stayed at 150 MHz.
sr. member
Activity: 472
Merit: 250
October 28, 2012, 01:49:34 AM
#10
One of my rigs had a very similar issue. I put two 5970s together into one machine. I copied my 2.7.5 folder from a machine that was running the exact same setup (64 ultimate, 11.12, SDK 2.1, cgminer 2.7.5). When I fired up cgminer I would only get HW errors. It refused to accept any shares no matter what I had the cards clocked at. To test, I installed 2.7.7, and it has worked fine since without any issues whatsoever.
hero member
Activity: 658
Merit: 500
October 26, 2012, 01:00:22 PM
#9
How are you controlling your clocks? Flash modding, Afterburner, cgminer, or something else?
legendary
Activity: 2450
Merit: 1002
October 26, 2012, 11:57:28 AM
#8
Actually, I never looked at the error in debug output, it went by too fast. I just noticed when HW error was thrown in normal mode, it said "HW error, invalid nonce"

I got the HW errors to vanish at 170 MHz RAM. If it doesn't work at 150 MHz RAM, your RAM is most likely completely fuxxored. I still strongly suspect that mining somehow damages the 5970's RAM.
legendary
Activity: 1274
Merit: 1004
October 26, 2012, 10:43:45 AM
#7
I've had a problem similar to this, but my 5970 (one GPU) flagged HW errors about 25% of the time. Mine is because the RAM gradually got worse and worse. By lowering the RAM from 300 to 170, the HW errors are gone, for now.
Are your HW errors "invalid nonce" errors? That's what mine appear as for the shares in cgminer.
I'm willing to bet there's not a single ounce of good RAM on the GPU that's flagging the errors.

 [2012-10-26 09:16:20] GPU 3 found something?
 [2012-10-26 09:16:20] OCL NONCE 1931483178 found in slot 0
 [2012-10-26 09:16:20] No best_g found! Error in OpenCL code?

For comparison, this is what I normally see:
 [2012-10-26 09:16:26] GPU 2 found something?
 [2012-10-26 09:16:26] OCL NONCE 2442951545 found in slot 0
 [2012-10-26 09:16:26]  Proof: 00000000410ba6fe4fff57acaa2a9a4a358e5fdbc20a441a6951b215356b1725

I'll try running the RAM at 150 instead of 200 to see if that helps.
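
To make the difference between the two excerpts above concrete, here is a rough Python sketch that tallies them per GPU. It only uses the six debug lines quoted in this post. The function name `tally`, the category labels, and the assumption that lines from different GPUs never interleave are mine, not anything cgminer provides. The leading-zeros check reflects the usual difficulty-1 share condition (top 32 bits of the hash are zero).

```python
import re
from collections import Counter

# The six debug lines quoted above, copied verbatim.
LOG = """\
 [2012-10-26 09:16:20] GPU 3 found something?
 [2012-10-26 09:16:20] OCL NONCE 1931483178 found in slot 0
 [2012-10-26 09:16:20] No best_g found! Error in OpenCL code?
 [2012-10-26 09:16:26] GPU 2 found something?
 [2012-10-26 09:16:26] OCL NONCE 2442951545 found in slot 0
 [2012-10-26 09:16:26]  Proof: 00000000410ba6fe4fff57acaa2a9a4a358e5fdbc20a441a6951b215356b1725
"""

GPU_RE = re.compile(r"GPU (\d+) found something\?")
PROOF_RE = re.compile(r"Proof: ([0-9a-f]{64})")


def tally(log_text):
    """Count outcomes per GPU. Assumes lines for one nonce stay together."""
    counts = Counter()
    current_gpu = None
    for line in log_text.splitlines():
        gpu_match = GPU_RE.search(line)
        if gpu_match:
            current_gpu = int(gpu_match.group(1))
            continue
        if current_gpu is None:
            continue
        if "No best_g found" in line:
            # The nonce reported by the GPU did not check out on the CPU.
            counts[(current_gpu, "hw_error")] += 1
            current_gpu = None
            continue
        proof_match = PROOF_RE.search(line)
        if proof_match:
            # A difficulty-1 share has its first 32 bits (8 hex digits) zero.
            ok = proof_match.group(1).startswith("00000000")
            counts[(current_gpu, "valid" if ok else "bad_proof")] += 1
            current_gpu = None
    return counts


if __name__ == "__main__":
    for (gpu, kind), n in sorted(tally(LOG).items()):
        print(f"GPU {gpu}: {kind} x{n}")
```

On a longer capture, the ratio of `hw_error` to `valid` for each GPU would give the per-core error rate directly, provided the log format matches these lines.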
legendary
Activity: 2450
Merit: 1002
October 26, 2012, 09:49:43 AM
#6
I've had a problem similar to this, but my 5970 (one GPU) flagged HW errors about 25% of the time. Mine is because the RAM gradually got worse and worse. By lowering the RAM from 300 to 170, the HW errors are gone, for now.
Are your HW errors "invalid nonce" errors? That's what mine appear as for the shares in cgminer.
I'm willing to bet there's not a single ounce of good RAM on the GPU that's flagging the errors.
hero member
Activity: 896
Merit: 532
Former curator of The Bitcoin Museum
October 26, 2012, 07:54:36 AM
#5
Crack open the GPU, clean off any thermal paste, and put some new stuff on.

Come back and tell us what happens.
legendary
Activity: 1274
Merit: 1004
October 25, 2012, 11:14:09 PM
#4
Open up GPU-Z and look at all the GPU temps on the sensor tab, not just the "main" one. I'll bet you have a hot spot on part of the core somewhere where the thermal paste is migrating. I've had mine report the temp at 55 and had a hot spot that was 87.

At 750MHz/200MHz/1V they're running 68/63/62, with the GPU VRMs at 77/78/76. As I said though, this is independent of clock speed or voltage and by extension temperature. I can run at 400MHz/0.95V and turn off the other core and I still see utilization about half of what it should be, with an (almost) equal distribution of HW errors and accepted shares.
hero member
Activity: 854
Merit: 500
einc.io
October 25, 2012, 11:08:57 PM
#3
Sorry, I can't help.
I have never seen it before either.

I restarted one of my rigs last month using TeamViewer on my mobile phone, and after the restart the 7950 in that rig only hashed at 50 MH/s while the 6970 cards were hashing normally.
I never had any problems with TeamViewer before.
Nothing I tried helped until I uninstalled all the drivers and installed them again.
I also had CGminer 2.7.5 installed on that rig.
I have since upgraded CGminer to the latest version.
Have you already tried uninstalling and reinstalling the GPU drivers?

I hope reinstalling the drivers helps.
hero member
Activity: 658
Merit: 500
October 25, 2012, 11:03:45 PM
#2
Open up GPU-Z and look at all the GPU temps on the sensor tab, not just the "main" one. I'll bet you have a hot spot on part of the core somewhere where the thermal paste is migrating. I've had mine report the temp at 55 and had a hot spot that was 87.
legendary
Activity: 1274
Merit: 1004
October 25, 2012, 10:37:12 PM
#1
I'm having an issue with one of my 5970s, and it's kind of stumping me. In the last couple of days, one of its cores has started to show hardware errors in CGminer. It's only the one core, and it happens regardless of frequency or voltage. The weird thing is that the number of accepted shares is the same as the number of HW errors. I.e., if I have 3000 accepted shares I'll have close to 3000 HW errors, within less than 1%.

I'm using CGminer 2.7.5 and the Diablo kernel, and restarts don't seem to help. Normally, if I saw this distribution of errors in what should be a random process I would think a bit is stuck, but I'm not sure how that would apply in this case. Has anyone seen anything like this or have any ideas what could be causing it?
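
To illustrate why a near-exact 50/50 split makes me think of a stuck bit, here is a toy Python sketch. It is not a model of the actual Diablo kernel or the 5970 hardware. It just assumes a 32-bit result whose bits are uniformly random and forces one bit (chosen arbitrarily here as bit 7) to zero. The function name `simulate_stuck_bit` and all parameters are made up for the illustration.

```python
import random
from typing import Tuple


def simulate_stuck_bit(trials: int, stuck_bit: int, seed: int = 0) -> Tuple[int, int]:
    """Return (good, bad) counts when one bit of a 32-bit value is stuck at 0."""
    rng = random.Random(seed)
    # Mask that clears only the stuck bit and keeps the other 31 bits.
    mask = ~(1 << stuck_bit) & 0xFFFFFFFF
    good = 0
    bad = 0
    for _ in range(trials):
        true_value = rng.getrandbits(32)
        observed = true_value & mask  # the faulty path forces this bit to 0
        if observed == true_value:
            good += 1  # the bit was already 0, so the result is unaffected
        else:
            bad += 1   # the bit should have been 1, so the result is wrong
    return good, bad


if __name__ == "__main__":
    good, bad = simulate_stuck_bit(trials=100_000, stuck_bit=7)
    print(f"good={good} bad={bad} bad fraction={bad / (good + bad):.3f}")
```

Because the affected bit is 1 about half the time in uniformly random data, about half the results come out wrong, which matches the accepted-to-HW-error ratio I'm seeing. What I can't work out is where in the actual hardware such a bit would sit.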