Topic: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate (Read 2122 times)

sr. member
Activity: 392
Merit: 250
To start, I would replace the thermal paste on the hot card. Then I would run the rig with just one card at a time and see whether both cards give rejected shares. It appears they both do, but it's possible it's just one. Once you have eliminated the video cards as the problem, you can assume it's a driver or software problem. Beyond that, I would check the internet connection. I would also just install Windows and run cgminer; that would be a quick way to rule out software problems.
newbie
Activity: 20
Merit: 0
@deepceleron: Thanks, I'll give that a try this weekend.  Post or PM me your BTC addy, I'll send you the bounty.
member
Activity: 266
Merit: 36
Quote
... 700mhz is stock for 5970's. ...

Says here: Engine clock speed: 725 MHz
full member
Activity: 134
Merit: 100
Quote
The 5970s have a rear exhaust right? Which way is the box fan pointed?

they are in the correct direction relative to fan

What I meant was: is the box fan pushing the exhaust heat back into the card or away from it?
legendary
Activity: 1344
Merit: 1004
Quote
The 5970s have a rear exhaust right? Which way is the box fan pointed?

They are in the correct direction relative to the fan.
full member
Activity: 134
Merit: 100
The 5970s have a rear exhaust right? Which way is the box fan pointed?
legendary
Activity: 1512
Merit: 1036
Quote
@deepceleron: you nailed it with upgrading to phoenix 1.7.3 for lowering my reject rate.  Been running all day, reject rate is now at .134%!  I guess phoenix 1.7.0 had issues with rejected shares?

I lowered my core clock from 780mhz to 720mhz; 700mhz is stock for 5970's.  This lowered my hashrate by 100 MHash (not a big deal).  I still had a couple of lockups today though.  I'll bring it down to 700mhz tomorrow.

As of now at least 1.5 btc of the bounty is going to deepceleron, probably the remaining 1.5 btc too if there are no more suggestions.

That was a bugfix that was rolled out. I didn't find a definitive post in the Phoenix thread stating "this fixes all your stales", but it no longer lags for .5 seconds, and it also supports rollntime, so your miner doesn't have to request more work if it completes a nonce space (supported on some pools).

The one card that is running significantly hotter - you could pull it and make sure it has well-applied thermal grease. It should only have a very thin coating on components (not blobs squeezed out), and can benefit from an upgraded thermal paste like arctic silver 5. Clean the old stuff with 99% rubbing alcohol (or Everclear), and re-apply a paper thin coat of new TIM on the chips. Put the heatsink on with the normal screws and take it back off again. You should see from the thermal paste impression on the heatsink that there is good contact being made.

If you still lock up that close to the stock core clock on the cards, I would remove one card at a time (maybe starting with that hot one) and put your miner back to overclocked. Run the card you removed by itself, mining in its own computer (your desktop computer or a $50 Craigslist Dell), and see if the problem follows one card.
newbie
Activity: 20
Merit: 0
@deepceleron: you nailed it with upgrading to phoenix 1.7.3 for lowering my reject rate.  Been running all day, reject rate is now at .134%!  I guess phoenix 1.7.0 had issues with rejected shares?

I lowered my core clock from 780mhz to 720mhz; 700mhz is stock for 5970's.  This lowered my hashrate by 100 MHash (not a big deal).  I still had a couple of lockups today though.  I'll bring it down to 700mhz tomorrow.

As of now at least 1.5 btc of the bounty is going to deepceleron, probably the remaining 1.5 btc too if there are no more suggestions.
newbie
Activity: 20
Merit: 0
@deepceleron: Thanks, I'll give those suggestions a go.  I'm not entirely opposed to lowering my core clock, especially if it will prolong the life of my cards.

@P4man
Quote
Also, clarify "lock ups". I have no experience with phoenix, but on cgminer, the rig will not lock up, just the cards will "die". Doesn't prevent me from SSH-ing into the machine. Are you getting complete freezes? Have you checked the dmesg log?

Thanks for the info.  By lockup, I don't mean that the rig itself completely locks up. It's pretty much as jake262144 described: it's a GPU hard lockup that freezes the miner.  I can still SSH in, but the OS is extremely laggy, and top sometimes reports that at least 1 instance of the phoenix miner is at 100% CPU.  I'm unable to "kill -9" the miner processes.  Smartcoin states that one or more GPUs are "<<< DOWN >>>".  The only recourse seems to be to reboot the rig.

@jake262144: Smartcoin has a lockup detection feature.  I haven't looked at the source, but from what I understand, if a GPU isn't responsive after X number of Smartcoin screen refreshes/iterations, it declares that a lockup has occurred.
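
To make the idea concrete, here is a rough sketch of that kind of "X misses in a row" check. It is not taken from the Smartcoin source; the function name, the "ok"/"down" status strings, and the threshold of 3 are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical sketch only: NOT Smartcoin's actual implementation.
# Declares a lockup once a GPU has been unresponsive for
# LOCKUP_THRESHOLD consecutive checks.

LOCKUP_THRESHOLD=3   # assumed value for "X"
misses=0             # consecutive unresponsive checks seen so far

# record_check STATUS
#   STATUS is "ok" or "down" for one refresh/iteration (assumed labels).
#   Returns 0 (true) once the threshold is reached, 1 otherwise.
record_check() {
    if [ "$1" = "ok" ]; then
        misses=0                  # any good reading resets the count
    else
        misses=$((misses + 1))
    fi
    [ "$misses" -ge "$LOCKUP_THRESHOLD" ]
}
```

A monitoring loop could then call something like `record_check "$status" && ./lockup.sh` on each refresh, so a single slow reading doesn't trigger a reboot.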

You can put a lockup.sh file in the smartcoin directory, and Smartcoin will run it when this occurs.  I do a forced reboot as that seems to be the only way to revive the card(s):
Code:
#!/bin/bash
/etc/init.d/gdm stop
shutdown -fr now
smartcoin --kill
full member
Activity: 210
Merit: 100
*nods his head* P4, DeepC

When I was searching for maximum stable clocks for my cards I did notice that DEAD != DEAD.
When my 6950 DCII crashes, it crashes like a ton of bricks introducing lock-ups of a few dozen seconds to any interaction with the OS.
Apparently, some bigshot kernel-mode code freezes the OS up when attempting to speak with the dead, until it times out and lets go.
OTOH, necromancgminer has been doing a terrific job raising the 6770s from the dead with no fuss.

If the OS really froze up, the script wouldn't do you much good.
Mind sharing your reboot-magic? I want to see how your script detects the "lock up" condition.

And yes, do lower your overclock.
Near-death lock-ups and crazy stale counts suggest that at least one of your cards can't take the beating.
hero member
Activity: 518
Merit: 500
Quote
Not what you want to hear, but I would run the cards closer to stock speed and then if it still locks up, overclock at least is eliminated as a cause of your problem. Pushing a card too hard would get me a random reboot every few days.

This. Stock speed and see what happens. Overclocks don't last forever; electromigration will reduce the stable overclocking speed over time (or kill the card outright). That time period is completely unpredictable: it could be weeks or decades. If you've seriously overheated the card, that isn't going to help; the electromigration rate rises exponentially with temperature.

Also, clarify "lock ups". I have no experience with phoenix, but on cgminer, the rig will not lock up, just the cards will "die". Doesn't prevent me from SSH-ing into the machine. Are you getting complete freezes? Have you checked the dmesg log?
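
On the dmesg question, a minimal sketch of what to look for. It assumes the proprietary ATI driver (fglrx) is what's loaded, and the search patterns are guesses at typical wording rather than a definitive list, so adjust them to whatever your driver actually logs.

```shell
#!/bin/bash
# gpu_log_lines: read kernel log text on stdin and keep only lines
# that mention the ATI driver or a GPU hang.
# The patterns below are assumptions, not an exhaustive list.
gpu_log_lines() {
    grep -iE 'fglrx|asic hang|gpu (hang|lockup)'
}

# Typical use right after a lockup (may need root on some systems):
#   dmesg | gpu_log_lines | tail -n 20
```

If nothing at all shows up around the time of a lockup, that in itself is worth mentioning here.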
legendary
Activity: 1512
Merit: 1036
Upgrade to phoenix 1.7.3, you'll see reject rate drop quickly. You could also switch to a pool with known low rejects that doesn't charge the highest fees in the pool biz (which is good for another 3%-10% "boost" to your bottom line).

Not what you want to hear, but I would run the cards closer to stock speed and then if it still locks up, overclock at least is eliminated as a cause of your problem. Pushing a card too hard would get me a random reboot every few days.
newbie
Activity: 20
Merit: 0
Hi.  I've got a dedicated mining rig that experiences lots of lockups and a high reject rate.  I need help figuring out what the problem(s) are and why they occur.

I'll pay 3 BTC to the first person whose suggestions fix the issue, or at least help lower my lockups and reject rates.  If there's a bunch of folks with valid suggestions, I'll split the bounty among them.  This bounty is valid for the next 72 hours, so it ends on Friday, January 19th 8PM PST.

Ok, the two issues are:

  • my reject rate is always between 10% - 15% doing pooled mining on deepbit.net
  • my rig experiences hard lockups, sometimes randomly and sometimes in odd patterns.  I usually have about 3 lockups per 24 hours, which seems kind of high.  But when lockups occur, they sometimes occur continuously, up to 5 times or more in an hour.  Then they'll just stop, and the rig runs great for the rest of the day.  I have a script that reboots the machine on each lockup and sends me a notification email.

My hardware info:

My software info:

Clocks on each GPU:
  • Core: 780 MHz
  • Memory: 330 MHz
Code:
user@linuxcoin:~$ sudo aticonfig --odgc && sudo aticonfig --adapter=all --odgt && sudo aticonfig --pplib-cmd 'get activity'

Default Adapter - ATI Radeon HD 5900 Series
                            Core (MHz)    Memory (MHz)
           Current Clocks :    780           330
             Current Peak :    780           330
  Configurable Peak Range : [550-1000]     [1000-1500]
                 GPU load :    95%

Adapter 0 - ATI Radeon HD 5900 Series
            Sensor 0: Temperature - 82.00 C

Adapter 1 - ATI Radeon HD 5900 Series
            Sensor 0: Temperature - 69.00 C

Adapter 2 - ATI Radeon HD 5900 Series
            Sensor 0: Temperature - 61.50 C

Adapter 3 - ATI Radeon HD 5900 Series
            Sensor 0: Temperature - 62.50 C
Current Activity is Core Clock: 780MHZ
Memory Clock: 330MHZ
VDDC: 1050
Activity: 95 percent
Performance Level: 2
Bus Speed: 5000
Bus Lanes: 16
Maximum Bus Lanes: 16

Fan: 80%
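
For anyone who wants to watch the temperatures over time, a small sketch that picks the hot adapters out of `aticonfig --adapter=all --odgt` output in the format shown above. The function name and the 75 C limit in the usage line are my own assumptions, not anything recommended by ATI.

```shell
#!/bin/bash
# hot_adapters LIMIT
#   Reads "aticonfig --adapter=all --odgt" style output on stdin and
#   prints "<adapter index> <temperature>" for every adapter whose
#   reading is above LIMIT (degrees C). LIMIT is chosen by the caller.
hot_adapters() {
    awk -v limit="$1" '
        /^Adapter [0-9]+ -/ { adapter = $2 }     # remember current adapter index
        /Temperature -/ {
            temp = $(NF - 1)                     # value just before the trailing "C"
            if (temp + 0 > limit + 0) print adapter, temp
        }
    '
}

# Example use:
#   sudo aticonfig --adapter=all --odgt | hot_adapters 75
```

With the readings posted above and a limit of 75, only adapter 0 (82.00 C) would be reported.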

Phoenix miner settings on each GPU:
Code:
WORKSIZE=256 VECTORS BFI_INT AGGRESSION=9 -k phatk2


Smartcoin status screen after running the rig for 24 hours with no lockups:

Code:
Smartcoin r657s 10:00:56
----------------------------------------
Host: localhost
0: Temp: 81.50 load: 98%
1: Temp: 68.50 load: 99%
2: Temp: 59.50 load: 99%
3: Temp: 60.50 load: 99%
CPU Load Avgs: 0.33 0.42 0.42

Profile: deepbit
--------DeepBit--------
0:      [349.65 MHash/sec] [3104 Accepted] [538 Rejected] [17.332% Rejected]
1:      [349.56 MHash/sec] [3133 Accepted] [527 Rejected] [16.820% Rejected]
2:      [348.48 MHash/sec] [3139 Accepted] [459 Rejected] [14.622% Rejected]
3:      [349.73 MHash/sec] [3139 Accepted] [456 Rejected] [14.526% Rejected]
Total : [1397.42 MHash/sec] [12515 Accepted] [1980 Rejected] [15.821% Rejected]

Grand Total : [1397.42 MHash/sec] [12515 Accepted] [1980 Rejected] [15.821% Rej]
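
A note on the percentages in that status screen: going by the numbers, they appear to be rejected divided by accepted (1980 / 12515), not rejected divided by all submitted shares. That is an inference from the output, not something checked against the Smartcoin source. A quick sketch to compute both (the function name is made up, and truncating to three decimals is an assumption that happens to match the display):

```shell
#!/bin/bash
# reject_pct REJECTED DENOMINATOR
#   Prints REJECTED / DENOMINATOR as a percentage, truncated (not
#   rounded) to three decimals, which matches the figures shown in
#   the status screen above.
reject_pct() {
    awk -v r="$1" -v d="$2" 'BEGIN { printf "%.3f\n", int(r / d * 100000) / 1000 }'
}

# Rejected relative to accepted shares (what the screen seems to show):
#   reject_pct 1980 12515
# Rejected relative to all submitted shares (accepted + rejected):
#   reject_pct 1980 $((12515 + 1980))
```

The first call gives 15.821, matching the Total line; the second gives 13.659, which is the share of all submitted work that was rejected.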

Rig pic, my camera sucks:
https://i.imgur.com/iQZpG.jpg

Other hints:
  • I've been running this rig for the last 4 months, about 17 hours on each weekday, 24 hours on Saturday & Sunday.
  • I once accidentally left the fan at 20% while having the cards overclocked at 800 MHz for about an hour.  Shit!
  • I've tried using other mining pools, but the reject rate is always between 10% - 15%.
  • It's plugged into a kill-a-watt thing so I can track electricity usage.

My ideas to fix lockups:
  • replace the crappy USB stick hard drive with a higher quality one.
  • swap my 5970's positions: GPU[0] consistently has a much higher temp than the other GPUs.
  • switch from linuxcoin to ubuntu.

My ideas to fix high reject rate:
  • run the miner on a VPN so my ISP can't use QoS on my miner's traffic.

I know the simplest answer would be to lower my core clock, but I'd prefer not to do that if possible.  Let me know if more info is needed, thanks all!