Topic: Graphics cards locking up - why so randomly? (Read 1359 times)

hero member
Activity: 518
Merit: 500
Remember, the card isn't the only variable. Other factors can push it over its limit: voltage fluctuations, temperature differences, and perhaps even radiation of all kinds, including cosmic rays, although I'm unsure whether frequency would play any role there.

That said, even under lab conditions with perfect power, a stable atmosphere, and perfect shielding from everything, it will still be somewhat "random", but the frequency margin in which it happens is likely a lot narrower.
full member
Activity: 210
Merit: 100
Purely speculative, but I think using phoenix allows an instance of the miner for each GPU.  Thus you can monitor each process independently.  You can then kill and reload that specific hung process through the OS, whereas cgminer doesn't quite offer the high-level monitoring/control. If a card hangs in cgminer, you're at the mercy of cgminer to recognize the fault and try to re-initialize that card.  From my experience with cgminer, if this occurs it generally means a full system hang, but I've not played with phoenix to see if a hung process operates in a similar manner.
Nothing prevents you from launching a separate cgminer instance for each of your GPUs.

The issue with AMD cards from 6xxx generation onwards is that when they go down, they usually go down hard - not only will you be unable to resurrect them but if it is the system's primary(1) GPU that dies it can effectively take the whole system down by introducing freeze periods of a few dozen seconds each time the kernel is trying to access the non-responsive card.
Luckily, reboot -f has worked quite reliably for me, though situations where cycling the power was necessary have been reported in these forums.

Notes:
(1) usually closest to the CPU socket unless GPU ordering is changed in BIOS
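
For what it's worth, the one-instance-per-GPU idea discussed above can be sketched in a few lines. This is only a rough outline, not anything cgminer or phoenix ships: `PerGpuSupervisor`, `make_cmd` and `is_hung` are made-up names, and the actual miner command line is left to the caller. Keep in mind the caveat above: when a 6xxx-or-newer card goes down hard, restarting its process may not bring it back.

```python
import subprocess


class PerGpuSupervisor:
    """Runs one miner process per GPU and restarts them individually."""

    def __init__(self, make_cmd, gpu_ids, spawn=subprocess.Popen):
        # make_cmd(gpu_id) -> argv list; hypothetical, supplied by the caller.
        self.make_cmd = make_cmd
        self.spawn = spawn
        self.procs = {gpu: spawn(make_cmd(gpu)) for gpu in gpu_ids}
        self.restarts = {gpu: 0 for gpu in gpu_ids}

    def restart(self, gpu):
        """Kill the process for one GPU (if still running) and start a new one."""
        proc = self.procs[gpu]
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        self.procs[gpu] = self.spawn(self.make_cmd(gpu))
        self.restarts[gpu] += 1

    def poll_once(self, is_hung=lambda gpu: False):
        """Restart any miner that has exited or that is_hung(gpu) flags."""
        for gpu in list(self.procs):
            if self.procs[gpu].poll() is not None or is_hung(gpu):
                self.restart(gpu)
```

You would call `poll_once()` from a loop every few seconds. How to decide that a still-running process is hung (for example, no accepted share for N minutes) is up to you; the sketch only takes that decision as a callback.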
full member
Activity: 210
Merit: 100
Likely means excessive overclock or a card about to fail.
...or (memory) underclock.
Not only am I talking about going below the physical capabilities of the RAM chips, but some particularly bad core-to-memory clock ratios can really destabilize the GPUs, with symptoms ranging from performance holes, through HW errors, to hard crashes.
I personally believe this rare but repeatable stability loss might be the core reason for AMD imposing the memdiff since the 6xxx generation cards.

I like some of your posts here, Death, especially your touching on the extremely subjective notion of card stability.
How about we formalize (in an informal way :P) and shorten some of your stability thresholds:
#define CIS "crash in seconds"
#define CIH "crash in hours"
#define CIW "crash in weeks"
#define CIM "crash in months"
#define CIY "crash in years"       //the holy grail of overclocked mining


I've seen the CIS-to-CIM delta range from an astounding 8 MHz (that's a great overclock) to an unimpressive 35 MHz (an XFX 6970 that really pissed me off until I understood its finickiness).
donator
Activity: 1218
Merit: 1080
Gerald Davis
Most pools combine stale shares with invalid shares and provide a single stat.  Any pool reporting invalid shares separately also works.  Without knowing what % (if any) are bad shares (not just stale ones) the % doesn't really tell you anything.  


BTW: cgminer checks the nonce returned by the GPU with the CPU (this only occurs about once every 4 billion nonces, so it is minimal CPU load).  If it detects a HW error, it never submits it to the pool and increments the HW counter.  cgminer isn't a magic bullet.  Sometimes you will have no HW errors and still have cards lock up, but HW > 0 is a bad sign.  It likely means an excessive overclock or a card about to fail.
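
To make the CPU-side check concrete, here is a minimal sketch of the idea. It is not cgminer's actual code: `block_hash_int`, `NonceChecker` and the `target` parameter are names invented for illustration, and it assumes the standard Bitcoin block header (76 bytes followed by a 4-byte little-endian nonce, hashed with double SHA-256).

```python
import hashlib
import struct


def block_hash_int(header76: bytes, nonce: int) -> int:
    """Double SHA-256 of the 80-byte header, read as a little-endian integer."""
    if len(header76) != 76:
        raise ValueError("expected the 76 header bytes that precede the nonce")
    header = header76 + struct.pack("<I", nonce)
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")


class NonceChecker:
    """Re-hashes each nonce the GPU reports and counts the bad ones."""

    def __init__(self, target: int):
        self.target = target   # share target (assumed to come from the pool)
        self.hw_errors = 0     # plays the role of the HW counter

    def accept(self, header76: bytes, nonce: int) -> bool:
        # One double hash on the CPU per reported nonce: cheap, because the
        # GPU only reports a nonce once in a very large number of attempts.
        if block_hash_int(header76, nonce) <= self.target:
            return True        # genuine share, safe to submit
        self.hw_errors += 1    # GPU claimed a share that does not verify
        return False
```

A nonce that fails this check never reaches the pool, which is why such errors show up in the miner's HW counter rather than in the pool's reject stats.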
hero member
Activity: 492
Merit: 503
Neither the miner nor the driver is shutting down the GPU; it simply stops responding to commands.  The GPU is almost like a self-contained computing environment (think OS for math).  The host system gives the GPU a kernel to run, provides inputs, collects output, and periodically provides GPU control instructions.  Other than that, the GPU operates autonomously.

If the GPU is making errors, some of them show up purely in computations (2+2=5) and some in flow control.  The first kind shows up as HW errors.  The second kind can show up as an unresponsive GPU.

When a GPU crashes it is often still at full load.  So it is doing "something", just not anything useful, and it no longer responds to any command and control signals from the driver.  There is nothing you can do to avoid that other than not pushing GPUs past their point of stability.

Aww. What a great post, completely spoiled by the ending.

Actually, would I even need cgminer to detect HW errors? I'm not familiar with it, but surely it just detects them the same way other miners do - when the GPU reports a good share, it sends it to the pool, who replies 'bzzt piss off this is crap', resulting in an invalid share? As it happens my pool reports very few stales (<0.5% long term) and NO duplicates or 'others' which is I presume what invalid shares are?

I dunno I'm kind of rambling here. What was my point again?
hero member
Activity: 697
Merit: 500
If it is just one why not just run the one "problem card" at higher memclock and save (by your calculations ... 30w)?

Not worth the trouble. Who knows whether another card has the same problem, masked so far by the more finicky one. 300 MHz on the RAM has been working fine for about 2 weeks now and I got tired of fiddling with it.
legendary
Activity: 1512
Merit: 1000


As for why the author of BAMT pushes new users towards phoenix: he feels phoenix is superior.  Some of us disagree, but he has marginalized cgminer in favor of phoenix.  That likely isn't ever going to change.

Purely speculative, but I think using phoenix allows an instance of the miner for each GPU.  Thus you can monitor each process independently.  You can then kill and reload that specific hung process through the OS, whereas cgminer doesn't quite offer the high-level monitoring/control. If a card hangs in cgminer, you're at the mercy of cgminer to recognize the fault and try to re-initialize that card.  From my experience with cgminer, if this occurs it generally means a full system hang, but I've not played with phoenix to see if a hung process operates in a similar manner.

Again, purely speculative based off my experience, I could be way off.
donator
Activity: 1218
Merit: 1080
Gerald Davis
Also, how exactly does the GPU go from 'giving stupid answers' to 'not functioning anymore'? Is there something in the drivers that checks for computing errors and shuts it down if there's too many? Or is it in the miner software? And how can I disable such a process (even at the cost of getting loads of rejected shares)?

Neither the miner nor the driver is shutting down the GPU; it simply stops responding to commands.  The GPU is almost like a self-contained computing environment (think OS for math).  The host system gives the GPU a kernel to run, provides inputs, collects output, and periodically provides GPU control instructions.  Other than that, the GPU operates autonomously.

If the GPU is making errors, some of them show up purely in computations (2+2=5) and some in flow control.  The first kind shows up as HW errors.  The second kind can show up as an unresponsive GPU.

When a GPU crashes it is often still at full load.  So it is doing "something", just not anything useful, and it no longer responds to any command and control signals from the driver.  There is nothing you can do to avoid that other than not pushing GPUs past their point of stability.

As for why the author of BAMT pushes new users towards phoenix: he feels phoenix is superior.  Some of us disagree, but he has marginalized cgminer in favor of phoenix.  That likely isn't ever going to change.
hero member
Activity: 492
Merit: 503
It isn't black or white.  STABLE vs UNSTABLE.  It is a gradient (hypothetical numbers):

Well, yes I do realise it isn't black or white, my question was really trying to ask 'WHY isn't it black and white?' Having said that, I can kind of think of an answer. It's (waves hand around vaguely) quantum mechanics, which if memory serves me correctly, is supposed to be based on fundamentally random processes anyway. Electronic processes are quantum processes, so I guess for each given hash there is a probability that the answer will be garbage, and I suppose that as a function of clock speed, this probability goes up in a probably vaguely sigmoid-looking way.
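
That hand-waving can be turned into a back-of-the-envelope model. Assume (and it is only an assumption) that every hash has the same tiny, independent chance p of triggering a fatal glitch at a given clock. Then the number of hashes until the first crash is geometrically distributed, which is exactly the "fine for nine hours, then dead" pattern. The function names and numbers below are made up for illustration.

```python
import math


def survival_probability(p_fail, hashes):
    """Chance of getting through `hashes` independent hashes with no fatal error."""
    # (1 - p)^n computed via log1p so tiny p values are not rounded away.
    return math.exp(hashes * math.log1p(-p_fail))


def mean_hours_to_failure(p_fail, hashes_per_second):
    """Mean of the geometric distribution (1/p hashes), converted to hours."""
    return 1.0 / p_fail / hashes_per_second / 3600.0


# Hypothetical card: 400 Mhash/s, one fatal glitch per 2.5e13 hashes on average.
print(mean_hours_to_failure(4e-14, 400e6))
print(survival_probability(4e-14, 400e6 * 3600 * 9))   # chance of surviving 9 h
```

If p climbs steeply with clock speed, a few MHz can move the mean time to failure from seconds to months, while any individual run can still die much earlier or later than the mean.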

Quote
Another thing to look for (in cgminer) is HW errors.  Those are caused by the card returning garbage as "work completed".  It is a good sign the GPU is redlining and is right on the edge of stability.

Now that sounds like a good idea. I use BAMT, which I understand has cgminer as an option. Is there a good reason why the author steers newbs towards Phoenix and away from cgminer?

Also, how exactly does the GPU go from 'giving stupid answers' to 'not functioning anymore'? Is there something in the drivers that checks for computing errors and shuts it down if there's too many? Or is it in the miner software? And how can I disable such a process (even at the cost of getting loads of rejected shares)?
donator
Activity: 1218
Merit: 1080
Gerald Davis
If it is just one why not just run the one "problem card" at higher memclock and save (by your calculations ... 30w)?
hero member
Activity: 697
Merit: 500
Sometimes it is also connected to the core:mem clock ratio; I had cards which were stable at slightly higher clocks but not at lower ones.

I have a 5970 in my quad-5970 rig that refuses to run with a mem clock lower than ~270 MHz. All the other cards run happily at 150 MHz on the mem, but this one damn card will BSOD the box at random. It took a long time to figure that out. So I now run them all at 300 MHz; the rig consumes about 40 extra watts but runs stable for weeks at a time.
legendary
Activity: 3472
Merit: 1727
Sometimes it is also connected to the core:mem clock ratio; I had cards which were stable at slightly higher clocks but not at lower ones.
donator
Activity: 1218
Merit: 1080
Gerald Davis
It's like speeding. It won't kill you every time, but it reduces the margin of error and increases your chance of crashing and dying.

Very good analogy.  

To the OP: once you find the speed where cards are stable for hours but not days, you are close but still "speeding" a little too much.  Try dropping the clock 5-10 MHz (remember every single GPU is different, there is no "standard" speed).  Cards will likely now be stable for 72+ hours.  At that point you need to decide whether a reboot every 3 days is better or worse than dropping the cards another 5-10 MHz (where they may be stable for 30+ days).

Another thing to look for (in cgminer) is HW errors.  Those are caused by the card returning garbage as "work completed".  It is a good sign the GPU is redlining and is right on the edge of stability.

It isn't black or white.  STABLE vs UNSTABLE.  It is a gradient (hypothetical numbers):
Crash instantly (0.0 ms) - max speed
Crash in seconds - slightly lower
Crash in minutes - slightly lower
Crash in hours - slightly lower
Crash in weeks - slightly lower
Crash in months - slightly lower
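
The "reboot every 3 days vs. drop another 5-10 MHz" decision above can be put into rough numbers. This is only a sketch under simplifying assumptions: hashrate is taken as proportional to core clock, and each crash costs a fixed amount of downtime before someone notices and reboots. `effective_rate` and all the figures are hypothetical.

```python
def effective_rate(clock_mhz, hours_between_crashes, downtime_hours,
                   rate_per_mhz=1.0):
    """Long-run average hashrate once crash downtime is accounted for."""
    uptime_fraction = hours_between_crashes / (hours_between_crashes + downtime_hours)
    return clock_mhz * rate_per_mhz * uptime_fraction


# Hypothetical card: crashes every 72 h at 900 MHz, every 720 h at 890 MHz.
# With an automatic reboot (about 0.1 h lost) the faster clock still wins;
# if a crash sits unnoticed for 8 h, the slower clock comes out ahead.
for downtime in (0.1, 8.0):
    fast = effective_rate(900, 72, downtime)
    slow = effective_rate(890, 720, downtime)
    print(downtime, round(fast, 1), round(slow, 1))
```

So the answer mostly depends on how quickly a hung rig gets noticed and rebooted, not on the handful of MHz itself.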
legendary
Activity: 1284
Merit: 1001
It's like speeding. It won't kill you every time, but it reduces the margin of error and increases your chance of crashing and dying.
legendary
Activity: 1400
Merit: 1000
I owe my soul to the Bitcoin code...
Yeah, when you see something like this it usually means 'I am not quite happy at these clocks'.

I still have specific cards that I have not gotten fully stable yet, they never seem happy.

Not every chip made is a workhorse, some are just glue.  ;D
member
Activity: 121
Merit: 10
It just means that there's a very slight chance for something to go wrong, and it will eventually happen.

I've even had my almost-stable GPUs run for days or weeks and then crash.

Sometimes they're weird though... as in, a card is stable for quite a long time, then you underclock it by some 5-10 MHz and it still does the same. You'd think that taking 5 MHz off an almost-stable card would make it "fully" stable, but alas that isn't always the case.  :P
hero member
Activity: 492
Merit: 503
So I can understand a GPU running for months without a problem. And I can understand setting the engine clock to 1200 MHz and the GPU telling me to f**k off as soon as I start mining. What I don't get is the pattern I observe when I'm testing out how much juice I can get out of one, and it runs for nine hours, or even nineteen, and then hangs up.

Hash #1: I don't like the speed you've got me running at, but I'll do this quietly and not say anything about it.
Hash #2: Here you go.
Hash #3: Here you go.
...
Hash #6,704,333,152: Here you go.
Hash #6,704,333,153: Here you go.
Hash #6,704,333,154: Here you go.
...
Hash #24,897,634,216,605: Here you go.
Hash #24,897,634,216,606: Here you go.
Hash #24,897,634,216,607: OH MY GOD PANIC DIE BLAAAAAAAAAAAAARGH.

What the hell? I'm sure it can't be the GPU temperature, as that never goes more than the mid-70s.