
Topic: Graphics cards locking up - why so randomly?

hero member
Activity: 518
Merit: 500
Remember, the card isn't the only variable. There are other factors that may push it over its limit, like voltage fluctuations and temperature differences, and perhaps even all kinds of radiation, including cosmic rays, although I'm unsure whether frequency would play any role there.

That said, even under lab conditions with perfect power, a stable atmosphere, and perfect shielding from everything, it will still be somewhat "random", but the frequency margin in which it happens is likely a lot narrower.
full member
Activity: 210
Merit: 100
Purely speculative, but I think using phoenix allows an instance of the miner for each GPU.  Thus you can monitor each process independently.  You can then kill and reload that specific hung process through the OS, whereas cgminer doesn't quite offer that high-level monitoring/control. If a card hangs in cgminer, you're at the mercy of cgminer to recognize the fault and try to re-initialize that card.  From my experience with cgminer, if this occurs it generally means a full system hang, but I've not played with phoenix to see if a hung process behaves in a similar manner.
Nothing prevents you from launching a separate cgminer instance for each of your GPUs.

The issue with AMD cards from the 6xxx generation onwards is that when they go down, they usually go down hard - not only will you be unable to resurrect them, but if it is the system's primary(1) GPU that dies, it can effectively take the whole system down by introducing freeze periods of a few dozen seconds each time the kernel tries to access the non-responsive card.
Luckily, reboot -f has worked quite reliably for me, though situations where cycling the power was necessary have been reported on these forums.

Notes:
(1) usually closest to the CPU socket unless GPU ordering is changed in BIOS
full member
Activity: 210
Merit: 100
Likely means excessive overclock or a card about to fail.
...or (memory) underclock.
I'm not only talking about going below the physical capabilities of the RAM chips; some particularly bad core-to-memory clock ratios can really destabilize the GPU, with symptoms ranging from performance holes, through HW errors, to hard crashes.
I personally believe this rare but repeatable stability loss might be the core reason for AMD imposing the memdiff since the 6xxx-generation cards.

I like some of your posts here, Death, especially your touching on the extremely subjective notion of card stability.
How about we formalize (in an informal way Tongue) and shorten some of your stability thresholds:
#define CIS "crash in seconds"
#define CIH "crash in hours"
#define CIW "crash in weeks"
#define CIM "crash in months"
#define CIY "crash in years"       //the holy grail of overclocked mining


I've seen the CIS-to-CIM delta range from an astounding 8 MHz (that's a great overclock) to an unimpressive 35 MHz (an XFX 6970 that really pissed me off until I understood its finickiness).
donator
Activity: 1218
Merit: 1079
Gerald Davis
Most pools combine stale shares with invalid shares and provide a single stat.  A pool reporting invalid shares separately would also work.  Without knowing what % (if any) are bad shares (not just stale ones), the % doesn't really tell you anything.


BTW: cgminer checks the nonce returned by the GPU with the CPU (this only occurs once every ~4 billion nonces, so it is minimal CPU load).  If it detects a HW error it never submits it to the pool and just increments the HW counter.  cgminer isn't a magic bullet.  Sometimes you will have no HW errors and still have cards lock up, but HW > 0 is a bad sign.  It likely means excessive overclock or a card about to fail.
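That CPU-side check is conceptually just a double-SHA256 of the 80-byte block header with the GPU's nonce plugged in, compared against the share target. A sketch of the idea (checked here against Bitcoin's well-known genesis block, not against cgminer's actual source):

```python
import hashlib
import struct

def sha256d(b):
    # Bitcoin's double SHA-256
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def check_nonce(header76, nonce, target):
    """True iff header+nonce hashes to a value <= target.
    header76 is the 80-byte block header minus the 4-byte nonce field."""
    h = sha256d(header76 + struct.pack("<I", nonce))
    return int.from_bytes(h, "little") <= target

# Difficulty-1 target, expanded from the compact "bits" value 0x1d00ffff.
DIFF1_TARGET = 0x00FFFF * 2 ** (8 * (0x1D - 3))

# Genesis block header fields (public record), minus the nonce:
genesis76 = (
    struct.pack("<I", 1)                      # version
    + bytes(32)                               # prev block hash (all zero)
    + bytes.fromhex(                          # merkle root, little-endian
        "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a")
    + struct.pack("<I", 1231006505)           # timestamp
    + struct.pack("<I", 0x1D00FFFF)           # bits
)

print(check_nonce(genesis76, 2083236893, DIFF1_TARGET))  # True: the real nonce
print(check_nonce(genesis76, 2083236894, DIFF1_TARGET))  # a wrong nonce - almost surely fails
```

A GPU that returns a "share" failing this check has produced garbage, which is exactly what the HW counter records.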
hero member
Activity: 492
Merit: 503
Neither the miner nor the driver is shutting down the GPU; it simply stops responding to commands.  The GPU is almost like a self-contained computing environment (think OS for math).  The host system gives the GPU a kernel to run, provides inputs, collects output, and periodically provides GPU control instructions.  Other than that, the GPU operates autonomously.

If the GPU is making errors, some of those errors could manifest themselves in pure computations (2+2=5) and some can manifest themselves in flow control.  The first will manifest themselves as HW errors.  The second can manifest themselves as an unresponsive GPU.

When a GPU crashes it is often still at full load.  So it is doing "something", just not anything useful, and no longer responding to any command-and-control signals from the driver.  There is nothing you can do to avoid that other than not pushing GPUs past their point of stability.

Aww. What a great post, completely spoiled by the ending.

Actually, would I even need cgminer to detect HW errors? I'm not familiar with it, but surely it just detects them the same way other miners do - when the GPU reports a good share, it sends it to the pool, which replies 'bzzt, piss off, this is crap', resulting in an invalid share? As it happens, my pool reports very few stales (<0.5% long term) and NO duplicates or 'others', which are, I presume, what invalid shares are?

I dunno I'm kind of rambling here. What was my point again?
hero member
Activity: 697
Merit: 500
If it is just one why not just run the one "problem card" at higher memclock and save (by your calculations ... 30w)?

Not worth the trouble. Who knows whether another card would exhibit the same behavior that was masked by the more finicky one. 300 MHz on the RAM has been working fine for about 2 weeks now, and I got tired of fiddling with it.
legendary
Activity: 1512
Merit: 1000


As for why the author of BAMT pushes new users towards phoenix: he feels phoenix is superior.  Some of us disagree, but he has marginalized cgminer in favor of phoenix.  That likely isn't ever going to change.

Purely speculative, but I think using phoenix allows an instance of the miner for each GPU.  Thus you can monitor each process independently.  You can then kill and reload that specific hung process through the OS, whereas cgminer doesn't quite offer that high-level monitoring/control. If a card hangs in cgminer, you're at the mercy of cgminer to recognize the fault and try to re-initialize that card.  From my experience with cgminer, if this occurs it generally means a full system hang, but I've not played with phoenix to see if a hung process behaves in a similar manner.

Again, purely speculative based off my experience, I could be way off.
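The kill-and-reload idea is straightforward to script at the OS level regardless of which miner is used. A toy watchdog sketch; the "hang" check here is just process liveness (a real one would watch log output or an API), and the stand-in command is hypothetical:

```python
import subprocess
import sys

def ensure_running(proc, cmd):
    """Relaunch the miner process if it has exited or been killed."""
    if proc.poll() is not None:   # None means still running
        return subprocess.Popen(cmd)
    return proc

def kill_and_reload(proc, cmd):
    """Force-restart one miner instance without touching its siblings."""
    proc.kill()
    proc.wait()
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    # Stand-in for a per-GPU miner process: just sleeps.
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    p = subprocess.Popen(cmd)
    p = kill_and_reload(p, cmd)   # simulate recovering a hung instance
    print(p.poll() is None)       # True: replacement process is alive
    p.kill()
    p.wait()
```

Detecting a truly hung (but not dead) GPU process takes more than poll(); in practice people watch for stalled share counts or stale log timestamps before pulling the trigger.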
donator
Activity: 1218
Merit: 1079
Gerald Davis
Also, how exactly does the GPU go from 'giving stupid answers' to 'not functioning anymore'? Is there something in the drivers that checks for computing errors and shuts the card down if there are too many? Or is it in the miner software? And how can I disable such a process (even at the cost of getting loads of rejected shares)?

Neither the miner nor the driver is shutting down the GPU; it simply stops responding to commands.  The GPU is almost like a self-contained computing environment (think OS for math).  The host system gives the GPU a kernel to run, provides inputs, collects output, and periodically provides GPU control instructions.  Other than that, the GPU operates autonomously.

If the GPU is making errors, some of those errors could manifest themselves in pure computations (2+2=5) and some can manifest themselves in flow control.  The first will manifest themselves as HW errors.  The second can manifest themselves as an unresponsive GPU.

When a GPU crashes it is often still at full load.  So it is doing "something", just not anything useful, and no longer responding to any command-and-control signals from the driver.  There is nothing you can do to avoid that other than not pushing GPUs past their point of stability.

As for why the author of BAMT pushes new users towards phoenix: he feels phoenix is superior.  Some of us disagree, but he has marginalized cgminer in favor of phoenix.  That likely isn't ever going to change.
hero member
Activity: 492
Merit: 503
It isn't black or white.  STABLE vs UNSTABLE.  It is a gradient (hypothetical numbers):

Well, yes, I do realise it isn't black or white; my question was really trying to ask 'WHY isn't it black and white?' Having said that, I can kind of think of an answer. It's (waves hand around vaguely) quantum mechanics, which, if memory serves me correctly, is supposed to be based on fundamentally random processes anyway. Electronic processes are quantum processes, so I guess for each given hash there is a probability that the answer will be garbage, and I suppose that, as a function of clock speed, this probability goes up in a vaguely sigmoid-looking way.

Quote
Another thing to look for (in cgminer) is HW errors.  Those are caused by the card returning garbage as "work completed".  It is a good sign the GPU is redlining and is right on the edge of stability.

Now that sounds like a good idea. I use BAMT, which I understand has cgminer as an option. Is there a good reason why the author steers newbs towards Phoenix and away from cgminer?

Also, how exactly does the GPU go from 'giving stupid answers' to 'not functioning anymore'? Is there something in the drivers that checks for computing errors and shuts the card down if there are too many? Or is it in the miner software? And how can I disable such a process (even at the cost of getting loads of rejected shares)?
donator
Activity: 1218
Merit: 1079
Gerald Davis
If it is just one why not just run the one "problem card" at higher memclock and save (by your calculations ... 30w)?
hero member
Activity: 697
Merit: 500
Sometimes it is also connected to the core:mem clock ratio; I had cards which were stable at slightly higher clocks than at lower ones.

I have a 5970 in my quad-5970 rig that refused to run with a mem clock lower than ~270 MHz. All the other cards run happily at 150 MHz on the mem, but this one damn card would BSOD the box at random. It took a long time to figure that out. So I now run them at 300 MHz, consuming about 40 extra watts, but it runs stable for weeks at a time.
legendary
Activity: 3472
Merit: 1724
Sometimes it is also connected to the core:mem clock ratio; I had cards which were stable at slightly higher clocks than at lower ones.
donator
Activity: 1218
Merit: 1079
Gerald Davis
It's like speeding. It won't kill you every time, but it reduces the margin of error and increases your chance of crashing and dying.

Very good analogy.  

To the OP: once you find the speed where cards are stable for hours but not days, you are close but still "speeding" a little too much.  Try dropping the clock 5-10 MHz (remember, every single GPU is different; there is no "standard" speed).  Cards will likely now be stable for 72+ hours.  At that point you need to decide whether a reboot every 3 days is better or worse than dropping the cards another 5-10 MHz lower (where they may be stable for 30+ days).

Another thing to look for (in cgminer) is HW errors.  Those are caused by the card returning garbage as "work completed".  It is a good sign the GPU is redlining and is right on the edge of stability.

It isn't black or white.  STABLE vs UNSTABLE.  It is a gradient (hypothetical numbers):
Crash instantly (0.0 ms) - max speed
Crash in seconds - slightly lower
Crash in minutes - slightly lower
Crash in hours - slightly lower
Crash in weeks - slightly lower
Crash in months - slightly lower
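The gradient falls straight out of a tiny per-hash failure probability: the expected time to the first fatal error is roughly 1/(p x hashrate), so each factor-of-ten change in p moves you one rung down the list. The hashrate below is an assumed 300 MH/s, and the p values are purely illustrative:

```python
HASHRATE = 300e6  # hashes per second (assumed figure for a GPU of this era)

def expected_seconds_to_crash(p_per_hash):
    # Geometric waiting time: on average 1/(p * rate) seconds until
    # the first hash whose error is fatal rather than merely garbage.
    return 1.0 / (p_per_hash * HASHRATE)

for p in (1e-8, 1e-11, 1e-13, 1e-15):
    s = expected_seconds_to_crash(p)
    print(f"p={p:.0e}: {s:,.0f} s (~{s/3600:.1f} h, ~{s/86400:.1f} days)")
```

With these numbers, p = 1e-8 crashes within a second, 1e-11 within minutes, 1e-13 within hours, and 1e-15 only after weeks; the "random" crash times are just samples from these distributions.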
legendary
Activity: 1284
Merit: 1001
It's like speeding. It won't kill you every time, but it reduces the margin of error and increases your chance of crashing and dying.
legendary
Activity: 1400
Merit: 1000
I owe my soul to the Bitcoin code...
Yeah, when you see something like this it usually means 'I am not quite happy at these clocks'.

I still have specific cards that I have not gotten fully stable yet, they never seem happy.

Not every chip made is a workhorse; some are just glue.  Grin
member
Activity: 121
Merit: 10
It just means that there's a very slight chance for something to go wrong, and it will eventually happen.

I've had my almost-stable GPUs run for days or even weeks and then crash.

Sometimes they're weird, though... as in, having a card be stable for quite a long time, then underclocking it by some 5-10 MHz and it still doing the same. You'd think that taking 5 MHz off an almost-stable card would make it "fully" stable, but alas, that isn't always the case.  Tongue
hero member
Activity: 492
Merit: 503
So I can understand a GPU running for months without a problem. And I can understand setting the engine clock to 1200MHz and the GPU telling me to f  Angry k off as soon as I start mining. What I don't get is the pattern I observe when I'm testing out how much juice I can get out of one, and it runs for nine hours, or even nineteen, and then hangs up.

Hash #1: I don't like the speed you've got me running at, but I'll do this quietly and not say anything about it.
Hash #2: Here you go.
Hash #3: Here you go.
...
Hash #6,704,333,152: Here you go.
Hash #6,704,333,153: Here you go.
Hash #6,704,333,154: Here you go.
...
Hash #24,897,634,216,605: Here you go.
Hash #24,897,634,216,606: Here you go.
Hash #24,897,634,216,607: OH MY GOD PANIC DIE BLAAAAAAAAAAAAARGH.

What the hell? I'm sure it can't be the GPU temperature, as that never goes above the mid-70s.
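For what it's worth, the hash counts in the joke are self-consistent with the observed nine-to-nineteen-hour runs. At an assumed 300 MH/s (the card model and real hashrate aren't stated), ~24.9 trillion hashes is about a day of mining, which is exactly what a tiny, constant per-hash failure probability would predict, with no temperature trend required:

```python
HASHES_BEFORE_CRASH = 24_897_634_216_607  # the last "Here you go" above
HASHRATE = 300e6                          # assumed hashes/second

seconds = HASHES_BEFORE_CRASH / HASHRATE
print(f"~{seconds / 3600:.1f} hours of hashing before the blow-up")  # ~23.1 hours
```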