
Topic: mining rig keeps dying. Too hot? (Read 1144 times)

legendary
Activity: 1848
Merit: 1165
My AR-15 ID's itself as a toaster. Want breakfast?
August 02, 2017, 06:40:44 PM
#23
It may well be a sick card. Well, it is not really sick; it just cannot handle your overclock. Even if they are all the same model, one of them may not be able to survive the memory overclock. That is actually what happened to my rig.

1. Read the log and locate which GPU hung or errored first, say GPU 0.
2. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. Try them one by one until you find it.

Or you can just disable the watchdog in your bat file by setting -wd 0. In that case the sick card will stop, but your other cards will keep working. Then touch each card and feel the temperature; you will know which one is bad.

If you cannot find a card to blame, then you may need to reinstall Windows and update it to the latest.

Hope this helps.

Okay. I would like to do this, but I am not sure how you disable just one of the GPUs.

Since device 3 has been the one to stop first the last two times, according to my log, I am guessing that is the problem. I'm running all cards again now to see if I get that same device to fail, and trying to figure out how to disable the device and determine which card it actually is.

Use Nvidia Inspector to find the GPU number in question.

What miner app are you using?

Assuming ccminer, use the -d flag to choose which devices to use:
-d 0,1,3,4,5
and with ccminer you can trim the intensity per card as well:
-i 17,17,13,17,17
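
For example, a full launch line in the .bat could combine both flags like this (the algorithm, pool, and wallet below are just placeholders, and exact options vary between ccminer forks, so treat it as a sketch rather than a drop-in config):

ccminer -a ALGO -o stratum+tcp://POOL:PORT -u WALLET.WORKER -p x -d 0,1,3,4,5 -i 17,17,13,17,17

With that -d list, device 2 is left out entirely, and the intensities in -i should map to the selected devices in order, so the 13 would land on device 3.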
newbie
Activity: 15
Merit: 0
August 02, 2017, 05:00:16 PM
#22
It may well be a sick card. Well, it is not really sick; it just cannot handle your overclock. Even if they are all the same model, one of them may not be able to survive the memory overclock. That is actually what happened to my rig.

1. Read the log and locate which GPU hung or errored first, say GPU 0.
2. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. Try them one by one until you find it.

Or you can just disable the watchdog in your bat file by setting -wd 0. In that case the sick card will stop, but your other cards will keep working. Then touch each card and feel the temperature; you will know which one is bad.

If you cannot find a card to blame, then you may need to reinstall Windows and update it to the latest.

Hope this helps.

Okay. I would like to do this, but I am not sure how you disable just one of the GPUs.

Since device 3 has been the one to stop first the last two times, according to my log, I am guessing that is the problem. I'm running all cards again now to see if I get that same device to fail, and trying to figure out how to disable the device and determine which card it actually is.
newbie
Activity: 15
Merit: 0
August 02, 2017, 04:53:27 PM
#21
So it ran for 10 hours last night and then the devices fell offline.

I am mining Zcash. When it stops working correctly, it gives me a reading of 0 sols being produced. The miner window is still up and all of the GPUs keep trying to restart but can't.

The last two times, the first device listed was device 3, with the following log reading:

Device 3 thread exited with code:  77

Then all of the other GPUs exit with the same code 77.
sr. member
Activity: 504
Merit: 267
HashWare - Mining solutions for everyone!
August 02, 2017, 02:53:52 AM
#20
Sorry, I didn't read all the replies, but I have also experienced the same symptoms. For me it was crashing because of too much OC and high temps. I then had to manually underclock every card that caused the rig to crash. This is tedious work, I know, but it's the only way I managed to make it stable.
member
Activity: 202
Merit: 10
Eloncoin.org - Mars, here we come!
August 02, 2017, 12:36:41 AM
#19
It may well be a sick card. Well, it is not really sick; it just cannot handle your overclock. Even if they are all the same model, one of them may not be able to survive the memory overclock. That is actually what happened to my rig.

1. Read the log and locate which GPU hung or errored first, say GPU 0.
2. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. Try them one by one until you find it.

Or you can just disable the watchdog in your bat file by setting -wd 0. In that case the sick card will stop, but your other cards will keep working. Then touch each card and feel the temperature; you will know which one is bad.
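
As a rough sketch, the .bat with the watchdog disabled could look like the two lines below, assuming your miner accepts -wd as described above (the executable name and the pool/wallet switches are placeholders and differ from miner to miner, so check your miner's readme for the real ones):

YourMiner.exe -o stratum+tcp://POOL:PORT -u WALLET.WORKER -p x -wd 0
pause

The pause line just keeps the window open if the miner itself exits, so you can still read the last error message.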

If you cannot find a card to blame, then you may need to reinstall Windows and update it to the latest.

Hope this helps.
legendary
Activity: 4172
Merit: 8075
'The right to privacy matters'
August 02, 2017, 12:08:12 AM
#18
Model number: VCGGTX10603XGPB-OC-BB (PNY GTX 1060 3GB XLR8 version)

Yesterday I tore it down so that there were just two cards running on the 1000 W PSU. I ran it all day with a power limit of 70% and memory clock +476 on both cards individually.
I added the third card when I went to bed and woke up to a frozen screen again. So I replaced the riser on the third card.

Six hours ago, I attached two more cards to the 1000 W PSU for five total, but this time I set all five to power limit 70%, memory clock +476, core +46.

I also reinstalled the Nvidia drivers to the latest version, 384.94.

It's been running fine for the last six hours with card temperatures between 66 and 69°C. I don't have any fans on them, just the window slightly open. It's been 96°F out today; the air conditioners are in the other room.

471 W total.

I guess I'll let it run through the night and see what happens. Tomorrow I'll add the sixth card off the 850 W PSU.

So I'm not sure if it was a riser or the settings. I just hope it's working well now.


So these five let you do 70 percent TDP and ran overnight.

That sixth card could be the issue.
I have had issues like this.

Ran the oddball card in a one- or two-card rig and it worked well.

newbie
Activity: 15
Merit: 0
August 01, 2017, 11:58:37 PM
#17
Model number: VCGGTX10603XGPB-OC-BB (PNY GTX 1060 3GB XLR8 version)

Yesterday I tore it down so that there were just two cards running on the 1000 W PSU. I ran it all day with a power limit of 70% and memory clock +476 on both cards individually.
I added the third card when I went to bed and woke up to a frozen screen again. So I replaced the riser on the third card.

Six hours ago, I attached two more cards to the 1000 W PSU for five total, but this time I set all five to power limit 70%, memory clock +476, core +46.

I also reinstalled the Nvidia drivers to the latest version, 384.94.

It's been running fine for the last six hours with card temperatures between 66 and 69°C. I don't have any fans on them, just the window slightly open. It's been 96°F out today; the air conditioners are in the other room.

471 W total.

I guess I'll let it run through the night and see what happens. Tomorrow I'll add the sixth card off the 850 W PSU.

So I'm not sure if it was a riser or the settings. I just hope it's working well now.
legendary
Activity: 1848
Merit: 1165
My AR-15 ID's itself as a toaster. Want breakfast?
July 29, 2017, 07:30:40 PM
#16
I really have my $$$ on a faulty PSU.

I am not experiencing my random freezes and restarts now that my PSU temp is back down to a reasonable level... Those types of freezes and reboots really point towards the PSU.


Secondarily, test the RAM with memory-testing software... but honestly, nowadays power supplies cause all of the problems that memory used to cause back in the original SDRAM/first-gen DDR days.
full member
Activity: 298
Merit: 100
July 28, 2017, 11:45:31 PM
#15
I would suspect that either the PSU cannot handle the load and eventually craps out, or you may have a bad riser (or possibly a GPU instead). We have some PNY GPUs that will not go lower than 80% power limit in Afterburner; I believe they are the same ones you have. I can drag the slider past 80%, but monitoring the power draw, both via software and via a Kill A Watt meter at the wall, shows that the GPUs still consume ~80% power even if the slider is at 60 or 70%. I chalked it up to being hardcoded in the BIOS and didn't investigate it any further. The temps and profits both justified running them at 80%, so I didn't worry about it (I had actually forgotten until I read this thread).

I agree. I'm running 8 x 1060s on a 1000 W PSU sucking up 910 watts and it runs fine on that PSU. I suspect a riser or GPU.
sr. member
Activity: 1246
Merit: 274
July 28, 2017, 11:42:04 PM
#14
I would suspect that either the PSU cannot handle the load and eventually craps out, or you may have a bad riser (or possibly a GPU instead). We have some PNY GPUs that will not go lower than 80% power limit in Afterburner; I believe they are the same ones you have. I can drag the slider past 80%, but monitoring the power draw, both via software and via a Kill A Watt meter at the wall, shows that the GPUs still consume ~80% power even if the slider is at 60 or 70%. I chalked it up to being hardcoded in the BIOS and didn't investigate it any further. The temps and profits both justified running them at 80%, so I didn't worry about it (I had actually forgotten until I read this thread).
legendary
Activity: 1470
Merit: 1114
July 28, 2017, 10:45:30 PM
#13
It's power, the temps are fine. The 1060 is rated at 120W and you have 8 of them and the rest of the system powered by a 1000 W PSU.
Pull 2 cards from each rig and build a new rig with them or replace the PSUs with a pair of 1600 W beasts.


He has 1000 W for 5 cards and 850 W for the other three.

He said he could not turn the TDP down lower than 80%, so 8 cards x 120 W x 0.80 = 768 watts, with 1850 watts available.

He needs to see why the cards only drop to 80% TDP.

If he dropped them to 70% he would be at 8 x 120 W x 0.70 = 672 watts and he would not overheat. That extra ~96 watts of power is too hard to cool off.

Oops, that's what I get for skim reading. Still, 75°C is not that hot, and an overheated GPU won't cause a system shutdown. Either the power draw between the PSUs is unbalanced and overloading one of them, or one is defective and can't handle a normal load.

There are a few things that can be done to troubleshoot and isolate the suspect component. Split the rig and try running it with 5 cards on the big PSU, then try running it with only the other 3 cards on the second PSU. Swap cards with the other rig. Swap PSUs. Just move components around to see if the problem follows.
newbie
Activity: 15
Merit: 0
July 28, 2017, 10:35:04 PM
#12
2.5 hours. Must have been an error; Windows restarted.

So that's where I've been with this rig. It may run for two hours, it may run for four hours, it may run for one hour, but it won't run steadily.

Thankfully the other rig is running like a champ.
newbie
Activity: 15
Merit: 0
July 28, 2017, 10:17:54 PM
#11
I used a Kill A Watt to get a wattage reading from each power supply while it was mining: the 1000 W PSU is at 700 W and the 850 W PSU is at around 300 W, so 1000 W total. I took those readings when I started, without changing settings.

So now it's been running steady for two hours at 80% power, memory +400. The two hottest cards are now about 73 and 74°C. I'm holding my breath.

It's still 80°F in that room; I have the window open. It's 89°F outside.
legendary
Activity: 4172
Merit: 8075
'The right to privacy matters'
July 28, 2017, 09:03:47 PM
#10
It's power, the temps are fine. The 1060 is rated at 120W and you have 8 of them and the rest of the system powered by a 1000 W PSU.
Pull 2 cards from each rig and build a new rig with them or replace the PSUs with a pair of 1600 W beasts.


He has 1000 W for 5 cards and 850 W for the other three.

He said he could not turn the TDP down lower than 80%, so 8 cards x 120 W x 0.80 = 768 watts, with 1850 watts available.

He needs to see why the cards only drop to 80% TDP.

If he dropped them to 70% he would be at 8 x 120 W x 0.70 = 672 watts and he would not overheat. That extra ~96 watts of power is too hard to cool off.
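
If the Afterburner slider really is stuck at 80%, another thing he could try is setting the limit in watts with nvidia-smi from an admin command prompt (it ships with the driver, usually in C:\Program Files\NVIDIA Corporation\NVSMI on Windows). 84 W would be roughly 70% of the 1060's 120 W TDP, and the tool rejects values outside the card's supported range:

nvidia-smi -i 0 -pl 84

Repeat with -i 1, -i 2, and so on for the rest of the cards.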
legendary
Activity: 1470
Merit: 1114
July 28, 2017, 08:51:13 PM
#9
It's power, the temps are fine. The 1060 is rated at 120W and you have 8 of them and the rest of the system powered by a 1000 W PSU.
Pull 2 cards from each rig and build a new rig with them or replace the PSUs with a pair of 1600 W beasts.
legendary
Activity: 1848
Merit: 1165
My AR-15 ID's itself as a toaster. Want breakfast?
July 28, 2017, 08:29:39 PM
#8
Be sure it's not the power supply overheating as well... use a laser thermometer to check.

I have a 1000 W power supply that didn't like having my GTX 980 right against it in the Fractal Design Define R5 case... and if it got pretty hot inside, it would randomly freeze or reboot.

Screwing a fan to the exhaust port in the back to suck more air through it helped immensely, but ultimately I removed the card that was butted right up against it and I haven't had an issue since.
newbie
Activity: 15
Merit: 0
July 28, 2017, 07:56:25 PM
#7
I will try the suggestions.

I am also going to bring the rig out to the bigger room where it is a little cooler and see if that changes things.


I currently have it set to 80% power right now with memory +400, and I have two cards running hotter at 74 and 75°C. The lowest are at 64°C, where the fans are probably hitting them. I'm sure it will freeze pretty soon.
legendary
Activity: 4172
Merit: 8075
'The right to privacy matters'
July 28, 2017, 07:34:17 PM
#6
If you cannot lower the TDP in MSI Afterburner, something is wrong.

Go down to 1 card and see if the TDP slider works.

If it is stuck:

Uninstall MSI Afterburner and uninstall the Nvidia drivers.

Try installing Nvidia 382.33 with 1 card. Try installing MSI Afterburner 4.3 with 1 card.

See if the slider works.

Some advice: never, never, never run 100% TDP. Many will say I am wrong; I simply don't care.

If the slider does not work with that one card, try it in every slot and riser, just that one card.

If that card never lets the TDP slider work, walk it out of the room and try the next card. I have found that card 1, or 5, or 6, or even 4 can be the entire issue, and not using the one shithead card solves the problem. Most of the time.
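
As a cross-check on what Afterburner is doing, you can also read back the power limit each card is actually enforcing with nvidia-smi from a command prompt (the same nvidia-smi tool that ships with the driver):

nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv

If one card reports a limit that never moves when you drag the slider, that is probably your problem card.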
full member
Activity: 298
Merit: 100
July 28, 2017, 07:33:44 PM
#5
I really don't believe it's a temp thing; mid 60°C is perfectly fine. I'm running 60-65°C across all my cards.

Now, when you say the system dies, what happens? Stutters? Freezes? Can't use it at all and a forced restart is needed?


The rig in the small room freezes. Sometimes I can't move the mouse; the screen goes black and I just have to power it down and force a restart.

A couple of cards run at 78 degrees, a couple at 75 degrees, and some near 72 degrees, depending on where I put the big fan.

I've reinstalled drivers a few times.

I have the same board, fantastic board BTW, with the latest drivers of course. Those cards, I believe, are running hot; I don't like to see anything over 75°C. Damage doesn't start to kick in until you're nearing 90°C.
The slider issue alone is weird as heck! You should always have control of that thing no matter what. I kind of want to say it's a software issue, but it can also be hardware that isn't working properly.
You can start by lowering to the settings I sent you; then if that doesn't fix it, start removing cards one by one, then swap risers. Process of elimination.
newbie
Activity: 15
Merit: 0
July 28, 2017, 07:28:02 PM
#4
I really don't believe it's a temp thing; mid 60°C is perfectly fine. I'm running 60-65°C across all my cards.

Now, when you say the system dies, what happens? Stutters? Freezes? Can't use it at all and a forced restart is needed?


The rig in the small room freezes. Sometimes I can't move the mouse; the screen goes black and I just have to power it down and force a restart.

A couple of cards run at 78 degrees, a couple at 75 degrees, and some near 72 degrees, depending on where I put the big fan.

I've reinstalled drivers a few times.