
Topic: mining rig keeps dying. Too hot? (Read 1144 times)

legendary
Activity: 1848
Merit: 1165
My AR-15 ID's itself as a toaster. Want breakfast?
August 02, 2017, 06:40:44 PM
#23
It may well be a sick card. Well, it is not really sick; it just cannot handle your overclock. Even if they are all the same model, one of them may not be able to survive the memory overclock. That is actually what happened to my rig.

1. Read the log and locate which GPU hung or errored first, say GPU 0.
2. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. Try them one by one until you find it.

Or you can just disable the watchdog in your bat file by setting -wd 0. In that case the sick card will stop, but your other cards will keep working. Then touch each card and feel the temperature; you will know which one is bad.

If you cannot find a card to blame, then you may need to reinstall Windows and update it to the latest.

Hope this helps.

Okay. I would like to do this, but I am not sure how you disable just one of the GPUs.

Since device 3 has been the one to stop first the last two times, according to my log, I am guessing that is the problem. I'm running all cards again now to see if I get that same device to fail, and trying to figure out how to disable the device and determine which card it actually is.

Use Nvidia Inspector to find the GPU number in question.

What miner app are you using?

Assuming ccminer, use the -d flag to choose which devices to use:
-d 0,1,3,4,5
and with ccminer you can trim the intensity per card as well:
-i 17,17,13,17,17
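
For example, a full launch line in the .bat could combine both flags like this (the algorithm, pool, and wallet below are just placeholders, and exact options vary between ccminer forks, so treat it as a sketch rather than a drop-in config):

ccminer -a ALGO -o stratum+tcp://POOL:PORT -u WALLET.WORKER -p x -d 0,1,3,4,5 -i 17,17,13,17,17

With that -d list, device 2 is left out entirely, and the intensities in -i should map to the selected devices in order, so the 13 would land on device 3.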
newbie
Activity: 15
Merit: 0
August 02, 2017, 05:00:16 PM
#22
It may well be a sick card. Well, it is not really sick; it just cannot handle your overclock. Even if they are all the same model, one of them may not be able to survive the memory overclock. That is actually what happened to my rig.

1. Read the log and locate which GPU hung or errored first, say GPU 0.
2. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. Try them one by one until you find it.

Or you can just disable the watchdog in your bat file by setting -wd 0. In that case the sick card will stop, but your other cards will keep working. Then touch each card and feel the temperature; you will know which one is bad.

If you cannot find a card to blame, then you may need to reinstall Windows and update it to the latest.

Hope this helps.

Okay. I would like to do this, but I am not sure how you disable just one of the GPUs.

Since device 3 has been the one to stop first the last two times, according to my log, I am guessing that is the problem. I'm running all cards again now to see if I get that same device to fail, and trying to figure out how to disable the device and determine which card it actually is.
newbie
Activity: 15
Merit: 0
August 02, 2017, 04:53:27 PM
#21
So it ran for 10 hours last night and then the devices fell offline.

I am mining Zcash. When it stops working correctly, it gives me a reading of 0 sols being produced. The miner window is still up and all of the GPUs keep trying to restart but can't.

The last two times, the first device listed was device 3, with the following log reading:

Device 3 thread exited with code:  77

Then all of the other GPUs exit with the same code 77.
sr. member
Activity: 504
Merit: 267
HashWare - Mining solutions for everyone!
August 02, 2017, 02:53:52 AM
#20
Sorry, I didn't read all the replies, but I have also experienced the same symptoms. For me it was crashing because of too much OC and high temps. I then had to manually underclock every card that caused the rig to crash. This is tedious work, I know, but it's the only way I managed to make it stable.
member
Activity: 202
Merit: 10
Eloncoin.org - Mars, here we come!
August 02, 2017, 12:36:41 AM
#19
It may well be a sick card. Well, it is not really sick; it just cannot handle your overclock. Even if they are all the same model, one of them may not be able to survive the memory overclock. That is actually what happened to my rig.

1. Read the log and locate which GPU hung or errored first, say GPU 0.
2. What I did was press 0 to disable GPU 0 in the rig.
3. Watch whether your rig is stable. If it is, congratulations, you found the sick card.
4. Try them one by one until you find it.

Or you can just disable the watchdog in your bat file by setting -wd 0. In that case the sick card will stop, but your other cards will keep working. Then touch each card and feel the temperature; you will know which one is bad.
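
As a rough sketch, the .bat with the watchdog disabled could look like the two lines below, assuming your miner accepts -wd as described above (the executable name and the pool/wallet switches are placeholders and differ from miner to miner, so check your miner's readme for the real ones):

YourMiner.exe -o stratum+tcp://POOL:PORT -u WALLET.WORKER -p x -wd 0
pause

The pause line just keeps the window open if the miner itself exits, so you can still read the last error message.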

If you cannot find a card to blame, then you may need to reinstall Windows and update it to the latest.

Hope this helps.
legendary
Activity: 4172
Merit: 8075
'The right to privacy matters'
August 02, 2017, 12:08:12 AM
#18
Model number: VCGGTX10603XGPB-OC-BB (PNY GTX 1060 3GB XLR8 version)

Yesterday I tore it down so that there were just two cards running on the 1000 W PSU. I ran it all day with a power limit of 70% and memory clock +476 on both cards individually.
I added the third card when I went to bed and woke up to a frozen screen again. So I replaced the riser on the third card.

Six hours ago, I attached two more cards to the 1000 W PSU for five total, but this time I set all five to power limit 70%, memory clock +476, core +46.

I also reinstalled the Nvidia drivers to the latest version, 384.94.

It's been running fine for the last six hours with card temperatures between 66 and 69°C. I don't have any fans on them, just the window slightly open. It's been 96°F out today; the air conditioners are in the other room.

471 W total.

I guess I'll let it run through the night and see what happens. Tomorrow I'll add the sixth card off the 850 W PSU.

So I'm not sure if it was a riser or the settings. I just hope it's working well now.


So these five let you do 70 percent TDP and ran overnight.

That sixth card could be the issue.
I have had issues like this.

Ran the oddball card in a one- or two-card rig and it worked well.

newbie
Activity: 15
Merit: 0
August 01, 2017, 11:58:37 PM
#17
Model number: VCGGTX10603XGPB-OC-BB (PNY GTX 1060 3GB XLR8 version)

Yesterday I tore it down so that there were just two cards running on the 1000 W PSU. I ran it all day with a power limit of 70% and memory clock +476 on both cards individually.
I added the third card when I went to bed and woke up to a frozen screen again. So I replaced the riser on the third card.

Six hours ago, I attached two more cards to the 1000 W PSU for five total, but this time I set all five to power limit 70%, memory clock +476, core +46.

I also reinstalled the Nvidia drivers to the latest version, 384.94.

It's been running fine for the last six hours with card temperatures between 66 and 69°C. I don't have any fans on them, just the window slightly open. It's been 96°F out today; the air conditioners are in the other room.

471 W total.

I guess I'll let it run through the night and see what happens. Tomorrow I'll add the sixth card off the 850 W PSU.

So I'm not sure if it was a riser or the settings. I just hope it's working well now.
legendary
Activity: 1848
Merit: 1165
My AR-15 ID's itself as a toaster. Want breakfast?
July 29, 2017, 07:30:40 PM
#16
I really have my $$$ on a faulty PSU.

I am not experiencing my random freezes and restarts now that my PSU temp is back down to a reasonable level... Those types of freezes and reboots really point towards the PSU.


Secondarily, test the RAM with memory-testing software... but honestly, nowadays power supplies cause all of the problems that memory used to cause back in the original SDRAM/first-gen DDR days.
full member
Activity: 298
Merit: 100
July 28, 2017, 11:45:31 PM
#15
I would suspect that either the PSU cannot handle the load and eventually craps out, or you may have a bad riser (or possibly a GPU instead). We have some PNY GPUs that will not go lower than 80% power limit in Afterburner; I believe they are the same ones you have. I can drag the slider past 80%, but monitoring the power draw, both via software and via a Kill A Watt meter at the wall, shows that the GPUs still consume ~80% power even if the slider is at 60 or 70%. I chalked it up to being hardcoded in the BIOS and didn't investigate it any further. The temps and profits both justified running them at 80%, so I didn't worry about it (I had actually forgotten until I read this thread).

I agree. I'm running 8 x 1060s on a 1000 W PSU sucking up 910 watts and it runs fine on that PSU. I suspect a riser or GPU.
sr. member
Activity: 1246
Merit: 274
July 28, 2017, 11:42:04 PM
#14
I would suspect that either the PSU cannot handle the load and eventually craps out, or you may have a bad riser (or possibly a GPU instead). We have some PNY GPUs that will not go lower than 80% power limit in Afterburner; I believe they are the same ones you have. I can drag the slider past 80%, but monitoring the power draw, both via software and via a Kill A Watt meter at the wall, shows that the GPUs still consume ~80% power even if the slider is at 60 or 70%. I chalked it up to being hardcoded in the BIOS and didn't investigate it any further. The temps and profits both justified running them at 80%, so I didn't worry about it (I had actually forgotten until I read this thread).
legendary
Activity: 1470
Merit: 1114
July 28, 2017, 10:45:30 PM
#13
It's power, the temps are fine. The 1060 is rated at 120W and you have 8 of them and the rest of the system powered by a 1000 W PSU.
Pull 2 cards from each rig and build a new rig with them or replace the PSUs with a pair of 1600 W beasts.


He has 1000 W for 5 cards and 850 W for the other three.

He said he could not turn the TDP down lower than 80%, so 8 cards x 120 W x 0.80 = 768 watts, with 1850 watts available.

He needs to see why the cards only drop to 80% TDP.

If he dropped them to 70% he would be at 8 x 120 W x 0.70 = 672 watts and he would not overheat. That extra ~96 watts of power is too hard to cool off.

Oops, that's what I get for skim reading. Still, 75°C is not that hot, and an overheated GPU won't cause a system shutdown. Either the power draw between the PSUs is unbalanced and overloading one of them, or one is defective and can't handle a normal load.

There are a few things that can be done to troubleshoot and isolate the suspect component. Split the rig and try running it with 5 cards on the big PSU, then try running it with only the other 3 cards on the second PSU. Swap cards with the other rig. Swap PSUs. Just move components around to see if the problem follows.
newbie
Activity: 15
Merit: 0
July 28, 2017, 10:35:04 PM
#12
2.5 hours. Must have been an error; Windows restarted.

So that's where I've been with this rig. It may run for two hours, it may run for four hours, it may run for one hour, but it won't run steadily.

Thankfully the other rig is running like a champ.
newbie
Activity: 15
Merit: 0
July 28, 2017, 10:17:54 PM
#11
I used a Kill A Watt to get a wattage reading from each power supply while it was mining: the 1000 W PSU is at 700 W and the 850 W PSU is at around 300 W, so 1000 W total. I took those readings when I started, without changing settings.

So now it's been running steady for two hours at 80% power, memory +400. The two hottest cards are now about 73 and 74°C. I'm holding my breath.

It's still 80°F in that room; I have the window open. It's 89°F outside.
legendary
Activity: 4172
Merit: 8075
'The right to privacy matters'
July 28, 2017, 09:03:47 PM
#10
It's power, the temps are fine. The 1060 is rated at 120W and you have 8 of them and the rest of the system powered by a 1000 W PSU.
Pull 2 cards from each rig and build a new rig with them or replace the PSUs with a pair of 1600 W beasts.


He has 1000 W for 5 cards and 850 W for the other three.

He said he could not turn the TDP down lower than 80%, so 8 cards x 120 W x 0.80 = 768 watts, with 1850 watts available.

He needs to see why the cards only drop to 80% TDP.

If he dropped them to 70% he would be at 8 x 120 W x 0.70 = 672 watts and he would not overheat. That extra ~96 watts of power is too hard to cool off.
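
If the Afterburner slider really is stuck at 80%, another thing he could try is setting the limit in watts with nvidia-smi from an admin command prompt (it ships with the driver, usually in C:\Program Files\NVIDIA Corporation\NVSMI on Windows). 84 W would be roughly 70% of the 1060's 120 W TDP, and the tool rejects values outside the card's supported range:

nvidia-smi -i 0 -pl 84

Repeat with -i 1, -i 2, and so on for the rest of the cards.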
legendary
Activity: 1470
Merit: 1114
July 28, 2017, 08:51:13 PM
#9
It's power, the temps are fine. The 1060 is rated at 120W and you have 8 of them and the rest of the system powered by a 1000 W PSU.
Pull 2 cards from each rig and build a new rig with them or replace the PSUs with a pair of 1600 W beasts.
legendary
Activity: 1848
Merit: 1165
My AR-15 ID's itself as a toaster. Want breakfast?
July 28, 2017, 08:29:39 PM
#8
Be sure it's not the power supply overheating as well... use a laser thermometer to check.

I have a 1000 W power supply that didn't like having my GTX 980 right against it in the Fractal Design Define R5 case... and if it got pretty hot inside, it would randomly freeze or reboot.

Screwing a fan to the exhaust port in the back to suck more air through it helped immensely, but ultimately I removed the card that was butted right up against it and I haven't had an issue since.
newbie
Activity: 15
Merit: 0
July 28, 2017, 07:56:25 PM
#7
I will try the suggestions.

I am also going to bring the rig out to the bigger room where it is a little cooler and see if that changes things.


I currently have it set to 80% power right now with memory +400, and I have two cards running hotter at 74 and 75°C. The lowest are at 64°C, where the fans are probably hitting them. I'm sure it will freeze pretty soon.
legendary
Activity: 4172
Merit: 8075
'The right to privacy matters'
July 28, 2017, 07:34:17 PM
#6
If you cannot lower the TDP in MSI Afterburner, something is wrong.

Go down to 1 card and see if the TDP slider works.

If it is stuck:

Uninstall MSI Afterburner and uninstall the Nvidia drivers.

Try installing Nvidia 382.33 with 1 card. Try installing MSI Afterburner 4.3 with 1 card.

See if the slider works.

Some advice: never, never, never run 100% TDP. Many will say I am wrong; I simply don't care.

If the slider does not work with that one card, try it in every slot and riser, just that one card.

If that card never lets the TDP slider work, walk it out of the room and try the next card. I have found that card 1, or 5, or 6, or even 4 can be the entire issue, and not using the one shithead card solves the problem. Most of the time.
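
As a cross-check on what Afterburner is doing, you can also read back the power limit each card is actually enforcing with nvidia-smi from a command prompt (the same nvidia-smi tool that ships with the driver):

nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv

If one card reports a limit that never moves when you drag the slider, that is probably your problem card.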
full member
Activity: 298
Merit: 100
July 28, 2017, 07:33:44 PM
#5
I really don't believe it's a temp thing; mid 60°C is perfectly fine. I'm running 60-65°C across all my cards.

Now, when you say the system dies, what happens? Stutters? Freezes? Can't use it at all and a forced restart is needed?


The rig in the small room freezes. Sometimes I can't move the mouse; the screen goes black and I just have to power it down and force a restart.

A couple of cards run at 78 degrees, a couple at 75 degrees, and some near 72 degrees, depending on where I put the big fan.

I've reinstalled drivers a few times.

I have the same board, fantastic board BTW, with the latest drivers of course. Those cards, I believe, are running hot; I don't like to see anything over 75°C. Damage doesn't start to kick in until you're nearing 90°C.
The slider issue alone is weird as heck! You should always have control of that thing no matter what. I kind of want to say it's a software issue, but it can also be hardware that isn't working properly.
You can start by lowering to the settings I sent you; then if that doesn't fix it, start removing cards one by one, then swap risers. Process of elimination.
newbie
Activity: 15
Merit: 0
July 28, 2017, 07:28:02 PM
#4
I really don't believe it's a temp thing; mid 60°C is perfectly fine. I'm running 60-65°C across all my cards.

Now, when you say the system dies, what happens? Stutters? Freezes? Can't use it at all and a forced restart is needed?


The rig in the small room freezes. Sometimes I can't move the mouse; the screen goes black and I just have to power it down and force a restart.

A couple of cards run at 78 degrees, a couple at 75 degrees, and some near 72 degrees, depending on where I put the big fan.

I've reinstalled drivers a few times.