Topic: Rig Crashing and Won't Restart (Read 1155 times)

vip
Activity: 756
Merit: 503
June 21, 2012, 08:14:02 AM
#11
Had same problem and lowered my OC settings.
donator
Activity: 686
Merit: 519
It's for the children!
June 21, 2012, 08:05:20 AM
#10
You said you were using BAMT? On a USB drive or a hard drive? I would suggest first trying a fresh copy of BAMT, as the USB or HDD images sometimes get corrupted.

Also ensure you have temperature cutoffs defined; your rig will crash (and burn) if the cards overheat.
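For anyone unsure what a temperature cutoff amounts to, here is a minimal sketch of the logic in Python. It is not BAMT's own code or config format; the function names and the 85/90 C thresholds are assumptions chosen for illustration, and the readings would have to come from whatever tool reports your card temperatures.

```python
from typing import Dict


def classify_temp(temp_c: float, warn_c: float = 85.0, cutoff_c: float = 90.0) -> str:
    """Classify one GPU temperature reading.

    warn_c and cutoff_c are illustrative thresholds, not recommended values.
    """
    if warn_c >= cutoff_c:
        raise ValueError("warn_c must be lower than cutoff_c")
    if temp_c >= cutoff_c:
        return "stop"   # stop mining on this card before it locks up
    if temp_c >= warn_c:
        return "warn"   # still running, but worth a look
    return "ok"


def plan_actions(temps_c: Dict[int, float],
                 warn_c: float = 85.0,
                 cutoff_c: float = 90.0) -> Dict[int, str]:
    """Map each GPU index to the action its current temperature calls for."""
    return {gpu: classify_temp(t, warn_c, cutoff_c) for gpu, t in temps_c.items()}


# Hypothetical readings for a three-card rig.
actions = plan_actions({0: 92.0, 1: 86.5, 2: 71.0})
```

The point of having a separate warning level is that you get a chance to add fans or lower clocks before the cutoff has to stop a card.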

There are a few million reasons for a kernel panic. On a mining rig the options are significantly reduced. Try a new BAMT image and then list your hardware:

Motherboard:
CPU:
Memory:
Cards:
Extenders or on-board:
BAMT version:
full member
Activity: 182
Merit: 100
June 20, 2012, 11:02:43 PM
#9
You said you have dual PSUs? If so, try running on one PSU - dual PSUs can cause lockups in Linux if you're not syncing them and protecting the board from spikes. Cablesaurus and others sell adapters to protect rigs from this.

You can prevent temp lockups by setting a cutoff threshold in BAMT.

You should monitor all of your miners (even just one) using SNMP (included with BAMT) or Cacti. I graph all of mine and send an SMS to myself and my tech whenever production drops by more than 10%.

Are you using extenders or are the cards directly on the board?  Which version of BAMT?

I have switched to 1 PSU, with only 2 cards on each motherboard. Now it's a little more stable, but still rebooting every now and then.

I once saw a reboot because I had the monitor plugged in at the time (most of the time it's running headless), and I saw an error on the screen related to a kernel panic. How can I troubleshoot this to find the source of the instability?
donator
Activity: 686
Merit: 519
It's for the children!
June 20, 2012, 03:14:08 PM
#8
You said you have dual PSUs? If so, try running on one PSU - dual PSUs can cause lockups in Linux if you're not syncing them and protecting the board from spikes. Cablesaurus and others sell adapters to protect rigs from this.

You can prevent temp lockups by setting a cutoff threshold in BAMT.

You should monitor all of your miners (even just one) using SNMP (included with BAMT) or Cacti. I graph all of mine and send an SMS to myself and my tech whenever production drops by more than 10%.
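The 10% rule above boils down to a small calculation. Here is a rough sketch in Python, assuming you already have a baseline hashrate and a current reading from your monitoring; the function names are hypothetical and the actual polling (SNMP, Cacti) and SMS delivery are deliberately left out.

```python
def drop_fraction(baseline_mhs: float, current_mhs: float) -> float:
    """Fraction by which production has fallen below the baseline (0.0 if not below)."""
    if baseline_mhs <= 0:
        raise ValueError("baseline_mhs must be positive")
    return max(0.0, (baseline_mhs - current_mhs) / baseline_mhs)


def should_alert(baseline_mhs: float, current_mhs: float, threshold: float = 0.10) -> bool:
    """True when production has dropped by more than the threshold (10% by default)."""
    return drop_fraction(baseline_mhs, current_mhs) > threshold


# Hypothetical numbers: a rig that normally does 2200 MH/s and now reports 1900 MH/s.
alert = should_alert(2200.0, 1900.0)
```

Comparing against a baseline rather than a fixed number means the same check works for every rig, whatever its normal hashrate is.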

Are you using extenders or are the cards directly on the board?  Which version of BAMT?
full member
Activity: 182
Merit: 100
June 18, 2012, 02:25:01 AM
#7
It might be better to fully power the system off instead of rebooting - I noticed that when I "killed" a card by overclocking (seeking the highest stable speed), a reboot did not suffice for the card to recover ...

That's a good idea. I noticed something similar: if a card is not detected for whatever reason, a reboot won't solve the issue, but shutting the system down and then turning it on again can get the card detected.

I still haven't found the root cause of the issues I'm having. I'm still not sure whether it's the card that's getting locked, or the motherboard that somehow remembers the card/slot that failed and won't let me boot up until I remove or replace that card.
newbie
Activity: 40
Merit: 0
June 17, 2012, 08:52:49 PM
#6
It might be better to fully power the system off instead of rebooting - I noticed that when I "killed" a card by overclocking (seeking the highest stable speed), a reboot did not suffice for the card to recover ...
full member
Activity: 182
Merit: 100
June 17, 2012, 06:42:54 PM
#5
I removed the 1st video card and it booted up normally.

I also noticed that the 1st and 2nd video cards lock more often. They're also drawing their air from the neighboring cards, so they have the highest temperatures of all the cards in the rig.

So the lock is probably happening because of the high temperatures, right? I think I will be able to avoid it by adding extra fans.

The other question is: how can I quickly recover from a locked card? How can I unlock it, and could that be automated instead of requiring manual work, such as testing each card to find the locked one and removing it from the rig?

And another important question: why is BAMT restarting the system? Restarting doesn't seem like a good idea to me, since the system won't come back up at all and will sit there for hours until I notice the problem and fix it manually.
newbie
Activity: 40
Merit: 0
June 17, 2012, 06:11:06 PM
#4
I can't tell you why it crashes ... but I experienced the same problem at booting machines with 2 or more ATI cards.

Linux 64bit, any version of fglrx ... sometimes the machine locks up hard when it tries to start X ... next try may succeed or not ... whenever I get it to start up I am sure to avoid shutting down as long as possible ...
full member
Activity: 182
Merit: 100
June 17, 2012, 05:22:02 AM
#3
I'm using 2x PSU model BeQuiet Pro-8 1200W, so I have a total of 2400W worth of juice.
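A back-of-the-envelope power budget helps put those 2400W in context. The sketch below is in Python; the per-card wattages are assumptions taken from the published stock board-power figures (about 294 W for a 5970 and 188 W for a 5870), the 150 W system overhead and the 80% utilisation ceiling are illustrative guesses, and overclocked cards can draw noticeably more.

```python
from typing import Dict

# Assumed stock board power in watts; overclocking raises these.
ASSUMED_BOARD_POWER_W = {"HD 5970": 294, "HD 5870": 188}


def estimated_load_w(cards: Dict[str, int], system_overhead_w: int = 150) -> int:
    """Estimate total draw: sum of card board power plus an assumed system overhead."""
    gpu_w = sum(ASSUMED_BOARD_POWER_W[model] * count for model, count in cards.items())
    return gpu_w + system_overhead_w


def fits_budget(load_w: float, psu_capacity_w: float, max_utilisation: float = 0.8) -> bool:
    """True if the load stays within the chosen fraction of the PSU capacity."""
    return load_w <= psu_capacity_w * max_utilisation


rig = {"HD 5970": 3, "HD 5870": 1}
load_w = estimated_load_w(rig)               # estimate for the full four-card rig
ok_on_both = fits_budget(load_w, 2 * 1200)   # both PSUs combined
ok_on_one = fits_budget(load_w, 1200)        # a single 1200W unit
```

This only checks total capacity. It says nothing about how the load is split across the two units or across individual rails, which is where dual-PSU setups tend to go wrong.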

When it locks, even if I disconnect all the other cards and leave only the locked one connected, the system still won't start.

I believe it hangs when it is starting X: I get a blank screen and nothing happens beyond that point. The system is also not accessible via SSH; it pretty much freezes.
zvs
legendary
Activity: 1680
Merit: 1000
https://web.archive.org/web/*/nogleg.com
June 17, 2012, 05:17:13 AM
#2
I've assembled my first rig. It's a no-brand motherboard with 4 video cards (3x 5970 + 1x 5870).

Running BAMT, it's yielding about 2.2 GH/s as of now, with some overclock tuning still pending.

My problem is that every couple of hours the rig stops working and won't restart. From my limited knowledge, it seems that some of the cards are locking up and keeping the whole machine from booting.

So usually I remove one card at a time until the machine boots successfully.

How can I "unlock" a video card? And better yet, how can I prevent these locks?

are you sure it's not the power supply?  that's what it sounds like to me
full member
Activity: 182
Merit: 100
June 17, 2012, 04:42:14 AM
#1
I've assembled my first rig. It's a no-brand motherboard with 4 video cards (3x 5970 + 1x 5870).

Running BAMT, it's yielding about 2.2 GH/s as of now, with some overclock tuning still pending.

My problem is that every couple of hours the rig stops working and won't restart. From my limited knowledge, it seems that some of the cards are locking up and keeping the whole machine from booting.

So usually I remove one card at a time until the machine boots successfully.

How can I "unlock" a video card? And better yet, how can I prevent these locks?