I made an auto-coldrebooter script for BAMT/SMOS

Demontager

jr. member

Activity: 71

Merit: 6

I just improved RLutz's original script a bit and added some more features: https://bitcointalksearch.org/topic/scriptmass-miners-control-for-bamtsmos-linux-508355
SICK/DEAD cards status also monitored via cgminer API.
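
A minimal sketch of what such a SICK/DEAD check could look like, assuming the cgminer API is enabled and answers the `devs` command with pipe-separated records that carry `GPU=<n>` and `Status=<state>` fields. The helper name `sick_gpus` and the sample reply are made up for illustration; this is not the code from the linked script.

```shell
#!/bin/bash
# Hypothetical helper: read a cgminer "devs" reply on stdin and print the
# index of every GPU whose Status is Sick or Dead, one per line.
sick_gpus() {
    tr '|' '\n' | awk -F, '
        /^GPU=/ {
            gpu = ""; status = ""
            for (n = 1; n <= NF; n++) {
                if ($n ~ /^GPU=/)    gpu = substr($n, 5)      # text after "GPU="
                if ($n ~ /^Status=/) status = substr($n, 8)   # text after "Status="
            }
            if (status == "Sick" || status == "Dead") print gpu
        }'
}

# Illustrative, shortened reply; not captured from a real rig.
sample='STATUS=S,Code=9,Msg=3 GPU(s)|GPU=0,Enabled=Y,Status=Alive,Temperature=71.00|GPU=1,Enabled=Y,Status=Sick,Temperature=42.00|GPU=2,Enabled=Y,Status=Dead,Temperature=38.00|'
printf '%s' "$sample" | sick_gpus
```

On a live rig the input would come from the API socket instead of the sample string, for example `echo -n devs | nc 127.0.0.1 4028 | sick_gpus`, assuming the API listens on its default port and `nc` is installed.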

xbudahx

full member

Activity: 378

Merit: 102

This script has saved me a couple trips home from the office to check on my rig.

Now to figure out why it's crapping out after smooth sailing for weeks.

silvetti

newbie

Activity: 37

Merit: 0

Quote from: RLutz on March 03, 2014, 10:41:35 AM

It does log which GPU had the low temp

Just re-read the code and noticed that ;)

Great!

RLutz

newbie

Activity: 11

Merit: 0

It does log which GPU had the low temp

silvetti

newbie

Activity: 37

Merit: 0

It would be awesome if, before issuing the shutdown, the script wrote to a log file which GPU had the low temperature, at what time, etc.

Anyway, great script!

RLutz

newbie

Activity: 11

Merit: 0

I improved the script yet again because I got tired of viewgpu's readings being unreliable.

It now goes off your card temperatures, which seems to be very reliable (a card that is sick or dead will quickly cool down to idle temps).

So far this hasn't failed me!

RLutz

newbie

Activity: 11

Merit: 0

When you run it manually, that process should be dead; the message just means it completed successfully. Basically, there's a watchdog that gives the viewgpu command only a few seconds to complete. Without it, a stuck viewgpu command would make the script wait for it indefinitely. If viewgpu runs normally, it will already have terminated by the time kill goes to kill the process, hence the "No such process" message.
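
The watchdog idea described above can be sketched roughly like this. It is not the exact code from the script: `run_with_timeout` is a made-up helper name, and the harmless "No such process" message is silenced by sending kill's errors to /dev/null.

```shell
#!/bin/bash
# Hypothetical helper: run a command, but kill it if it takes too long.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
# Returns the command's own exit status (non-zero if it had to be killed).
run_with_timeout() {
    local limit=$1
    shift
    "$@" &
    local pid=$!
    # Watchdog: after $limit seconds, kill the command if it still exists.
    ( sleep "$limit"; kill "$pid" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog=$!
    local status=0
    wait "$pid" 2>/dev/null || status=$?
    # The command is done either way, so the watchdog is no longer needed.
    kill "$watchdog" 2>/dev/null || true
    wait "$watchdog" 2>/dev/null || true
    return "$status"
}

# Quick demonstration with a command that finishes well inside the limit.
run_with_timeout 2 sleep 0 && echo "finished in time"
```

In the rebooter this would wrap the viewgpu call, e.g. `run_with_timeout 10 /opt/bamt/viewgpu`. Where GNU coreutils' `timeout` command is available, `timeout 10 /opt/bamt/viewgpu` should do much the same job.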

If you weren't able to SSH in, it's possible that the kernel just blew up. At that point nothing will help you, not even the Linux magic keys. You'll have to use the power button if the kernel explodes.

If you ever see a situation where it doesn't reboot and a card is dead (and you can still actually interact with the system), run the viewgpu command and post the output here.

But yeah, a malfunctioning Linux kernel, whether it's a kernel panic or some other kernel explosion, means that your only likely resolution is actually powering it off.

xbudahx

full member

Activity: 378

Merit: 102

Quote from: RLutz on March 01, 2014, 01:42:30 PM

Quote from: xbudahx on March 01, 2014, 10:45:40 AM

I had my GPUs stop spinning last night and the rig halted, it didn't reboot though and I'm running the script.

Here is what I see when manually running it, I'm using SMOS 1.3

/root/autoRebooter.sh: line 11: kill: (10583) - No such process

Hmmm, that line means that the viewgpu process likely completed just fine. Next time that happens, can you run viewgpu and paste me the output? It's possible that there is some failure output that the script doesn't check for yet.

Some people have mentioned that coldreboot fails sometimes, perhaps I should just remove the /sbin/coldreboot and always just use Linux magic keys (echo s, echo b, etc).

I see the same message whenever I run it manually. When the rig went down, I wasn't able to SSH into it so couldn't tell the status.

root@smos-1:~# /opt/bamt/viewgpu
0: 71.0c 480.00 Mh/s http://eu.betarigs.com:3333
1: 68.0c 480.00 Mh/s http://eu.betarigs.com:3333
2: 69.0c 480.00 Mh/s http://eu.betarigs.com:3333

RLutz

newbie

Activity: 11

Merit: 0

Quote from: xbudahx on March 01, 2014, 10:45:40 AM

I had my GPUs stop spinning last night and the rig halted, it didn't reboot though and I'm running the script.

Here is what I see when manually running it, I'm using SMOS 1.3

/root/autoRebooter.sh: line 11: kill: (10583) - No such process

Hmmm, that line means that the viewgpu process likely completed just fine. Next time that happens, can you run viewgpu and paste me the output? It's possible that there is some failure output that the script doesn't check for yet.

Some people have mentioned that coldreboot fails sometimes, perhaps I should just remove the /sbin/coldreboot and always just use Linux magic keys (echo s, echo b, etc).

xbudahx

full member

Activity: 378

Merit: 102

I had my GPUs stop spinning last night and the rig halted, it didn't reboot though and I'm running the script.

Here is what I see when manually running it, I'm using SMOS 1.3

/root/autoRebooter.sh: line 11: kill: (10583) - No such process

RLutz

newbie

Activity: 11

Merit: 0

One last edit to clean things up. Currently it deals with any edge cases I've seen come up (coldreboot fails, viewgpu hangs, etc).

RLutz

newbie

Activity: 11

Merit: 0

I edited the script in the OP to take care of some additional edge cases where viewgpu command might hang indefinitely.

hostmaster

sr. member

Activity: 266

Merit: 250

thanks very useful data.

RLutz

newbie

Activity: 11

Merit: 0

Some people have mentioned that coldreboot can sometimes fail. While I've never had this problem, you could try using the Linux magic keys (SysRq) as a failover if coldreboot fails: just sleep for 30 seconds or so, then sync the disks (s) and force a reboot (b).

Code:

#after the /sbin/coldreboot line
sleep 30
echo s > /proc/sysrq-trigger
sleep 10
echo b > /proc/sysrq-trigger

crazyates

legendary

Activity: 952

Merit: 1000

Trying this out as well. Thanks!

xbudahx

full member

Activity: 378

Merit: 102

I'm running it, but don't see too many hardware issues. I will check back if it happens.

SR0G

newbie

Activity: 3

Merit: 0

This will be very useful, but my BAMT 1.3 does not have crontab installed.

Did you have to install it?

edit: nevermind.. I got it working! Thanks.

Okilo

newbie

Activity: 37

Merit: 0

Can anyone confirm if this works?

RLutz

newbie

Activity: 11

Merit: 0

I just got my rig set up a few days ago, and the one thing that really bugged me, especially when I was playing around with different settings, is that if a card was unstable and crashed, the only way to fix it was to coldreboot.

You could have a card that crashes only once every few days, but when it does you still have to issue a coldreboot. What if you're away from your computer? Think of all those precious coins you could be losing!

Anyway, this script will take care of that for you. I'm using LTCrabbit's customized SMOS-Linux, but it should work for any other similar distro.

First we'll need to make a script. You can use nano or vim or whatever you prefer; I'll write the tutorial using nano, since it's probably the easiest way to go if you're a Linux newbie. Fire up a root terminal, then run:

Code:

nano /root/autoRebooter.sh

Paste the following contents into that file (make sure to edit your targetMinTemp accordingly!!!):

Code:

#!/bin/bash

# Set your targeted minimum temp here. The system will issue a cold
# reboot if any card's temp falls below this number.
targetMinTemp=50

# The log goes into the home directory of the user with UID 1000.
logFile="/home/$(awk -F: '$3 == 1000 { print $1; exit }' /etc/passwd)/autoRebooter.log"

# Log the reason ($1), try coldreboot, then fall back to the magic SysRq
# keys (s = sync disks, b = reboot) in case coldreboot itself fails.
logAndReboot() {
    echo "$(date +%m-%d-%Y) $(uptime | awk -F, '{sub(".*ge ",x,$1);print $1}') $1" >> "$logFile"
    /sbin/coldreboot &
    sleep 30
    echo s > /proc/sysrq-trigger
    sleep 10
    echo b > /proc/sysrq-trigger
}

# Run viewgpu in the background so that a hung viewgpu cannot stall the
# script. Only the whole-degree part of each temp is kept (71 from 71.0c).
(/opt/bamt/viewgpu | awk '{ print $2 }' | cut -d. -f1 > /tmp/viewgpu) & pid=$!
sleep 10
# Kill it if it is still running. The error is silenced because viewgpu
# has normally finished by now ("No such process").
kill $pid 2>/dev/null
# Wait a little longer before reading the results.
sleep 15
array=($(cat /tmp/viewgpu))
if [ ${#array[@]} -eq 0 ]; then
    logAndReboot "viewgpu command failed to run, rebooting"
fi
i=0
for temp in "${array[@]}"; do
    if [ "$temp" -lt "$targetMinTemp" ]; then
        logAndReboot "card number $i has stopped, its current temp is $temp, coldrebooting"
    fi
    i=$((i+1))
done
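
To try the temperature check on its own without risking a reboot, something like the following sketch could be used. It assumes viewgpu prints lines in the format posted elsewhere in this thread (`<card>: <temp>c <hashrate> ...`); the helper name `cold_cards` and the third sample line are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read viewgpu-style lines on stdin and print
# "<card> <temp>" for every card whose temperature is below the limit ($1).
cold_cards() {
    awk -v min="$1" '{
        card = $1; sub(/:$/, "", card)   # "0:" becomes "0"
        temp = $2 + 0                    # "71.0c" becomes 71
        if (temp < min) print card, temp
    }'
}

# The first two lines are copied from the viewgpu output posted in this
# thread; the third is invented to stand in for a card cooled to idle.
sample='0: 71.0c 480.00 Mh/s http://eu.betarigs.com:3333
1: 68.0c 480.00 Mh/s http://eu.betarigs.com:3333
2: 34.0c 0.00 Mh/s http://eu.betarigs.com:3333'

printf '%s\n' "$sample" | cold_cards 50
```

On a rig, `/opt/bamt/viewgpu | cold_cards 50` should list any card that would trigger a reboot, which makes it easier to pick a sensible targetMinTemp before enabling the cron job.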

Use ctrl+o to write the file out, then ctrl+x to exit nano.

Next you'll need to make the script executable

Code:

chmod a+x /root/autoRebooter.sh

Lastly, we'll need to add a cronjob to periodically check in. I set it to run every hour.

Code:

crontab -e

Add the following line to the end of crontab

Code:

0 */1 * * * /root/autoRebooter.sh

ctrl+o to write it out, ctrl+x to exit.
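
If you'd rather not edit the crontab by hand, here is a rough sketch of adding the line from the command line. The helper name `with_cron_line` is made up; it only filters text, so it is safe to try on its own before piping anything into crontab.

```shell
#!/bin/bash
# Hypothetical helper: read an existing crontab on stdin and print it back,
# appending the given line ($1) only if it is not already there.
with_cron_line() {
    awk -v line="$1" '
        { print }
        $0 == line { found = 1 }
        END { if (!found) print line }'
}

# Demonstration on made-up crontab contents; nothing is installed here.
job='0 */1 * * * /root/autoRebooter.sh'
printf '%s\n' '# existing entries' | with_cron_line "$job"
```

To actually install it, the usual idiom should work: `crontab -l 2>/dev/null | with_cron_line '0 */1 * * * /root/autoRebooter.sh' | crontab -`. Running it twice does not add the line twice.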

There you go, now you never have to worry about a crashed GPU bringing down your hashrate ever again!

If you found this helpful, I'm currently at a whopping 2.3 LTC and 0.01 BTC and would love to have a few fractions more!

LTC: Lhb3yJGPL9dsUZ2tt5KrbNMm3pVmmA1fkb

BTC: 1NKkGEsY5UwkzSmD63yBcJj9hkrS4YWsbX

edit: I've made improvements in case viewgpu gets stuck or coldreboot fails. Tested and verified to work!
