
Topic: I made an auto-coldrebooter script for BAMT/SMOS (Read 1782 times)

newbie
Activity: 63
Merit: 0
I slightly improved Rlutz's original script and added some more features: https://bitcointalksearch.org/topic/scriptmass-miners-control-for-bamtsmos-linux-508355
SICK/DEAD card status is also monitored via the cgminer API.
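
For anyone curious what a SICK/DEAD check through the cgminer API could look like, here is a rough sketch. It is not taken from the linked script. The function name unhealthy_gpus is made up for illustration, and it assumes the text-style reply to the API "devs" command, where records are separated by "|" and each GPU record carries a "Status=" field.

```shell
#!/bin/bash
# Hypothetical helper (not from the linked script).
# Reads a cgminer API "devs" reply on stdin and prints the index of every
# GPU whose Status field is anything other than "Alive", one per line.
unhealthy_gpus() {
  tr '|' '\n' | awk -F, '
    /^GPU=/ {
      id = ""; status = ""
      for (n = 1; n <= NF; n++) {
        if ($n ~ /^GPU=/)    id = substr($n, 5)      # text after "GPU="
        if ($n ~ /^Status=/) status = substr($n, 8)  # text after "Status="
      }
      if (status != "Alive") print id
    }'
}
```

Assuming the API is enabled on its default port 4028 and nc is installed, it could be fed with something like: echo -n devs | nc 127.0.0.1 4028 | unhealthy_gpus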
full member
Activity: 378
Merit: 102
This script has saved me a couple trips home from the office to check on my rig.


Now to figure out why it's crapping out after smooth sailing for weeks.
newbie
Activity: 37
Merit: 0
It does log which GPU had the low temp


Just re-read the code and noticed that ;)

Great!
newbie
Activity: 11
Merit: 0
It does log which GPU had the low temp
newbie
Activity: 37
Merit: 0
It would be awesome if, before issuing the shutdown, the script wrote to a log file which GPU had the low temperature, at what time, etc.

Anyway, great script!
newbie
Activity: 11
Merit: 0
I improved the script yet again because I got tired of viewgpu's readings being unreliable.

It now goes off your card temperatures, which seems to be very reliable (a card that is sick or dead will quickly cool down to idle temps).

So far this hasn't failed me!
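
To illustrate the idea, here is a minimal sketch of the temperature check on its own. It assumes viewgpu prints lines like "0: 71.0c 480.00 Mh/s <pool url>"; the function name cards_below is just something picked for this example, and the real script parses the output a bit differently.

```shell
#!/bin/bash
# Hypothetical helper for illustration only.
# Reads viewgpu-style lines on stdin and prints "index temp" for every
# card whose temperature is below the minimum passed as the first argument.
cards_below() {
  local min_temp=$1
  awk -v min="$min_temp" '
    $2 ~ /^[0-9.]+c$/ {
      idx = $1;  sub(/:$/, "", idx)    # "1:"    -> "1"
      temp = $2; sub(/c$/, "", temp)   # "38.0c" -> "38.0"
      if (temp + 0 < min + 0) print idx, temp
    }'
}
```

Something like /opt/bamt/viewgpu | cards_below 50 would then list only the cards that have cooled below 50 degrees. Lines that don't look like a temperature reading are skipped instead of being compared.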
newbie
Activity: 11
Merit: 0
When you run it manually, that process should already be dead; the message just means viewgpu completed successfully. Basically, there's a watchdog that gives the viewgpu command only a few seconds to complete. Without it, a stuck viewgpu would make the script wait indefinitely. If viewgpu runs normally, it has already exited by the time kill goes to kill the process, hence the message.
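
Here is a rough sketch of that watchdog idea as a reusable function. This is a variation, not what the script in the OP does (the OP simply runs sleep 10 and then kill). The name run_with_watchdog and the 124 return code are choices made for this example.

```shell
#!/bin/bash
# Hypothetical helper for illustration only.
# Usage: run_with_watchdog SECONDS COMMAND [ARGS...]
# Returns the command's own exit status, or 124 if it was still running
# after SECONDS and had to be killed.
run_with_watchdog() {
  local limit=$1; shift
  "$@" &
  local cmd_pid=$!
  local waited=0
  # Poll once per second while the command is still alive.
  while kill -0 "$cmd_pid" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      kill "$cmd_pid" 2>/dev/null
      wait "$cmd_pid" 2>/dev/null
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # The command finished by itself; hand back its exit status.
  wait "$cmd_pid"
}
```

Because it only kills a process that is still alive, a normal run doesn't print the "No such process" message. If your distro ships the coreutils timeout command, timeout 10 /opt/bamt/viewgpu should do much the same job.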

If you weren't able to SSH in, it's possible that the kernel just blew up. At that point nothing will help you, not even Linux magic keys. You'll have to use the power button if the kernel explodes.

If you ever see a situation where it doesn't reboot and a card is dead (and you can still actually interact with the system) run the viewgpu command and give the output back.

But yeah, a malfunctioning Linux kernel, whether it's a kernel panic or some other kernel explosion means that your only likely resolution is actually powering it off.
full member
Activity: 378
Merit: 102
I see the same message whenever I run it manually. When the rig went down, I wasn't able to SSH into it so couldn't tell the status.

root@smos-1:~# /opt/bamt/viewgpu
0: 71.0c 480.00 Mh/s http://eu.betarigs.com:3333
1: 68.0c 480.00 Mh/s http://eu.betarigs.com:3333
2: 69.0c 480.00 Mh/s http://eu.betarigs.com:3333
newbie
Activity: 11
Merit: 0
Hmmm, that line means that the viewgpu process likely completed just fine. Next time that happens, can you run viewgpu and paste me the output? It's possible that there is some failure output that the script doesn't check for yet.

Some people have mentioned that coldreboot fails sometimes, perhaps I should just remove the /sbin/coldreboot and always just use Linux magic keys (echo s, echo b, etc).
full member
Activity: 378
Merit: 102
I had my GPUs stop spinning last night and the rig halted, it didn't reboot though and I'm running the script.

Here is what I see when manually running it, I'm using SMOS 1.3

/root/autoRebooter.sh: line 11: kill: (10583) - No such process
newbie
Activity: 11
Merit: 0
One last edit to clean things up. Currently it deals with any edge cases I've seen come up (coldreboot fails, viewgpu hangs, etc).
newbie
Activity: 11
Merit: 0
I edited the script in the OP to take care of some additional edge cases where viewgpu command might hang indefinitely.
sr. member
Activity: 266
Merit: 250
Thanks, very useful data.
newbie
Activity: 11
Merit: 0
Some people have mentioned that coldreboot can sometimes fail. I've never had this problem, but you could use Linux magic keys as a failover if coldreboot fails: just sleep for 30 seconds or so, then sync the disks (s) and reboot (b):

Code:
#after the /sbin/coldreboot line
sleep 30
echo s > /proc/sysrq-trigger
sleep 10
echo b > /proc/sysrq-trigger
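
If you'd rather not paste that sequence in more than one place, it could be wrapped in a function along these lines. This is only a sketch: force_reboot, run and the DRY_RUN switch are names invented for this example, and the delays are the same 30 and 10 seconds by default.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.
# With DRY_RUN=1 the commands are printed instead of executed, so the
# sequence can be rehearsed without actually rebooting the rig.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Try coldreboot first, then fall back to the magic SysRq trigger.
force_reboot() {
  local grace=${1:-30}       # seconds to give coldreboot before falling back
  local sync_wait=${2:-10}   # seconds between the sync and the reboot
  run /sbin/coldreboot &
  sleep "$grace"
  run sh -c 'echo s > /proc/sysrq-trigger'   # s = sync disks
  sleep "$sync_wait"
  run sh -c 'echo b > /proc/sysrq-trigger'   # b = reboot immediately
}
```

Running DRY_RUN=1 force_reboot 1 0 prints the three steps and touches nothing, which is a cheap way to check the order before trusting it on a live rig.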
legendary
Activity: 952
Merit: 1000
Trying this out as well. Thanks!
full member
Activity: 378
Merit: 102
I'm running it, but don't see too many hardware issues. I will check back if it happens.
newbie
Activity: 3
Merit: 0
This will be very useful, but my BAMT 1.3 does not have crontab installed.

Did you have to install it?

edit: never mind, I got it working! Thanks.
newbie
Activity: 37
Merit: 0
Can anyone confirm if this works?
newbie
Activity: 11
Merit: 0
I just got my rig set up a few days ago, and the one thing that really bugged me, especially when I was playing around with different settings, was that if a card was unstable and crashed, the only way to fix it was to coldreboot.

You could have a card that only crashes once every few days, but when it does, you have to issue a coldreboot. What if you're away from the computer? Think of all those precious coins you could be losing!

Anyway, this script will take care of that for you. I'm using LTCrabbit's customized SMOS-Linux, but it should work for any other similar distro.

First we'll need to make a script. You can use nano, vim, or whatever you prefer. I'll write the tutorial using nano, since it's probably the easiest way to go if you're a Linux newbie. Fire up a root terminal, then run:

Code:
nano /root/autoRebooter.sh

Paste the following contents into that file (make sure to edit your targetMinTemp accordingly!!!):
Code:
#!/bin/bash

#Set your targeted minimum temp here, system will issue a cold
#reboot if a card temp falls below this number
targetMinTemp=50
i=0
#Run viewgpu in the background and keep the first two characters of each
#card's temperature (assumes two-digit temperatures)
(/opt/bamt/viewgpu | awk '{ print $2; }' | cut -c -2 > /tmp/viewgpu) & pid=$!
echo $pid
#Watchdog: give viewgpu 10 seconds, then kill it in case it hung
#("No such process" here just means viewgpu had already finished)
(sleep 10 && kill $pid)
sleep 15
array=(`cat /tmp/viewgpu`)
#No readings at all means viewgpu failed or hung, so reboot
if [ ${#array[@]} -eq 0 ]; then
  echo "`date +%m-%d-%Y` `uptime | awk -F, '{sub(".*ge ",x,$1);print $1}'` viewgpu command failed to run, rebooting" >> /home/$(grep '1000' /etc/passwd | cut -d ':' -f 1)/autoRebooter.log
  /sbin/coldreboot &
  #Fall back to magic SysRq (s = sync, b = reboot) if coldreboot fails
  sleep 30
  echo s > /proc/sysrq-trigger
  sleep 10
  echo b > /proc/sysrq-trigger
fi
#Reboot if any card has cooled below the target temperature
for temp in ${array[@]}; do
  if [ $temp -lt $targetMinTemp ]; then
    echo "`date +%m-%d-%Y` `uptime | awk -F, '{sub(".*ge ",x,$1);print $1}'` card number $i has stopped, its current temp is $temp, coldrebooting" >> /home/$(grep '1000' /etc/passwd | cut -d ':' -f 1)/autoRebooter.log
    /sbin/coldreboot &
    sleep 30
    echo s > /proc/sysrq-trigger
    sleep 10
    echo b > /proc/sysrq-trigger
  fi
  i=$(($i+1))
done

Use ctrl+o to write the file out, then ctrl+x to exit nano.

Next you'll need to make the script executable

Code:
chmod a+x /root/autoRebooter.sh

Lastly, we'll need to add a cronjob to periodically check in. I set it to run every hour.

Code:
crontab -e

Add the following line to the end of crontab

Code:
0 */1 * * * /root/autoRebooter.sh

Use ctrl+o to write it out, then ctrl+x to exit.
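
If you'd rather script this step than edit by hand, something like the sketch below could work. The function name with_cron_job is made up for this example. It takes the current crontab on stdin and prints it back with the job appended, unless the exact same line is already there, so running it twice won't add a duplicate entry.

```shell
#!/bin/bash
# Hypothetical helper for illustration only.
# Reads an existing crontab on stdin and prints it with the given job line
# appended, unless an identical line is already present.
with_cron_job() {
  local job=$1
  awk -v job="$job" '
    { print; if ($0 == job) found = 1 }
    END { if (!found) print job }'
}
```

Assuming your crontab accepts a new table on stdin via "crontab -", usage would be along the lines of: crontab -l 2>/dev/null | with_cron_job '0 */1 * * * /root/autoRebooter.sh' | crontab -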

There you go, now you never have to worry about a crashed GPU bringing down your hashrate ever again!

If you found this helpful, I'm currently at a whopping 2.3 LTC and 0.01 BTC and would love to have a few fractions more!

LTC: Lhb3yJGPL9dsUZ2tt5KrbNMm3pVmmA1fkb

BTC: 1NKkGEsY5UwkzSmD63yBcJj9hkrS4YWsbX

edit: I've made improvements in case viewgpu gets stuck or coldreboot fails. Tested and verified to work!