Topic: UNATTENDED RIG: Automatic overclocking, FANs control, mail sending, HW safe guarded

jr. member
Activity: 56
Merit: 2

Big up!

Does anyone have a script like this for Win10 / NVIDIA?

I'm looking more for a crash guard that watches the miner at work:
when the miner or its log reports an error,
check which card caused the error, underclock it a bit, and restart the card / miner.
newbie
Activity: 8
Merit: 0
Nice work NEICH, ideal for preventing temperature peaks on hot days. I will try it in summer!
newbie
Activity: 13
Merit: 0


DEAR BITCOINERS:

I am very glad to share with you the result of several months of work with my dear mining RIG, which is equipped with 3 ATI Radeon HD 5870 cards.
It is an unattended, automatically controlled mining RIG that has given me a lot of benefits (hardware safeguarding, automatic settings based on temperature, fan control, mail sending, etc.). I hope it will be useful for others.
Of course I know that mining with GPUs is not really profitable now, but I am also waiting for an ASIC with 30 GH/s!! My idea is to adapt all these scripts to ASIC mining as soon as possible and share them with you if I see that you find this post valuable.

Note that I provide all this work absolutely free! However, I would be really grateful to receive donations to my BTC address. That would encourage me to publish other ideas and code for the BTC community.
(You can find my address at the end of the post).

First of all, let me indicate the benefits obtained from using these scripts:
  • All scripts are automatically started when switching on the system
  • All scripts can be  manually started by a single command
  • All scripts can be manually stopped by a single command
  • There is a control script in charge of the following actions:
    • Automatic overclocking of each GPU based on the GPU temperature, target temperature, etc.
    • Automatic downclocking of each GPU based on the GPU temperature, target temperature, etc.
    • Automatic restarting of the mining processes if they stop abnormally (5 retries)
    • Sending of mails if the retry limit is reached, in order to inform the user of the problem
    • Automatic shutdown of the mining processes if the temperature is very high. This prevents a worse hardware failure in the GPUs
    • Automatic restarting of the mining processes once the temperature is safe for mining again
    • Automatic FAN control for each GPU from 30% to 100% of speed based on the temperature
  • There is a monitor script continuously sensing all system parameters for user supervision:
    • Current GPUs clocks 
    • Current FAN clocks 
    • Current temperature for each GPU
    • Temperatures of CPU, Motherboard, etc
    • Current hashrate for each GPU
    • Last 10 control scripts actions
  • The monitor script can be used through an SSH or telnet connection. In my case, I supervise my RIG through SSH from my mobile by using the 'ConnectBot' application (see the Android or iPhone markets)
 
Before using the scripts, have a look at the base system. Your system may differ from this one, but I provide all the code so that you can make any corrections needed to adapt everything to your system. The base system for running these scripts is the following:
  • Linux OS (I use Debian). All the scripts are written in Linux shell.
  • ATI graphics cards (I have 3 ATI HD 5870). Note that I use the 'aticonfig' software for almost everything
  • ATI drivers already installed, AMD SDK, etc. These guides were very useful for me:
  • lm-sensors package already installed (use apt-get or aptitude). This is for retrieving the CPU and motherboard temperatures
  • Screen package pre-installed (apt-get install screen). This tool lets us run all scripts from the GUI while sharing the console output with other sessions. This lets us monitor the system from telnet or SSH, as the console is shared (thanks to screen!!).
  • I use poclbm.py for GPU mining. Other mining software will be OK, but you will need to change the related piece of code in the scripts
  • My system is fully unattended... so I have also installed VNC software for remote connection to the desktop (I have my wallet there)
  • I also have automatic login enabled, so that when I switch on the system, the user is logged in, the scripts are automatically started and the wallet application is also started and gets updated with the BTC network transactions. (In my case, the RIG is installed in a cooled place very far from my house)
  • This script uses the "reboot" and "halt" commands without a sudo password. To set this up, read this link: http://sleekmason.wordpress.com/fluxbox/using-etcsudoers-to-allow-shutdownrestart-without-password/
  • I use the "mail" command for sending mails. The mail command must be available to the script (look for it on the internet).
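
Since the scripts depend on several external tools, a quick check before the first run can save some debugging. The following is only a minimal sketch: the function name missing_tools is made up for this example, and the tool list is simply the set of commands named in the list above.

```shell
#!/bin/bash
# Print every command from the argument list that is not found in PATH.
# (missing_tools is a hypothetical helper, not part of the original scripts.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Commands that the scripts in this post rely on.
missing=$(missing_tools aticonfig sensors screen mail)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools were found."
fi
```

If anything is reported as missing, install it before starting the control script.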



    OK, after this brief introduction, let's go to the scripts. I hope you enjoy them!!


    I have the following little scripts for starting each individual mining process on my GPUs:

    gpu0.sh
    Code:
    #!/bin/bash
    export DISPLAY=:0.0
    cd /home/your_path/scripts
    DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 75"
    DISPLAY=:0 ./poclbm.py -d0 -v -r 5 -w128 http://[email protected]:[email protected]:8332 | tee mining_gpu0.log

    As you can see, I use a pool for mining (deepbit). You should change this line to use your specific parameters for poclbm.py or other mining software.
    Note that I pipe the output to the 'tee' command in order to store the output of the process in a log file called "mining_gpu0.log". This will be useful later for monitoring the script output, as we will retrieve the hashrate for each GPU from these log files.
    Note the parameter -r 5: it makes the poclbm.py script update its output (hashrate) once every 5 seconds. The reason for this will be discussed later...
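
To illustrate how a hashrate could be read back from such a log file, here is a small sketch. It rests on assumptions: that the miner prints figures followed by a unit such as "MH/s", and that the status line may be refreshed with carriage returns instead of newlines. The function name last_hashrate and the sample log text are invented for this example and are not real poclbm output.

```shell
#!/bin/bash
# Print the last hashrate figure found in a miner log.
# Assumption: figures look like "312.45 MH/s"; carriage returns are
# turned into newlines first so that each status refresh is one line.
last_hashrate() {
    tr '\r' '\n' < "$1" | grep -o '[0-9][0-9.]* *[MGk]H/s' | tail -n 1
}

# Demo with a hypothetical log excerpt.
demo_log=$(mktemp)
printf 'connecting...\r310.10 MH/s\r312.45 MH/s\n' > "$demo_log"
last_hashrate "$demo_log"    # prints: 312.45 MH/s
rm -f "$demo_log"
```

With the scripts above, it would be called as last_hashrate mining_gpu0.log, provided the log format matches the assumption.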

    You should create additional gpux.sh files, one for each GPU. In my case I have 3 GPUs, so I have these additional scripts:

    gpu1.sh
    Code:
    #!/bin/bash
    export DISPLAY=:0.1
    cd /home/your_path/scripts
    DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 75"
    DISPLAY=:0 ./poclbm.py -d1 -v -r 5 -w128 http://[email protected]:[email protected]:8332 | tee mining_gpu1.log


    gpu2.sh
    Code:
    #!/bin/bash
    export DISPLAY=:0.2
    cd /home/your_path/scripts
    DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 75"
    DISPLAY=:0 ./poclbm.py -d2 -v -r 5 -w128 http://[email protected]:[email protected]:8332 | tee mining_gpu2.log

    As I said, I use the "screen" Linux tool for sharing the output of the running commands. For example, we can run the script gpu0.sh in a shared console with the following command:
    Code:
    /usr/bin/screen -admS gpu0 ./gpu0.sh

    This will create a shared console called "gpu0" that can be accessed through telnet or ssh with the following command:
    Code:
    screen -x gpu0
    Therefore, we can watch the output of gpu0.sh as it runs. Note that to leave a shared console you have to use 'CTRL+A' and then 'D' (to detach from the shared console). Otherwise, you may stop the execution of gpu0.sh in that console.
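
To check from a script whether a given shared console exists, one option is to look for its name in the output of "screen -ls". The sketch below assumes that session lines look like "<pid>.<name>" followed by whitespace and a state such as "(Detached)"; the exact layout may vary between screen versions. The helper name has_session and the sample listing are made up for this example.

```shell
#!/bin/bash
# Succeeds if a session named $1 appears in a `screen -ls` listing read from stdin.
# Assumption: session lines look like "<pid>.<name><whitespace>(Detached)".
has_session() {
    grep -q "[0-9]\.$1[[:space:]]"
}

# Sample listing for the demo; on a real system pipe `screen -ls` instead.
sample=$(printf 'There is a screen on:\n\t4321.gpu0\t(Detached)\n1 Socket in /var/run/screen/S-your_user.\n')

if printf '%s\n' "$sample" | has_session gpu0; then
    echo "gpu0 is running"
fi
if ! printf '%s\n' "$sample" | has_session gpu1; then
    echo "gpu1 is not running"
fi
```

On the RIG itself the call would be: screen -ls | has_session gpu0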

    Now, we can define the script start.sh that will launch the mining scripts:

    start.sh
    Code:
    #!/bin/bash

    cd /home/your_path/scripts
    echo Starting mining scripts...
    /usr/bin/screen -admS gpu0 ./gpu0.sh
    /usr/bin/screen -admS gpu1 ./gpu1.sh
    /usr/bin/screen -admS gpu2 ./gpu2.sh
    ...

    Now, you can add this script (/home/your_path/start.sh) to your startup programs group. You can easily do it from the system menu.
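
For rigs with a different number of GPUs, the launch lines of start.sh can also be written as a loop. This is only a sketch under assumptions: NUM_GPUS and DRY_RUN are names introduced here, and the gpuN.sh scripts are expected to sit in the current directory as in start.sh. With DRY_RUN=1 (the default in this sketch) the commands are only printed, so it can be tried without launching anything.

```shell
#!/bin/bash
# NUM_GPUS and DRY_RUN are names introduced for this sketch only.
NUM_GPUS=${NUM_GPUS:-3}
DRY_RUN=${DRY_RUN:-1}    # 1: only print the commands; 0: really launch them

# Start one detached screen session per GPU script (gpu0.sh, gpu1.sh, ...).
start_all() {
    i=0
    while [ "$i" -lt "$NUM_GPUS" ]; do
        if [ "$DRY_RUN" -eq 1 ]; then
            echo "/usr/bin/screen -admS gpu$i ./gpu$i.sh"
        else
            /usr/bin/screen -admS "gpu$i" "./gpu$i.sh"
        fi
        i=$((i + 1))
    done
}

start_all
```

Setting DRY_RUN=0 makes it run the same screen commands that start.sh lists one by one.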

    The control script has a lot of features. It is full of comments, so I expect you will find enough information there.
    control.sh
    Code:
    #!/bin/bash
    #---time constants
    control_time=5       # Time cycle between control loops (5 seconds)
    overclock_delay=180  # waiting time between overclocking commands, as a multiple of control_time (180*5 s = 15 min)
    downclock_delay=60   # waiting time between downclocking commands (60*5 s = 5 min)
    downclock_urgent=24  # waiting time between urgent downclocking commands (24*5 s = 2 min)
    timeCounter=0        # time counter

    #---GPUs temperatures
    target_temp=75      # Target temp for the GPUs. Automatic overclocking/downclocking is performed to reach this temperature as a maximum.
    hightemp_alarm=80   # Alarm temperature: if exceeded, an urgent downclocking is performed
    maxtemp_stop=83     # Maximum temperature in the GPUs: the mining process is stopped for safety reasons.
    temp_recover=65     # Recovery temperature: after a mining stop due to high temperature, mining is started again once this safe temperature is reached.
    control_gap=3       # Margin below target_temp that must be exceeded before an overclock command is sent. (If the current temperature is very near the target temp, overclocking is not performed... we keep the temp. near but below the limit)

    #---CPU temperatures
    tempCPU_halt=70     # If this temperature is reached by the CPU or motherboard, a HALT is performed to turn off the RIG.

    #---Clock limits
    corefreq_min=800    # Minimum freq. to be set by the control algorithm
    corefreq_max0=945   # Maximum freq. to be set by the control algorithm in GPU0 (In my case I checked that above 975Mhz this GPU hangs the X session)
    corefreq_max1=955   # Maximum freq. to be set by the control algorithm in GPU1 (In my case I checked that above 995Mhz the mining process got zombie)
    corefreq_max2=1025  # Maximum freq. to be set by the control algorithm in GPU2 (In my case I checked that above 1055Mhz the mining process got zombie)
    mem_freq=300       # Fixed value for memory clock (normally it is 1200MHz in the GPUs, but using 300MHz reduces the temperature without affecting the performance)

    #--Mail sending
    subject="Important advice from your RIG"
    mail1="[email protected]"
    mail2="[email protected]"
    mail3="[email protected]"

    #--control constants
    retryMiningAfterFailure=1  # Mining scripts are automatically started after a failure
    debug=0       # enable/disable debugging messages
    numRetries=5       # Limit of retries for restarting the mining processes in the GPUs
    reboot=1                   # If zombie mining processes are detected, the control script can perform an automatic system reboot. This will recover mining in all GPUs.

    #--FAN constants
    FANGPU0=75
    FANGPU1=75
    FANGPU2=75

    #--Internal variables
    GPU0=0
    GPU1=1
    GPU2=2
    mining_stopped=0    # Mining process has been stopped by the control algorithm
    init_coreCLK0=900   # initial overclocking value for GPU0
    init_coreCLK1=900   # initial overclocking value for GPU1
    init_coreCLK2=900   # initial overclocking value for GPU2
    counterLastCLK0=0   # Stores the time of the last overclocking/downclocking performed on GPU0
    counterLastCLK1=0   # Stores the time of the last overclocking/downclocking performed on GPU1
    counterLastCLK2=0   # Stores the time of the last overclocking/downclocking performed on GPU2
    simulation=0       # If set to 1, overclocking is disabled and only log output is produced, for debugging.
    retriesGPU0=0
    retriesGPU1=0
    retriesGPU2=0
    alertFailProcessGPU0=0
    alertFailProcessGPU1=0
    alertFailProcessGPU2=0

    # ---------------------------------------------------------------------
    # Function Debug: It outputs messages to the console only in debug mode
    # Parameters: Text Message to be displayed
    # -----------------------------------------------------------------
    function debug(){
    if (test $debug -eq 1)
           then
        echo -e "[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $@"
    fi
    }

    # ---------------------------------------------------------------------
    # Function output: This function output messages to the console
    # $1: Message
    # $2: if $2=1 the message is sent by email
    # ---------------------------------------------------------------------
    function output(){
    mensaje="[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $1"
    echo -e "$mensaje"
    if test $2 -eq 1
    then
    echo -e "$mensaje" | mail -s "$subject" $mail1
    echo -e "$mensaje" | mail -s "$subject" $mail2
    echo -e "$mensaje" | mail -s "$subject" $mail3
    fi
    }

    # ----------------------------------------------------------------
    # Function FANCommand: This function sets the FAN speed of a GPU
    # Params: $1:num_gpu: $GPU0,$GPU1,$GPU2
    #         $2:FAN_SPEED: Value from 0 to 100 %, e.g.: 100
    # Use:  FANCommand $GPU0 100
    # ----------------------------------------------------------------
    function FANCommand(){
    case $1 in
      0)
    DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
      1)
    DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
      2)
    DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
    esac
    output "  New setting: FAN GPU$1 to $2 %" 0
    }

    # ----------------------------------------------------------------
    # Function overclock: This function sets the clk of a GPU
    # Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
    #         $2:clkfreq : core clk value to be set, e.g.: 850
    #         $3:memfreq : mem clk to be set, e.g.: 300
    # Use:  overclock 0 850 1200
    # ----------------------------------------------------------------
    function overclock(){
    if test $simulation -eq 0
    then
    case $1 in
       0)
    aticonfig --adapter=0 --odsc=$2,$3 >/dev/null;;
       1)
    aticonfig --adapter=1 --odsc=$2,$3 >/dev/null;;
       2)
    aticonfig --adapter=2 --odsc=$2,$3 >/dev/null;;
    esac
    fi
    output "  New setting: Overclock GPU$1 to $2 / $3 Mhz" 0
    }

    # ----------------------------------------------------------------
    # Function controlFAN: Calculates the FAN speed depending on the temperatures and current FAN speed (num_gpu, currentTemp, currentFAN)
    #   It performs the FAN speed control of a GPU
    # Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
    #         $2:currentTemp: Current value reported by the GPU (no decimals), e.g.: 56
    #         $3:currentFAN:  Current FAN speed for this GPU, e.g.: 75 %
    #
    # Use:  controlFAN $GPU0 56 75
    # ----------------------------------------------------------------
    function controlFAN(){
    # Default: keep the current speed. Inside a hysteresis band no new value is assigned below,
    # so without this the value left over from the previous call (possibly another GPU) would be used.
    controlFAN=$3

    #hysteresis of 2ºC around temperature threshold 55º
    if test $2 -lt 54
    then
    controlFAN=30
    elif test $2 -lt 56
    then
    if test $3 -ne 45
    then
    controlFAN=30
    fi
    #hysteresis of 2ºC around temperature threshold 60º
    elif test $2 -lt 59
    then
    controlFAN=45
    elif test $2 -lt 61
    then
    if test $3 -ne 60
    then
    controlFAN=45
    fi
    #hysteresis of 2ºC around temperature threshold 65º
    elif test $2 -lt 64
    then
    controlFAN=60
    elif test $2 -lt 66
    then
    if test $3 -ne 75
    then
    controlFAN=60
    fi
    #hysteresis of 2ºC around temperature threshold 70º
    elif test $2 -lt 69
    then
    controlFAN=75
    elif test $2 -lt 71
    then
    if test $3 -ne 90
    then
    controlFAN=75
    fi
    else
    controlFAN=90
      fi

    # It sends the FAN speed command only if the new setting is different to the current one.
    case $1 in
        0)
        debug "FAN control GPU0: Current FAN=$FANGPU0, controlFAN:$controlFAN"
      if test $controlFAN -ne $FANGPU0
                    then
      FANCommand 0 $controlFAN
    FANGPU0=$controlFAN
      fi;;
        1)
        debug "FAN control GPU1: Current FAN=$FANGPU1, controlFAN:$controlFAN"
      if test $controlFAN -ne $FANGPU1
                    then
      FANCommand 1 $controlFAN
    FANGPU1=$controlFAN
      fi;;
        2)
        debug "FAN control GPU2: Current FAN=$FANGPU2, controlFAN:$controlFAN"
      if test $controlFAN -ne $FANGPU2
                    then
      FANCommand 2 $controlFAN
    FANGPU2=$controlFAN
      fi;;
    esac
    }

    # ----------------------------------------------------------------
    # Function controlTemp: Calculates the GPU clock correction to be performed depending on the GPU temperatures (num_gpu, currentTemp, targetTemp)
    #    It performs the overclocking/downclocking of the GPU
    # Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
    #         $2:currentTemp: Current temperature reported by the GPU (no decimals), e.g.: 56
    #         $3:TargetTemp: Target temperature desired in this GPU as maximum, e.g.: 78
    # Outputs:
    #    counterLastCLK0, counterLastCLK1 and counterLastCLK2: Time information of the last CLK correction on each GPU.
    #
    # Use:    controlTemp $GPU0 56 78
    # ----------------------------------------------------------------
    function controlTemp(){
    offsetCLK=$(expr $3 - $2)

    # temperature gap defined in 'control_gap' is guaranteed to avoid causing stress to the GPU when the current temperature is very near to the target temperature
    if (test $offsetCLK -gt 0) && (test $offsetCLK -lt $control_gap)
    then
    debug "The correction ($offsetCLK) does not exceed the control GAP ($control_gap). CLK is maintained."
    return 1
    fi

    # Demanded frequencies are limited to the specific clk ranges of each GPU
    case $1 in
        0)
      demandaCLK=$(expr $coreCLK0 + $offsetCLK)
      if test $demandaCLK -gt $corefreq_max0
                    then
    demandaCLK=$corefreq_max0
               fi;;
        1)
      demandaCLK=$(expr $coreCLK1 + $offsetCLK)
      if test $demandaCLK -gt $corefreq_max1
      then
    demandaCLK=$corefreq_max1
               fi;;

        2)
      demandaCLK=$(expr $coreCLK2 + $offsetCLK)
      if test $demandaCLK -gt $corefreq_max2
                    then
    demandaCLK=$corefreq_max2
      fi;;
    esac

    if test $demandaCLK -lt $corefreq_min
           then
    demandaCLK=$corefreq_min
    fi

    debug "*** GPU$1 --> CurrentTemp:$2 - Target:$3 - Control:$demandaCLK ($offsetCLK)"

    # Sending of overclock command, only if there is a change.
    case $1 in
        0)
      if test $demandaCLK -ne $coreCLK0
                    then
      overclock 0 $demandaCLK $mem_freq
    counterLastCLK0=$timeCounter
    debug "Time contCLK0: $counterLastCLK0"
      else
      debug "GPU0: Limit is already reached: $corefreq_max0."
      counterLastCLK0=$timeCounter
      fi;;
        1)
      if test $demandaCLK -ne $coreCLK1
                    then
    overclock 1 $demandaCLK $mem_freq
    counterLastCLK1=$timeCounter
    debug "Time contCLK1: $counterLastCLK1"
      else
      debug "GPU1: Limit is already reached: $corefreq_max1."
      counterLastCLK1=$timeCounter
      fi;;
        2)
      if test $demandaCLK -ne $coreCLK2
                    then
    overclock 2 $demandaCLK $mem_freq
    counterLastCLK2=$timeCounter
    debug "Time contCLK2: $counterLastCLK2"
      else
      debug "GPU2: Limit is already reached: $corefreq_max2."
      counterLastCLK2=$timeCounter
      fi;;
    esac
    }


    # ----------------------------------------------------------------
    # Function checkOverclockTimeGuard: This function ensures a certain period of time between consecutive overclock commands.
    # - Guard Time between overclocks: $overclock_delay*$control_time (180*5 = 15 minutes)
    # - Guard Time between downclocks: $downclock_delay*$control_time (60*5 = 5 minutes)
    # - Guard time between urgent downclocks: $downclock_urgent*$control_time (24*5 = 2 minutes)
    # Params: $1:num_gpu : 0,1,2
    #         $2:timeCounter: Current value of the time counter
    #         $3:up_down: 0:overclock, 1:downclock, 2:urgent downclock
    # Outputs:
    #    $return_correction: 0:Not to perform CLK correction. 1:CLK correction can be performed now.
    # ----------------------------------------------------------------
    function checkOverclockTimeGuard (){
    return_correction=0
    case $1 in
    0)
    if test $3 -eq 0
                         then  ##overclock
    due_time=$(expr $counterLastCLK0 + $overclock_delay)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    elif test $3 -eq 1
                         then  ##normal downclocking
    due_time=$(expr $counterLastCLK0 + $downclock_delay)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi

    elif test $3 -eq 2
                         then  ##urgent downclocking
    due_time=$(expr $counterLastCLK0 + $downclock_urgent)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    fi;;
           1)
    if test $3 -eq 0
                         then  ##overclocking
    due_time=$(expr $counterLastCLK1 + $overclock_delay)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    elif  test $3 -eq 1
                         then  ##normal downclocking
    due_time=$(expr $counterLastCLK1 + $downclock_delay)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    elif test $3 -eq 2
                         then  ##urgent downclocking
    due_time=$(expr $counterLastCLK1 + $downclock_urgent)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    fi;;
           2)
    if test $3 -eq 0
                         then  ##overclocking
    due_time=$(expr $counterLastCLK2 + $overclock_delay)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    elif test $3 -eq 1
                         then  ##normal downclocking
    due_time=$(expr $counterLastCLK2 + $downclock_delay)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    elif test $3 -eq 2
                         then  ##urgent downclocking
    due_time=$(expr $counterLastCLK2 + $downclock_urgent)
    if test $2 -ge $due_time
                                then
    return_correction=1
    fi
    fi;;
    esac
    debug "-------GPU$1: due_time: $due_time, correction: $return_correction"
    }

    # ---------------------------------------------------------------------------------------------------------
    # MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN
    # ---------------------------------------------------------------------------------------------------------
    #Enabling overclocking
    aticonfig --adapter=0 --od-enable   
    aticonfig --adapter=1 --od-enable   
    aticonfig --adapter=2 --od-enable   
    overclock 0 $init_coreCLK0 $mem_freq 
    overclock 1 $init_coreCLK1 $mem_freq 
    overclock 2 $init_coreCLK2 $mem_freq   
    FANCommand 0 30
    FANCommand 1 30
    FANCommand 2 30
    output "Automatic control algorithm has been started" 1  #This is sent by email.
    output "Automatic FAN speed control starts from 30%" 0
    while true; do

    #Fetching of current temperatures
    tempGPU0=$(aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
    tempGPU1=$(aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
    tempGPU2=$(aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
    tempCPU=$(sensors |grep CPU |grep Temperature | awk '{print $3}'|cut -c2-3)
    tempMB=$(sensors |grep NB |grep Temperature | awk '{print $3}'|cut -c2-3)

    #Fetching of current GPU CLK frequencies
    coreCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $4}')
    memCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $5}')
    coreCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $4}')
    memCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $5}')
    coreCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $4}')
    memCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $5}')

    #It detects if there are mining processes already running
    miningIsActive0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l)   # we look for the 'screen' session lock file
    miningIsActive1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)   # we look for the 'screen' session lock file
    miningIsActive2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)   # we look for the 'screen' session lock file

    loadGPU0=$(aticonfig --adapter=0 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)  # The load of each GPU is also checked
    loadGPU1=$(aticonfig --adapter=1 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
    loadGPU2=$(aticonfig --adapter=2 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
    debug " --> Temps: $tempGPU0, $tempGPU1, $tempGPU2"
    debug " --> Clks: $coreCLK0, $coreCLK1, $coreCLK2"

    # --------------------------------Temperature of CPU and Motherboard ---------------------------------------
    if (test $tempCPU -gt $tempCPU_halt) || (test $tempMB -gt $tempCPU_halt)
            then
    output "ERR: Temperature of CPU/MB is too high! $tempCPU / $tempMB.... \nSWITCHING OFF THE SYSTEM. \n Check the CPU FAN condition and switch the RIG on manually." 1
    /usr/bin/halt
    fi

    # Checking of zombie mining processes.
    num_defunc=$(ps -Al |grep py|grep defunc| wc -l) 
    if test $num_defunc -gt 0
          then
    if test $reboot -eq 1
    then
        output "### ERR: There are one or more zombie mining processes:
    \nMaybe a mining process is hung and blocked.
    \nIt is necessary to restart the system to recover the mining (sudo reboot).
    \n$(ps -Al |grep py|grep defunc| wc -l)
    \n --> PERFORMING AN AUTOMATIC REBOOT OF THE SYSTEM...." 1    # Sent by email
    /usr/bin/reboot
    else
    output "### ERR: There are one or more zombie mining processes:
    \nMaybe a mining process is hung and blocked.
    \nIt is necessary to restart the system to recover the mining (sudo reboot).
    \n$(ps -Al |grep py|grep defunc| wc -l)" 1                    # Sent by email
    fi
    fi

    # --------------------------------Init checkings -------------------------------
    if (test $miningIsActive0 -eq 0)
            then
    if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1)
                  then
    if test $retriesGPU0 -lt $numRetries
                         then
    output "### ERR the mining process in GPU0 is not started ......" 0
    output "*** Starting mining on GPU0..." 0
    overclock 0 $init_coreCLK0 $mem_freq 
    coreCLK0=$init_coreCLK0
    /usr/bin/screen -admS gpu0 ./gpu0.sh
    retriesGPU0=$(expr $retriesGPU0 + 1)
    elif test $alertFailProcessGPU0 -eq 0
    then
    output "### ERR Mining retries limit has been reached in the process GPU0.sh
    \n*** Check that the process is not zombie and start it manually " 1    # Sent by email
    alertFailProcessGPU0=1
    fi
    fi
    fi

    if (test $miningIsActive1 -eq 0)
            then
    if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1)
                  then
    if test $retriesGPU1 -lt $numRetries
                         then
    output "### ERR the mining process in GPU1 is not started ......" 0
    output "*** Starting mining on GPU1..." 0
    overclock 1 $init_coreCLK1 $mem_freq 
    coreCLK1=$init_coreCLK1
    /usr/bin/screen -admS gpu1 ./gpu1.sh
    retriesGPU1=$(expr $retriesGPU1 + 1)
    elif test $alertFailProcessGPU1 -eq 0       
    then
    output "### ERR Mining retries limit has been reached in the process GPU1.sh
    \n*** Check that the process is not zombie and start it manually " 1    # Sent by email
          alertFailProcessGPU1=1
    fi
    fi
    fi


    if (test $miningIsActive2 -eq 0)
            then
    if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1)
                  then
    if test $retriesGPU2 -lt $numRetries
                         then
    output "### ERR the mining process in GPU2 is not started ......" 0
    output "*** Starting mining on GPU2..." 0
    overclock 2 $init_coreCLK2 $mem_freq 
    coreCLK2=$init_coreCLK2
    /usr/bin/screen -admS gpu2 ./gpu2.sh
    retriesGPU2=$(expr $retriesGPU2 + 1)
    elif test $alertFailProcessGPU2 -eq 0
    then
    output "### ERR Mining retries limit has been reached in the process GPU2.sh
    \n*** Check that the process is not zombie and start it manually " 1    # Sent by email
    alertFailProcessGPU2=1   
    fi
    fi
    fi


    # --------------------------------Automatic switching off control -----------------------------
    if (test $tempGPU0 -gt $maxtemp_stop) || (test $tempGPU1 -gt $maxtemp_stop) || (test $tempGPU2 -gt $maxtemp_stop)
    then
    if (test $mining_stopped -eq 0)
    then

    if test $retryMiningAfterFailure -eq 1
    then
    output "ERR: Extreme temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining...
           \n After some minutes it will be started again ..." 1   # Sent by email
    else
    output "ERR: Extreme Temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining...
           \n Start the mining process manually ..." 1 # Sent by email
    fi
    ./stop.sh
    mining_stopped=1
    fi

    else  # As soon as the GPUs temperatures are below temp_recover, mining is started again.  
        if (test $mining_stopped -eq 1) && (test $retryMiningAfterFailure -eq 1) 
        then
    if (test $tempGPU0 -lt $temp_recover) && (test $tempGPU1 -lt $temp_recover) && (test $tempGPU2 -lt $temp_recover)
    then
    # Set safe GPU clock values
    output "The temperature of the GPUs has recovered to $tempGPU0 / $tempGPU1 / $tempGPU2" 0
    output "GPU clocks are set to $init_coreCLK0 / $init_coreCLK1 / $init_coreCLK2 MHz (core) and $mem_freq MHz (memory)." 0
    overclock 0 $init_coreCLK0 $mem_freq 
    overclock 1 $init_coreCLK1 $mem_freq 
    overclock 2 $init_coreCLK2 $mem_freq 
    coreCLK0=$init_coreCLK0
    coreCLK1=$init_coreCLK1
    coreCLK2=$init_coreCLK2
    retriesGPU0=0
    retriesGPU1=0
    retriesGPU2=0
    output " --> Starting mining." 0
    ./minar.sh
    mining_stopped=0
    fi
        fi
    fi

    #------------------------------  Overclocking control on GPU0 ----------------------------------

    if (test $mining_stopped -eq 0) && (test $miningIsActive0 -eq 1)
    then
    #The temperature is within the control margins, below target temp
    if (test $tempGPU0 -lt $target_temp)
              then   
    checkOverclockTimeGuard $GPU0 $timeCounter 0
    if test $return_correction -eq 1
                  then
    controlTemp $GPU0 $tempGPU0 $target_temp
    fi

    #The temperature is outside the control margins, below the alarm temp
    elif (test $tempGPU0 -lt $hightemp_alarm)
            then
    checkOverclockTimeGuard $GPU0 $timeCounter 1 #downclocking
    if test $return_correction -eq 1
                  then
    controlTemp  $GPU0 $tempGPU0 $target_temp
    fi

    # Overtemp alarm
    elif (test $tempGPU0 -lt $maxtemp_stop)
            then
    output "Alarm! GPU0 very hot, temperature: $tempGPU0. Performing urgent downclocking ...." 1    #Sent by email
    checkOverclockTimeGuard $GPU0 $timeCounter 2 #urgent downclocking
    if test $return_correction -eq 1
                  then
    controlTemp  $GPU0 $tempGPU0 $target_temp
    fi
    fi
    # FAN Speed control
    controlFAN $GPU0 $tempGPU0 $FANGPU0
    fi

    #------------------------------  Overclocking control on GPU1 ----------------------------------

    if (test $mining_stopped -eq 0) && (test $miningIsActive1 -eq 1)
    then
        #The temperature is within the control margins, below target temp
        if (test $tempGPU1 -lt $target_temp) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)
        then
            checkOverclockTimeGuard $GPU1 $timeCounter 0
            if test $return_correction -eq 1
            then
                controlTemp $GPU1 $tempGPU1 $target_temp
            fi

        #The temperature is outside the control margins, below the alarm temp
        elif (test $tempGPU1 -lt $hightemp_alarm) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)
        then
            checkOverclockTimeGuard $GPU1 $timeCounter 1 #downclocking
            if test $return_correction -eq 1
            then
                controlTemp $GPU1 $tempGPU1 $target_temp
            fi

        # Overtemp alarm
        elif (test $tempGPU1 -lt $maxtemp_stop) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)
        then
            output "Alarm! GPU1 very hot, temperature: $tempGPU1. Performing urgent downclocking ...." 1    #Sent by email
            checkOverclockTimeGuard $GPU1 $timeCounter 2 #urgent downclocking
            if test $return_correction -eq 1
            then
                controlTemp $GPU1 $tempGPU1 $target_temp
            fi
        fi
        # FAN Speed control
        controlFAN $GPU1 $tempGPU1 $FANGPU1
    fi
    #------------------------------  Overclocking control on GPU2 ----------------------------------
    if (test $mining_stopped -eq 0) && (test $miningIsActive2 -eq 1)
    then
        #The temperature is within the control margins, below target temp
        if (test $tempGPU2 -lt $target_temp) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0)
        then
            checkOverclockTimeGuard $GPU2 $timeCounter 0
            if test $return_correction -eq 1
            then
                controlTemp $GPU2 $tempGPU2 $target_temp
            fi

        #The temperature is outside the control margins, below the alarm temp
        elif (test $tempGPU2 -lt $hightemp_alarm) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0)
        then
            checkOverclockTimeGuard $GPU2 $timeCounter 1 #downclocking
            if test $return_correction -eq 1
            then
                controlTemp $GPU2 $tempGPU2 $target_temp
            fi

        # Overtemp alarm
        elif (test $tempGPU2 -lt $maxtemp_stop) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0)
        then
            output "Alarm! GPU2 very hot, temperature: $tempGPU2. Performing urgent downclocking ...." 1    #Sent by email
            checkOverclockTimeGuard $GPU2 $timeCounter 2 #urgent downclocking
            if test $return_correction -eq 1
            then
                controlTemp $GPU2 $tempGPU2 $target_temp
            fi
        fi
        # FAN Speed control
        controlFAN $GPU2 $tempGPU2 $FANGPU2
    fi
    timeCounter=$(expr $timeCounter + 1)
    sleep $control_time;
    done

    A brief description of the control script:
    • You can change the minimum/maximum clock settings for each of your GPUs. I identified the limits manually, by making the system hang several times.
    • I also noticed that, when playing with the limits, a mining process sometimes became a zombie (as shown by ps) before the system hung. In that situation I was not able to recover the process, not even by killing the parent process; the only way out was to restart the system. For this reason the control algorithm restarts the system when it finds zombie processes.
    • You can play with all the constants to tune the script to your own system. Almost everything is configurable (number of retries, mail sending, debugging logs, halt/reboot commands, etc.).
    • Replace the email addresses with your own.
    • 1. The algorithm first obtains the GPU temperatures, the CPU temperature and the current clock settings, checks whether the mining processes are active, etc.
    • 2. If the CPU temperature is very high (70ºC), the script switches off the system and reports it by email. (This protects the hardware from CPU overtemperature.)
    • 3. It checks whether there are zombie processes (as discussed above). If so, the script can reboot the system and report it by email (depending on whether the constant reboot=1).
    • 4. It checks whether any of the GPUs is not mining. If so, it retries the mining by starting the script gpux.sh. There is a limit of 5 retries; reaching it is also reported by email.
    • 5. If a GPU reaches a very high temperature (83ºC), it stops all mining processes. Once the temperature has recovered, it restarts the mining.
    • 6. For each GPU, the script performs automatic control of the GPU clock by overclocking, downclocking and urgent downclocking when needed.
    • 7. For each GPU, the script performs automatic control of the fan speed.
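
The three temperature bands behind step 6 can be summarized in a small helper. This is only a sketch: the function name `classify_temp` is made up for illustration, and the thresholds 72 / 78 / 83 in the usage lines are assumed values (the real script reads them from `$target_temp`, `$hightemp_alarm` and `$maxtemp_stop`).

```shell
#!/bin/bash
# Hypothetical helper (not part of control.sh): maps an integer temperature
# to the action the control loop would take for that GPU.
# Arguments: temp target_temp hightemp_alarm maxtemp_stop (integers, Celsius)
classify_temp() {
    local temp=$1 target=$2 alarm=$3 stop=$4
    if [ "$temp" -lt "$target" ]; then
        echo "overclock"            # below target: room to raise the clock
    elif [ "$temp" -lt "$alarm" ]; then
        echo "downclock"            # above target, still below the alarm level
    elif [ "$temp" -lt "$stop" ]; then
        echo "urgent_downclock"     # alarm band: email plus urgent downclocking
    else
        echo "stop_mining"          # at or above the stop limit
    fi
}

# Example with assumed thresholds 72 / 78 / 83
classify_temp 57 72 78 83
classify_temp 80 72 78 83
```

Note that `test ... -lt` only works with integers, so a reading such as 57.50 has to be truncated before it is compared.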

    As I did for the gpux.sh scripts, I like to launch the control script from another script, piping its output to the tee command in order to store the logs in a file. That way we can easily read them
    from the monitor script:

    start_control.sh
    Code:
    #!/bin/bash
    cd /home/your_user/scripts
    ./control.sh | tee control.log

    Now, let's go to the monitor script:

    monitor.sh
    Code:
    #!/bin/bash
    while true; do
            echo "---------------- GPUs Health ----------------"
            aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print "GPU0 Temperature: " $5}' ;
            aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print "GPU1 Temperature: " $5}' ;
            aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print "GPU2 Temperature: " $5}' ;
            echo $(aticonfig --adapter=0 --odgc | grep GPU);
            echo $(aticonfig --adapter=1 --odgc | grep GPU);
    echo $(aticonfig --adapter=2 --odgc | grep GPU);
            echo "GPU FANS: $(DISPLAY=:0.0 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.1 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.2 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4)"
            echo "Overclocking...."
            echo "-   Core Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 18) Mhz."
            echo "-   Mem Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 29) Mhz."
    #echo " "
    echo ----------------  PC health -------------------
            echo $(sensors |grep CPU |grep Temperature) | cut -d ' ' -f 1,2,3
            echo $(sensors |grep NB |grep Temperature) | cut -d ' ' -f 1,2,3
            echo $(sensors |grep SB |grep Temperature) | cut -d ' ' -f 1,2,3
    echo "HDD Avail: $(df -h |grep sda1 |cut -d ' ' -f 20)"
            #echo " "
    echo "---------------- Mining rate ------------------"
            # Check if there are Screen lock files....
            IsMining_gpu0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l)
            IsMining_gpu1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)
            IsMining_gpu2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)
            # Last Hashrate report
    if test $IsMining_gpu0 -ge 1
    then
    echo "Mining on GPU0: " $(cat mining_gpu0.log | cut -d '[' -f 2)
    else
    echo "Mining on GPU0:  ERR! Mining Process is stopped!"
    fi

            if test $IsMining_gpu1 -ge 1
            then
    echo "Mining on GPU1: " $(cat mining_gpu1.log | cut  -d '[' -f 2)
            else
                  echo "Mining on GPU1:  ERR! Mining process is stopped!"
            fi
            if test $IsMining_gpu2 -ge 1
            then
    echo "Mining on GPU2: " $(cat mining_gpu2.log | cut -d '[' -f 2)
            else
                  echo "Mining on GPU2:  ERR! Mining process is stopped!"
            fi
    #Erase mining logs... next loop we will find only the hashrate.
    echo "" > mining_gpu0.log
    echo "" > mining_gpu1.log
    echo "" > mining_gpu2.log
            controlIsActive=$(ls /var/run/screen/S-your_user/ |grep control |wc -l)
            #echo " "
    echo "---------- Logs Mining Controller -------------"
    if test $controlIsActive -ge 1
            then
                    tail -5 control.log
            else
                    echo "Control algorithm is OFF"
            fi
      sleep 5;
            clear
    done


    A few notes for the monitor script:
    • Note that the GPU logs are erased on each loop. The time between loops is 5 seconds, the same as the display rate of poclbm.py. This way, the logs will always contain a single hashrate report, and they will not grow forever.
    • The monitor script also displays the last 5 lines of the control script logs, stored in the file control.log.
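
The "read the last report, then empty the log" idea can be sketched as a single helper. The function name `last_hashrate` and the sample log line are assumptions for illustration; the exact poclbm.py output on your system may look different.

```shell
#!/bin/bash
# Hypothetical helper: print what follows the last '[' on the last non-empty
# line of a miner log, then truncate the log file.
last_hashrate() {
    local logfile=$1
    # Drop blank lines, keep the most recent line, strip everything up to '['
    grep -v '^[[:space:]]*$' "$logfile" | tail -n 1 | sed 's/.*\[//'
    : > "$logfile"    # empty the file without writing the blank line that echo "" leaves
}

# Example with an assumed log line
log=$(mktemp)
echo "host:8332 [402.743 MH/s (~458 MH/s)]" > "$log"
echo "Mining on GPU0: $(last_hashrate "$log")"
rm -f "$log"
```

As in monitor.sh, the closing bracket stays in the output because only the text before `[` is cut away.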

    As before, I launch the monitor.sh script from start_monitor.sh:

    start_monitor.sh
    Code:
    #!/bin/bash
    cd /home/your_user/scripts
    monitorIsRunning=$(ls /var/run/screen/S-your_user/ |grep monitor |wc -l)
    if test $monitorIsRunning -ge 1
    then
    echo "Monitor script is already running in another Screen. Getting attached..."
    screen -x monitor
    else
    echo "Monitor script is not running. Starting..."
    /usr/bin/screen -admS monitor ./monitor.sh
    fi

    As you can see, this script works both for starting the monitor and for attaching to the screen in which the monitor is already running.
    I created a symbolic link to this file called "m" (see the ln command). Since then, all I need to do to monitor my rig is enter "m" in the console. (This is very comfortable when accessing the rig from my mobile.)
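
A sketch of how such a shortcut could be created. The helper name `make_shortcut` and the `$HOME/bin` destination in the usage comment are assumptions; any directory in your PATH will do.

```shell
#!/bin/bash
# Hypothetical helper: create (or replace) a symbolic link named "m" inside
# link_dir that points to the given script.
make_shortcut() {
    local target=$1 link_dir=$2
    mkdir -p "$link_dir"
    ln -sf "$target" "$link_dir/m"
}

# Usage (assumed locations, adjust to your own user and PATH):
#   make_shortcut /home/your_user/scripts/start_monitor.sh "$HOME/bin"
```

The target script must be executable (chmod +x) for "m" to run it directly.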


    This is the output of the monitor script (updated every 5 seconds):


    ---------------- GPUs Health ----------------
    GPU0 Temperature: 57.50
    GPU1 Temperature: 54.50
    GPU2 Temperature: 57.50
    GPU load : 98%
    GPU load : 97%
    GPU load : 98%
    GPU FANS: 45% / 45% / 45%
    Overclocking....
    -   Core Clocks: 945 / 955 / 1020 Mhz.
    -   Mem Clocks: 300 / 300 / 300 Mhz.
    ---------------- PC health -------------------
    CPU Temperature: +31.0°C
    NB Temperature: +43.0°C
    SB Temperature: +31.0°C
    HDD Avail: 2GB
    ---------------- Mining rate ------------------
    Mining on GPU0:  402.743 MH/s (~458 MH/s)]
    Mining on GPU1:  405.513 MH/s (~568 MH/s)]
    Mining on GPU2:  435.513 MH/s (~598 MH/s)]
    ---------- Logs Mining Controller -------------
    [Time: 18616 | 3ene15:50:30] New setting: FAN GPU0 to 30 %
    [Time: 18617 | 3ene15:50:35] New setting: FAN GPU0 to 45 %
    [Time: 18618 | 3ene15:50:40] New setting: FAN GPU1 to 45 %
    [Time: 18783 | 3ene16:05:29] New setting: FAN GPU1 to 30 %
    [Time: 18787 | 3ene16:05:51] New setting: FAN GPU1 to 45 %


    Now we can complete the start.sh script by adding the control and monitor scripts:


    start.sh
    Code:
    #!/bin/bash

    cd /home/your_path/scripts
    echo Starting mining scripts...
    /usr/bin/screen -admS gpu0 ./gpu0.sh
    /usr/bin/screen -admS gpu1 ./gpu1.sh
    /usr/bin/screen -admS gpu2 ./gpu2.sh
    echo Starting monitor script...
    /usr/bin/screen -admS monitor ./monitor.sh
    echo Starting automatic control script...
    /usr/bin/screen -admS control ./start_control.sh
    echo " "
    echo For monitoring the RIG, enter m. 

    And of course, we will need a stop.sh script for stopping all the mining scripts, monitor and control scripts:

    stop.sh
    Code:
    #!/bin/bash
    screen -X -S gpu0 kill
    screen -X -S gpu1 kill
    screen -X -S gpu2 kill
    screen -X -S monitor kill
    screen -X -S control kill
    killall screen

    That's all folks!!

    I hope you liked this post, and that it will be useful for your mining systems!!



    If you liked this post and want to send me a donation, I will be very grateful; it will give me energy for sharing other work.
    BTC Address:  1NKJuhGCx7HM2skXdzAkfnxJyfsubh475A

