Author

Topic: [ mining os ] nvoc - page 297. (Read 418549 times)

newbie
Activity: 5
Merit: 0
July 12, 2017, 04:25:53 AM
Ubuntu Server 16.04
driver nvidia_381

does not work
X server nvidia-settings

sudo nvidia-settings
Failed to connect to Mir: Failed to connect to server socket
Unable to init server: Could not connect

ERROR: The control display is undefined; please run `nvidia-settings --help`
for usage information.

Reinstall does not help

Is this with nvOC or vanilla Ubuntu?



Ubuntu Server 16.04 without add-ons
Clean OS


What hardware are you using?

Do you have an AMD GPU attached as well?

If you are using a server, does it have an AMD component inside it?

AMD component no
driver nvidia 381
gtx 1080 ti
legendary
Activity: 1834
Merit: 1080
---- winter*juvia -----
July 12, 2017, 04:22:36 AM
fullzero - I just got the new TB250 12 x PCI slot mobo, testing 12 x Zotac Minis 1070s and Zotac 12 x 1060 Mining cards.
I see the new bash file now goes up to 15 cards so it should run.
Will post results and pix in a few days. Thanks
newbie
Activity: 6
Merit: 0
July 12, 2017, 04:06:46 AM
I'm using nvOC and really love it; my rig runs stable and it's easy to configure. Thank you very much, fullzero!

When I have a good income, I'll donate to you.
sr. member
Activity: 353
Merit: 251
July 12, 2017, 02:48:47 AM
#On one of the longest running rigs I got a message that I'm low on space . Around 600mb . I have a SSD on that RIG .
Can the logs generated be that big ?
Claymore's logs are huge.

Quote
#Where can I find the claymore logs ? I can't see it in the 9.5 folder like on win
AFAIR, they are disabled with '-dbg -1' option. Change this if you need logs.

Quote
#In genoil ... how can I see the uptime of the rig ?
Rig or miner? Have no experience with genoil, but 'uptime' shell command can give you the rig uptime.
sr. member
Activity: 353
Merit: 251
July 12, 2017, 02:44:37 AM
I understood that you meant to add oneBash only for github.  Look at the number of changes in this oneBash alone; and then consider how much more there would be to look at with input from even a few members.  Until most of the features members want have a basic implementation; this is essentially only going to give me another thing I need to pay attention to.

The Git suggestion was only meant to simplify the implementation of all changes and make it easier to track what has changed.
Using the script as a learning example is the point, but I also think that using bash loops is a better example than copy/paste style.
Finally, oneBash is good, but I would split it into separate parts like overclocking, miner selection, watchdog, monero mining, etc. to make it easier to tune/restart some parts.

In any case, I really appreciate your efforts. I did not use nvidia GPU till now, and your distro was the best way to start using nvidia with Linux. Thank you for working on it and sharing.
newbie
Activity: 35
Merit: 0
July 12, 2017, 01:56:54 AM
#On one of the longest running rigs I got a message that I'm low on space . Around 600mb . I have a SSD on that RIG .
Can the logs generated be that big ?

#Where can I find the claymore logs ? I can't see it in the 9.5 folder like on win

#I have 2 rigs that report 0 hash rate on ethermine and nano but are working - Both on Genoil...right address... . Anybody knows why is that ?

#In genoil ... how can I see the uptime of the rig ?

Thanks!
full member
Activity: 136
Merit: 100
July 12, 2017, 01:09:02 AM
I tried mining ZEC on slushpool. Nice site, but getting the workers to populate correctly with nvOC was frustrating, and the returns are not as high or as consistent as on other pools. However, I did earn .33 zec when I normally earn .18 zec on their pool with 6.5k sols

What pools are you all using to mine ZEC currently

i was on nanopool but my payouts were lower than estimated by calculators.  I've been on flypool for 9 days straight now, and it's been spot on.
newbie
Activity: 44
Merit: 0
July 11, 2017, 06:39:43 PM
[...snip...]
I can add https://github.com/fireice-uk/xmr-stak-cpu

I will add it to the list once I can update the OP again.

The easiest way to do this would be to take a copy of oneBash; rename it, remove everything but the OC settings and implementation.  Then you can run that renamed bashfile whenever to change clocks.



Actually xmr-stak-cpu is a great idea! It's super stable - I added it a while ago to one image of nvOC where I'm also CPU mining, and it plays really nice with everything else on the distro.
While at it I would suggest adding xmr-stak-nvidia. I meant to test it when I had some cycles to spare, but since xmr-stak-cpu is on the radar, maybe we can also add its nvidia cousin?


I will add the GPU variant as well.


I keep getting a database error when I try to update the OP:  Angry


plusWATCHDOG_oneBash + additional files (includes newest SRR,  switch_v3, reboot, AutoTEMP, Watchdog, Claymore 9.7) Link


I integrated a slightly modified IAmNotAJeep_and_Maxximus007_WATCHDOG, fixed the typo in Maxximus007_AUTO_TEMPERATURE_CONTROL.


saflter your newest version of switch was causing problems when run with a monitor connected (LOCAL); I would recommend relying on the:

IAmNotAJeep_and_Maxximus007_WATCHDOG

to handle miner crashes / 0 hashrates. 

I spent a couple hours testing this, and it is very effective; it is worth noting that it currently only works when the mining process is launched in a screen ( I will make it work for all the clients even when run locally soon: so don't spend a lot of time upgrading rigs with this)

Also, even if your crashes are perfectly handled: if your OC is so high it crashes every 7 minutes or less, you are losing more time restarting the mining process than you are gaining with a slightly higher hashrate.

Use reasonable OC.  Smiley

Please provide me with:

# IAmNotAJeep BTC address: 

# Maxximus007 BTC address: 

# _Parallax_ BTC address: 



It's great to see everyone get involved and speaking for myself, to feel like it's OK to contribute once in a while as well.
Hats off to fullzero, Maxximus007 and _Parallax_!

Here you go:
# IAmNotAJeep BTC address:  <13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7>

I am reading the discussion between you and Maxximus007 now.


An unintended consequence of the watchdog script will be that it will keep rebooting the miners if there is no internet connection.
Guess how I know that one lol
newbie
Activity: 39
Merit: 0
July 11, 2017, 06:18:30 PM
Having some issues with the v0017 setup. For some reason the worker isn't showing up in my pool (just using nanopool for the time being); it's only showing my one rig. I made sure the wallet is right and all that, and they have very different worker names. Any thoughts? The other version I'm using is v0015.
newbie
Activity: 44
Merit: 0
July 11, 2017, 06:01:00 PM

It's great to see everyone get involved and speaking for myself, to feel like it's OK to contribute once in a while as well.
Hats off to fullzero, Maxximus007 and _Parallax_!

Here you go:
# IAmNotAJeep BTC address:  <13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7>

newbie
Activity: 44
Merit: 0
July 11, 2017, 05:54:35 PM
Actually xmr-stak-cpu is a great idea! It's super stable - I added it a while ago to one image of nvOC where I'm also CPU mining, and it plays really nice with everything else on the distro.
While at it I would suggest adding xmr-stak-nvidia. I meant to test it when I had some cycles to spare, but since xmr-stak-cpu is on the radar, maybe we can also add its nvidia cousin?

 
full member
Activity: 169
Merit: 100
July 11, 2017, 05:19:15 PM
Hi,

I have a few Gigabyte Z270-Gaming K3 mobos. They have Killer E2500 LAN, and nvOC boots without LAN. I assume there is no driver support in this distribution. So can you tell me how to install the LAN driver manually, or can it be included in the next version (18)?

Or any other solution...

Let me know

Best regards

Personally I dislike Killer Ethernet NICs.

I would get one of these or similar for each mobo and never use the Killer NICs.

https://www.amazon.com/Cable-Matters-Ethernet-Network-Adapter/dp/B00ET4KHJ2

Any of the usb 2.0 adapters should be more than enough for a mining rig.

Hi Fullzero...

I have managed to find a solution that makes the existing drivers work for the Killer NIC E2500 (PCI ID 1969:e0b1)

Code:
sudo modprobe alx
echo 1969 e0b1 | sudo tee /sys/bus/pci/drivers/alx/new_id

However, this needs to be done after every reset... is there a way to make it run automatically after the system boots? Sorry for the noob questions. I received 10 of these motherboards and need to get them to work. USB NICs are the last option...

Let me know if you can help me solve this... there are probably other owners of GA mobos with Killer NICs.

Best regards


First, put the commands above into a script. Once you have done this, you can call it from a cron job so it executes on every startup.

Now run "crontab -e"; this will allow you to edit cron.

Now add this to cron:
@reboot /path/to/myscript
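
A rough sketch of what that script could look like, using only the two commands posted above. The helper name write_alx_script and the path /home/m1/killer_nic.sh are made up for illustration. It also assumes the entry goes into root's crontab (sudo crontab -e), so the script itself does not need sudo; if you use your own crontab instead, keep the sudo calls and make sure they work without a password prompt.

```shell
#!/bin/bash
# Sketch only: writes the two commands from the post above into a script file.
# write_alx_script is a hypothetical helper; the target path is up to you.
write_alx_script() {
  local target="$1"
  cat > "$target" <<'EOF'
#!/bin/bash
# Load the alx driver, then tell it to claim PCI ID 1969:e0b1 (Killer E2500).
modprobe alx
echo 1969 e0b1 > /sys/bus/pci/drivers/alx/new_id
EOF
  chmod +x "$target"
}

# Example (not run here):
#   write_alx_script /home/m1/killer_nic.sh
# then in root's crontab (sudo crontab -e):
#   @reboot /home/m1/killer_nic.sh
```

Check after the next reboot that the LAN comes up on its own before rolling this out to all 10 boards.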
newbie
Activity: 6
Merit: 9
July 11, 2017, 05:11:16 PM
Hi Fullzero...

I have managed to find a solution that makes the existing drivers work for the Killer NIC E2500 (PCI ID 1969:e0b1)

Code:
sudo modprobe alx
echo 1969 e0b1 | sudo tee /sys/bus/pci/drivers/alx/new_id

However, this needs to be done after every reset... is there a way to make it run automatically after the system boots? Sorry for the noob questions. I received 10 of these motherboards and need to get them to work. USB NICs are the last option...

Let me know if you can help me solve this... there are probably other owners of GA mobos with Killer NICs.

Best regards
full member
Activity: 122
Merit: 100
July 11, 2017, 02:36:33 PM
Hey guys, any clue what might account for the difference in PCIE utilization? asking because GPU6 is my least stable card that requires the lowest clocks of the bunch and evidently it also has the lowest PCIE utilization.


I had an issue like that, it was because the pcie risers were not giving sufficient power.  I had multiple risers on one power line, and the setup didn't work right.  Once I put them all on their own power lines it worked properly again.  You might be able to get away with 2 on a line, depending on your PSU, but if you have enough lines to put them all on their own i'd do that.
newbie
Activity: 44
Merit: 0
July 11, 2017, 01:15:23 PM
Hey fullzero, i have a question,

without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it, everything gets stuck, SSH barely works, average system load jumps to 14.5!! and Xorg takes up 100% of the CPU, its so bad that none of the standard reboot commands work, they just do nothing, the only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger" so i've set up a script that checks the average system load and if its over 2 it uses the command to reboot, and it works, but i dont like this "solution", yesterday after a reboot nvOC got corrupted somehow, lost my customized oneBash and the whole system became read-only (thankfully i had a oneBash backup that was only a few days behind).

so the question is, what can i do to relieve this Xorg error? i run a 7 card rig and never plan on going for a higher number, what can i do with Xorg that would fix this?

Thanks.
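
For anyone wanting to try the same load-based check, here is a minimal sketch. It assumes the 1-minute load average from /proc/loadavg is the number being watched (the post does not say which one), and load_exceeds is a made-up helper name.

```shell
#!/bin/bash
# Returns 0 when the load average is above the limit.
# $1 = limit, $2 = load value (optional; defaults to the 1-minute
# average from /proc/loadavg, the argument only exists for testing).
load_exceeds() {
  local limit="$1"
  local load="${2:-$(cut -d ' ' -f1 /proc/loadavg)}"
  # awk does the comparison because load averages are not integers
  awk -v l="$load" -v t="$limit" 'BEGIN { exit !(l > t) }'
}

# On a rig (not run here), as root:
#   load_exceeds 2 && echo b > /proc/sysrq-trigger
```

Keep in mind that sysrq "b" reboots immediately without syncing or unmounting disks. Writing "s" (sync) and "u" (remount read-only) to /proc/sysrq-trigger first may lower the risk of the kind of file system damage described above, though that is untested on nvOC.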

@ tempgoga

It seems that whenever a soft crash occurs most of the cards drop to zero, so while the display/keyboard is unresponsive you can catch the soft crash from nvidia-smi. The script below checks card utilization, if it drops below 90% it counts down a minute and if mining hasn't resumed it reboots the system.
This seems to have worked at least once in my case (only got one soft crash this weekend) and the system recovered as expected.
the threshold values work for my setup but others may find different values optimal

Also, if anyone knows a way to iterate the if && statements, we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards". The way below also works, with manual editing to adjust the watchdog for the number of cards in your individual system.
___________
 
#!/bin/bash
#m1
threshold=90
# Countdown: 12 checks x 5 seconds = 1 minute of grace before a reboot
i=12
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave as is; for a different number of cards, remove or add && statements as in the commented example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ "$i" -le 0 ]
         then echo "$(date) REBOOT due to soft crash" >>~/watchdog.log
         sleep 5
         sudo shutdown -r now
        fi
done
___________
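
On the question of iterating the "if &&" part: one possible way is a small function that loops over however many readings it is given and fails on the first one at or below the threshold. This is only a sketch; all_above is a made-up name, and it has not been run on a rig.

```shell
#!/bin/bash
# Returns 0 only if every reading is above the threshold.
# $1 = threshold, remaining arguments = one utilization value per card.
all_above() {
  local threshold="$1"; shift
  # No readings at all (e.g. nvidia-smi hung) counts as a failure
  [ "$#" -gt 0 ] || return 1
  local value
  for value in "$@"; do
    [ "$value" -gt "$threshold" ] || return 1
  done
  return 0
}

# On a rig (not run here) the readings could come from:
#   all_above 90 $(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
```

This would replace the whole block of [[ "$N" -gt "$threshold" ]] lines, so the watchdog no longer needs editing per card count.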

Hey thats funny I just made a script doing something similar, although it checks the powerdraw.
Here it is:
Code:
#!/bin/bash

# Miner restart script V001
# By Maxximus007
# for nvOC by fullzero
#
# POWERLIMIT MUST BE SET IN oneBash

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
# Creating a log file to record restarts
# (LOG_FILE must be set before the first line that writes to it)
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT_LOW_POWER=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read POWERDRAW POWERLIMIT; } < <( nvidia-smi -i $gpu --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)

  let POWER_DIFF=$( printf "%.0f" $POWERLIMIT )-$( printf "%.0f" $POWERDRAW )

  # If current draw is 30 Watt lower than the limit count them:
  if [ "$POWER_DIFF" -gt "30" ]
  then
    let COUNT_LOW_POWER=COUNT_LOW_POWER+1
  fi

  let gpu=gpu+1
done

if [ $COUNT_LOW_POWER -eq $GPUS ]
then
  echo "$(date) - Power draw is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

You can combine the above with your code, and find the utilization like this:
Code:
nvidia-smi -i 1 --query-gpu=utilization.gpu --format=csv,noheader,nounits
You have to iterate the GPU, starting at 0 to get them all
Okay I've combined the two, perhaps this will work for most of us:
Code:
#!/bin/bash

# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
#

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
# Creating a log file to record restarts
# (LOG_FILE must be set before the first line that writes to it)
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"

MIN_UTIL=90
RESTART=0

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read UTIL; } < <( nvidia-smi -i $gpu --query-gpu=utilization.gpu --format=csv,noheader,nounits)

  let UTILIZATION=$( printf "%.0f" $UTIL )

  # If current utilizations lower than the limit count them:
  if [ $UTILIZATION -lt $MIN_UTIL ]
  then
    let COUNT=COUNT+1
  fi

  let gpu=gpu+1
done

if [ $COUNT -eq $GPUS ]
then
  if [ $RESTART -gt 1 ]
  then
    echo "$(date) - Utilization is too low: reviving did not work so restarting system" | tee -a ${LOG_FILE}
    sudo shutdown now -r
  fi
  echo "$(date) - Utilization is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
  let RESTART=RESTART+1
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

Pretty cool!  I'll try it tonight, lets hope this put the softcrash issues behind us.


I will try this out as well; good work.  Smiley

@ Maxximus007
Thanks for putting these together, great collab!
I'm not a bash expert, so maybe I'm reading this wrong, but here are some thoughts.
The combined code seems to be evaluating each gpu individually for the fault condition to be met, which means if one fails and you have say 5 other cards working then it keeps going until all the cards give reduced output, since all of them have to fail individually to increment the counter? So if 5/6 fail we keep going? (Again just looking at it and tracing it in my head so maybe I'm reading wrong).
The way I was thinking about it, is that I wanted all the cards to work at above 90% utilization and reboot as soon as any card strays below the threshold - this is why I did the "if and" statement and didn't iterate through "if" statements alone (I didn't know how to iterate "if and" based on an unknown number of cards lol). I had a version giving 6xOK and such but I think it's more efficient to just get 1xOK if ALL meet the 90% criteria and start the countdown as soon as anything is out of norm - and if the miner recovers, flush the counter. I observed a number of these conditions with Claymore where it recovers half the time, but then eventually craps out and the script kicks in. I haven't seen it on my Genoil rig yet since my other script has kept it in check without any softcrash for day 3 now.

A thought about the power draw as threshold measure - it is power limit/card specific and I guess people would need to tune their power threshold to their power limit so I agree it's best to use gpu util. (My cards are at 82W limit for example).
Thoughts?


  
The code checks each card individually. At times (with Claymore, not Genoil) I've seen Util (or Powerdraw) drop, maybe even below 90, for a few seconds. In order not to generate too many restarts, I only act when all cards are low. We can lower this or make it so that each of us can decide when it should reboot.
I've combined the restart/reboot so that the first attempt is to restart the miner. If that doesn't work, we reboot the machine. We might want to reset the reboot counter after a while, so we don't lose time with a full reboot.

In the first code I checked Powerdraw -> if 30 Watt less than Powerlimit there might be something wrong. Idling cards use around 10 Watt, so that works for all I think. We can combine this with Util if that helps.

So sure we can make it more advanced, we just have to determine the right parameters. Hope others can let us know in what circumstances they see hanging miners. Just one card, or more or everything? Is Util back to zero? or hanging on to 100%?
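
To make the "restart first, reboot second, forget old restarts after a while" idea concrete, here is a small sketch. The function names and the idea of counting consecutive healthy checks are assumptions for illustration, not part of the posted script.

```shell
#!/bin/bash
# Decide what to do after a failed check.
# $1 = restarts already attempted, $2 = restarts allowed before a reboot
next_action() {
  if [ "$1" -ge "$2" ]; then
    echo reboot
  else
    echo restart_miner
  fi
}

# Forget old restarts once the rig has been healthy long enough.
# $1 = current restart count, $2 = consecutive healthy checks so far,
# $3 = healthy checks needed before the count goes back to 0
decay_restarts() {
  if [ "$2" -ge "$3" ]; then
    echo 0
  else
    echo "$1"
  fi
}

# In the 60-second loop (not run here), something like:
#   RESTART=$(decay_restarts "$RESTART" "$HEALTHY" 60)   # about 1 hour
#   [ "$(next_action "$RESTART" 2)" = reboot ] && sudo shutdown -r now
```

With a reset like this, a miner restart from yesterday would not push today's first hiccup straight to a full reboot.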




OK thanks for the clarification, it's really neat and rewarding to see different approaches to this problem Cheesy
Here is why I coded to test that all the cards meet the threshold as one with "if &&": as an example I'll use an event from my test rig overnight: one card dropped, the "if &&" script waited for claymore to recover for one minute, then booted the system and that was that.
Total down time, 2 mins, if you add the 1 minute of reduced capacity waiting for the miner to right itself, 3 minutes impact.

The "if &&" code does test for a graceful miner recovery - by continuing to test the cards for above threshold utilization for 60 seconds after it detects a fault.
If the miner recovers, but just sits there (saw both Claymore/Genoil do exactly that a number of times) that's not good enough and the system gets a boot.
My other miner restart script did not handle this exact case and once every few days I would find the miner sitting pretty and blowing bubbles mining on one or two cards until I noticed because it did not "see" all the cards anymore but it did see some so it thought it "recovered".

If the miner recovers properly, all cards need to hit above threshold  and we can flush the counter and life goes on.
On my test rig, graceful miner recovery occurred 5-6 times in the past 24 hours without prompting a restart - which is desirable above either running at reduced capacity or 5-6 reboots (IMHO).
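
The countdown-and-flush behaviour described here boils down to one small step per check. A sketch, with tick_countdown as a made-up name; with a 5-second poll, a reset value of 12 gives the 60 seconds of grace mentioned above.

```shell
#!/bin/bash
# One step of the countdown.
# $1 = current countdown, $2 = "ok" or "fault", $3 = value to reset to
tick_countdown() {
  if [ "$2" = ok ]; then
    echo "$3"              # all cards above threshold: flush the counter
  else
    echo $(( $1 - 1 ))     # at least one card low: one step closer to reboot
  fi
}

# In the watchdog loop (not run here):
#   i=$(tick_countdown "$i" "$state" 12)
#   [ "$i" -le 0 ] && sudo shutdown -r now
```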

In contrast - if we test each card independently and increment the error counter one by one until it reaches the number of GPUs, then - depending on the number of cards in the system - it could take a long time for all of them to fail: the more cards, the more time to fail (right? am I misunderstanding anything?) So the same event would unfold differently: the test rig would continue at reduced capacity until COUNT reaches the # of GPUs - but since it resets at the next check, we can hobble on 5, 4, 3, 2, 1 cards until they all die and the script kicks in, or we freeze and require manual intervention. This could be hours of impact (again if I'm reading this wrong, my apologies, but this is what I'm getting out of looking at it.)

So IMHO, by testing that all the cards meet the 90% utilization threshold (as one, all or nothing = if &&), we avoid hours of impact/decreased capacity. My other concern is that as soon as cards start dropping off one at a time the system gets unstable, increasing the risk of a hang or corrupted file system due to a hard crash.
My view is that it should be cycled at maximum stability for a graceful restart.

Maybe there is a third approach not considered yet, Thoughts?

... edit:
Actually one more thought - I did not test for this yet so I don't know the answer - but in the case where the miner does not see all the cards anymore, does this mean that nvidia-smi ALSO does not see all the cards anymore? If so, and if we get the number of cards from nvidia-smi, wouldn't the script assume that the rig has the right number of cards every time that nvidia-smi stop seeing one? I do recall cards disappearing even from nvidia-smi but I never kept track of this so I don't know how often this condition actually occurs.
  
Thanks for explaining, and you do have valid points here. Like your thinking. I will rework it with this in mind.

Just wondering: Your script reboots the rig, if the miner itself does not recover. Instead we could introduce reloading miner as the first step here. In my experience that resolves the issue almost every time. It will only save 1-2 minutes so it's not a big deal to just reboot (still had the boot time of V0014 in mind).

I have not seen nvidia-smi lose a card while it's there, but I can imagine that happening with faulty risers. Perhaps we can get the card count from nvidia-smi only once at startup (saves a call as well) and keep that number during the watchdog process. If we lose a card we have to reboot anyway.
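
A sketch of the "count once at startup" idea: remember how many cards were there when the watchdog started and compare against what is visible now. card_lost is a made-up name, and this is untested against a rig that actually drops a card.

```shell
#!/bin/bash
# Returns 0 when fewer cards are visible now than at startup.
# $1 = card count recorded at startup, $2 = card count seen now
card_lost() {
  [ "$2" -lt "$1" ]
}

# On a rig (not run here):
#   EXPECTED=$(nvidia-smi -L | wc -l)     # once, before the loop
#   NOW=$(nvidia-smi -L | wc -l)          # inside the loop
#   card_lost "$EXPECTED" "$NOW" && sudo shutdown -r now
```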

One other thought: Perhaps it would be an idea to echo the output of the log to a screen (tail -f) so the former reboots are shown as well?


Hi, if it's a Genoil rig I run this setup: https://bitcointalksearch.org/topic/m.19943144 plus the watchdog script being discussed here in separate "screen -dmS" sessions so I have the watchdog and restart scripts running separately.
for that setup I also tail the "ltail" script but if we run only one script then it would make sense to echo some diagnostic output of what faults and recoveries it detects (or log it - but then we need to think about logrotate or someone will run out of space in a few months lol).
For the Claymore setup I only run the watchdog, since Claymore has its own fault detection and restarts by itself. If the built-in restart doesn't work, I cycle the box and log the reboot condition only, so I don't have to logrotate.
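
If setting up logrotate feels like too much for a single watchdog log, a simpler stand-in is to keep only the last N lines of the file. Note this is a swapped-in alternative, not logrotate itself; trim_log is a made-up helper name and the line limit is arbitrary.

```shell
#!/bin/bash
# Keep only the last $2 lines of log file $1.
trim_log() {
  local file="$1" max="$2" tmp
  [ -f "$file" ] || return 0          # nothing to trim yet
  tmp=$(mktemp) || return 1
  # Write back with cat so the original file (and its permissions) stays in place
  tail -n "$max" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# In the watchdog loop (not run here), e.g. once per pass:
#   trim_log /home/m1/restartlog.txt 1000
```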
full member
Activity: 153
Merit: 100
July 11, 2017, 12:48:23 PM
Hey fullzero, i have a question,

without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it, everything gets stuck, SSH barely works, average system load jumps to 14.5!! and Xorg takes up 100% of the CPU, its so bad that none of the standard reboot commands work, they just do nothing, the only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger" so i've set up a script that checks the average system load and if its over 2 it uses the command to reboot, and it works, but i dont like this "solution", yesterday after a reboot nvOC got corrupted somehow, lost my customized oneBash and the whole system became read-only (thankfully i had a oneBash backup that was only a few days behind).

so the question is, what can i do to relive this Xorg error, i run a 7 card rig and never plan on going for a higher number, what can i do with Xorg that would fix this?

Thanks.

@ tempgoga

It seems that whenever a soft crash occurs most of the cards drop to zero, so while the display/keyboard is unresponsive you can catch the soft crash from nvidia-smi. The script below checks card utilization, if it drops below 90% it counts down a minute and if mining hasn't resumed it reboots the system.
This seems to have worked at least once in my case (only got one soft crash this weekend) and the system recovered as expected.
the threshold values work for my setup but others may find different values optimal

Also if anyone knows a way to iterate the if && statements we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards" but the way below also works with manual editing to adjust the watchdog for the number of cards in you individual system.
___________
 
#!/bin/bash
#m1
threshold=90
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave is as, if a different number of cards remove or add the && statements as needed as in the example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ $i -le 0 ]
         then echo $(date) REBOOT due to soft crash >>~/watchdog.log
         sleep -5
         sudo shutdown now -r
        fi
done
___________

Hey thats funny I just made a script doing something similar, although it checks the powerdraw.
Here it is:
Code:
#!/bin/bash

# Miner restart script V001
# By Maxximus007
# for nvOC by fullzero
#
# POWERLIMIT MUST BE SET IN oneBash

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
echo "$(date) - Starting miner restart script." | tee -a ${LOG_FILE}
# Creating a log file to record restarts
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT_LOW_POWER=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read POWERDRAW POWERLIMIT; } < <( nvidia-smi -i $gpu --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)

  let POWER_DIFF=$( printf "%.0f" $POWERLIMIT )-$( printf "%.0f" $POWERDRAW )

  # If current draw is 30 Watt lower than the limit count them:
  if [ "$POWER_DIFF" -gt "30" ]
  then
    let COUNT_LOW_POWER=COUNT_LOW_POWER+1
  fi

  let gpu=gpu+1
done

if [ $COUNT_LOW_POWER -eq $GPUS ]
then
  echo "$(date) - Power draw is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill ps -ef | awk '$NF~"oneBash" {print $2}'
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

You can combine the above with your code, and find the utilization like this:
Code:
nvidia-smi -i 1 --query-gpu=utilization.gpu --format=csv,noheader,nounits
You have to iterate the GPU, starting at 0 to get them all
Okay I've combined the two, perhaps this will work for most of us:
Code:
#!/bin/bash

# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
#

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
echo "$(date) - Starting miner restart script." | tee -a ${LOG_FILE}
# Creating a log file to record restarts
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi

MIN_UTIL=90
RESTART=0

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read UTIL; } < <( nvidia-smi -i $gpu --query-gpu=utilization.gpu --format=csv,noheader,nounits)

  let UTILIZATION=$( printf "%.0f" $UTIL )

  # If current utilizations lower than the limit count them:
  if [ $UTILIZATION -lt $MIN_UTIL ]
  then
    let COUNT=COUNT+1
  fi

  let gpu=gpu+1
done

if [ $COUNT -eq $GPUS ]
then
  if [ $RESTART -gt 1 ]
  then
    echo "$(date) - Utilization is too low: reviving did not work so restarting system" | tee -a ${LOG_FILE}
    sudo shutdown now -r
  fi
  echo "$(date) - Utilization is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill ps -ef | awk '$NF~"oneBash" {print $2}'
  let RESTART=RESTART+1
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

Pretty cool!  I'll try it tonight, lets hope this put the softcrash issues behind us.


I will try this out as well; good work.  Smiley

@ Maxximus007
Thanks for putting these together, great collab!
I'm not a bash expert, so maybe I'm reading this wrong, but here are some thoughts.
The combined code seems to be evaluating each GPU individually for the fault condition to be met, which means that if one fails and you have, say, 5 other cards working, then it keeps going until all the cards give reduced output, since all of them have to fail individually to increment the counter? So if 5/6 fail we keep going? (Again, just looking at it and tracing it in my head, so maybe I'm reading it wrong.)
The way I was thinking about it is that I wanted all the cards to work at above 90% efficiency and reboot as soon as any card strays beyond the threshold. This is why I did the "if and" statement and didn't iterate through "if" statements alone (I didn't know how to iterate "if and" based on an unknown number of cards lol). I had a version giving 6xOK and such, but I think it's more efficient to just get 1xOK if ALL meet the 90% criteria and start the countdown as soon as anything is out of norm, and if the miner recovers, flush the counter. I observed a number of these conditions with Claymore where it recovers half the time, but then eventually craps out and the script kicks in. I haven't seen it on my Genoil rig yet since my other script has kept it in check without any softcrash for day 3 now.
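
For reference, the "all cards or nothing" check with a counter that is flushed on recovery could look roughly like the sketch below. It is not taken from either script in this thread: MIN_UTIL is reused from the combined script, while MAX_STRIKES, all_cards_ok and next_strikes are names made up for illustration. It assumes nvidia-smi prints one integer utilization value per card when called without -i.

```shell
#!/bin/bash
# Sketch only: MAX_STRIKES and the function names are illustrative assumptions.
MIN_UTIL=90      # per-card utilization threshold in percent
MAX_STRIKES=1    # consecutive failed checks tolerated before acting

# Succeeds only if every utilization value passed as an argument meets MIN_UTIL.
# Fails when no values are passed, or when a value is not a number.
all_cards_ok() {
  [ "$#" -gt 0 ] || return 1
  local util
  for util in "$@"; do
    [ "$util" -ge "$MIN_UTIL" ] 2>/dev/null || return 1
  done
  return 0
}

# Prints the new strike count: 0 if all cards are fine, previous count + 1 otherwise.
#   $1 = previous strike count, remaining arguments = one utilization value per card
next_strikes() {
  local strikes=$1
  shift
  if all_cards_ok "$@"; then
    echo 0
  else
    echo $((strikes + 1))
  fi
}

# Intended use inside the watchdog loop (not executed here):
#   UTILS=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
#   STRIKES=$(next_strikes "$STRIKES" $UTILS)
#   [ "$STRIKES" -gt "$MAX_STRIKES" ] && sudo shutdown -r now
```

With this shape the number of cards never has to be known in advance: any single card below the threshold adds a strike, and one clean check resets the count to zero.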

A thought about the power draw as threshold measure - it is power limit/card specific and I guess people would need to tune their power threshold to their power limit so I agree it's best to use gpu util. (My cards are at 82W limit for example).
Thoughts?


  
The code checks each card individually. At times (with Claymore, not Genoil) I've seen Util (or power draw) drop, maybe even below 90, for a few seconds. To avoid generating too many restarts, I only act when all cards are below the limit. We can lower this, or make it configurable so that each of us can decide when it should reboot.
I've combined the restart/reboot so that the first attempt is to restart the miner. If that doesn't work, we reboot the machine. We might want to reset the reboot counter after a while, so we don't lose time with a full reboot.
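
Resetting the counter after a while could be sketched as below. This is only an illustration of the idea, not part of the posted script: RESET_AFTER, LAST_RESTART and decayed_restarts are hypothetical names, and the one-hour window is an arbitrary assumption.

```shell
#!/bin/bash
# Sketch only: RESET_AFTER and the function name are illustrative assumptions.
RESET_AFTER=3600   # seconds without a restart before the counter is forgotten

# Prints the restart counter to keep using.
#   $1 = current counter, $2 = epoch time of the last restart, $3 = current epoch time
decayed_restarts() {
  local restarts=$1 last_restart=$2 now=$3
  if [ "$restarts" -gt 0 ] && [ $((now - last_restart)) -ge "$RESET_AFTER" ]; then
    echo 0
  else
    echo "$restarts"
  fi
}

# Intended use in the "All good!" branch of the loop (not executed here):
#   RESTART=$(decayed_restarts "$RESTART" "$LAST_RESTART" "$(date +%s)")
# and wherever the miner is restarted:
#   LAST_RESTART=$(date +%s)
```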

In the first code I checked power draw: if it is 30 Watt or more below the power limit, there might be something wrong. Idling cards use around 10 Watt, so that works for all I think. We can combine this with Util if that helps.
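
A rough sketch of that power draw test, for comparison with the utilization check. The first script is not shown on this page, so this is a reconstruction under assumptions: POWER_MARGIN and power_ok are made-up names, and the values are expected in the form nvidia-smi reports for the power.draw and power.limit query fields (watts, possibly with decimals).

```shell
#!/bin/bash
# Sketch only: POWER_MARGIN and power_ok are illustrative assumptions.
POWER_MARGIN=30   # watts below the power limit that counts as suspicious

# Succeeds if the draw is less than POWER_MARGIN watts below the limit.
#   $1 = power draw in watts, $2 = power limit in watts (decimals are truncated)
power_ok() {
  local draw=${1%.*} limit=${2%.*}
  [ "$draw" -gt $((limit - POWER_MARGIN)) ]
}

# Intended use (not executed here), one output line per card:
#   while IFS=', ' read -r DRAW LIMIT; do
#     power_ok "$DRAW" "$LIMIT" || echo "card below power threshold"
#   done < <(nvidia-smi --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)
```

Because the margin is relative to each card's own limit, a card limited to 82 W is flagged below roughly 52 W, which still sits well above the ~10 W idle figure mentioned above.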

So sure we can make it more advanced, we just have to determine the right parameters. Hope others can let us know in what circumstances they see hanging miners. Just one card, or more or everything? Is Util back to zero? or hanging on to 100%?




OK, thanks for the clarification, it's really neat and rewarding to see different approaches to this problem.
Here is why I coded to test that all the cards meet the threshold as one with "if &&": as an example I'll use an event from my test rig overnight: one card dropped, the "if &&" script waited for claymore to recover for one minute, then booted the system and that was that.
Total down time, 2 mins, if you add the 1 minute of reduced capacity waiting for the miner to right itself, 3 minutes impact.

The "if &&" code does test for a graceful miner recovery, by continuing to test the cards for above-threshold utilization for 60 seconds after it detects a fault.
If the miner recovers, but just sits there (saw both Claymore/Genoil do exactly that a number of times), that's not good enough and the system gets a boot.
My other miner restart script did not handle this exact case, and once every few days I would find the miner sitting pretty and blowing bubbles, mining on one or two cards until I noticed, because it did not "see" all the cards anymore but it did see some, so it thought it "recovered".

If the miner recovers properly, all cards need to hit above threshold, and we can flush the counter and life goes on.
On my test rig, graceful miner recovery occurred 5-6 times in the past 24 hours without prompting a restart - which is desirable above either running at reduced capacity or 5-6 reboots (IMHO).

In contrast, if we test each card independently and increment the error counter one by one until it reaches the number of GPUs, then, depending on the number of cards in the system, it could take a long time for all of them to fail: the more cards, the more time to fail (right? am I misunderstanding anything?). So the same event would unfold differently: the test rig would continue at reduced capacity until COUNT reaches the number of GPUs, but since COUNT resets at the next check, we can hobble on with 5, 4, 3, 2, 1 cards until they all die and the script kicks in, or we freeze and require a manual intervention. This could be hours of impact (again, if I'm reading this wrong, my apologies, but this is what I'm getting out of looking at it).

So IMHO, by testing that all the cards meet the 90% utilization threshold (as one, all or nothing = if &&), we avoid hours of impact/decreased capacity. My other concern is that as soon as cards start dropping off one at a time the system gets unstable, increasing the risk of a hang or corrupted file system due to a hard crash.
My view is that it should be cycled at maximum stability for a graceful restart.

Maybe there is a third approach not considered yet, Thoughts?

... edit:
Actually one more thought. I did not test for this yet so I don't know the answer, but in the case where the miner does not see all the cards anymore, does this mean that nvidia-smi ALSO does not see all the cards anymore? If so, and if we get the number of cards from nvidia-smi, wouldn't the script assume that the rig has the right number of cards every time that nvidia-smi stops seeing one? I do recall cards disappearing even from nvidia-smi but I never kept track of this so I don't know how often this condition actually occurs.
  
Thanks for explaining, and you do have valid points here. Like your thinking. I will rework it with this in mind.

Just wondering: Your script reboots the rig, if the miner itself does not recover. Instead we could introduce reloading miner as the first step here. In my experience that resolves the issue almost every time. It will only save 1-2 minutes so it's not a big deal to just reboot (still had the boot time of V0014 in mind).

I have not seen nvidia-smi lose a card while it is physically there, but I can imagine that happening with faulty risers. Perhaps we can query the number of cards from nvidia-smi only once at startup (which saves a call as well) and keep that number for the duration of the watchdog process. If we lose a card we have to reboot anyway.
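
A minimal sketch of that startup count, assuming the same count query the combined script already uses. EXPECTED_GPUS, count_gpus and cards_present are hypothetical names, and whether nvidia-smi really reports a lower count when a riser fails is untested here.

```shell
#!/bin/bash
# Sketch only: the function and variable names are illustrative assumptions.

# Prints the number of GPUs nvidia-smi currently reports
# (same query as in the combined script above).
count_gpus() {
  nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1
}

# Succeeds if the current count still matches the count taken at startup.
#   $1 = count at startup, $2 = current count
cards_present() {
  [ "$2" -eq "$1" ]
}

# Intended use (not executed here):
#   EXPECTED_GPUS=$(count_gpus)          # once, before the while loop
#   ...
#   cards_present "$EXPECTED_GPUS" "$(count_gpus)" || sudo shutdown -r now
```

Note that the check inside the loop still calls nvidia-smi, so this only saves a call if the per-card utilization query is used to derive the current count instead.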

One other thought: Perhaps it would be an idea to echo the output of the log to a screen (tail -f) so the former reboots are shown as well?
full member
Activity: 153
Merit: 100
July 11, 2017, 12:28:37 PM
Hi guys,

Using nvOC with latest updates. Great work! I like Salfter nicehash profit switch but i'm getting errors after 10-15 mins which locks the miner.

CUDA error in func 'search'at line 346: an illegal memory access was encountered.

How can i solve that?
This sounds like too much OC. Try lower clocks to see if it resolves itself.
newbie
Activity: 2
Merit: 0
July 11, 2017, 11:39:15 AM
Hi guys,

Using nvOC with latest updates. Great work! I like Salfter nicehash profit switch but i'm getting errors after 10-15 mins which locks the miner.

CUDA error in func 'search'at line 346: an illegal memory access was encountered.

How can i solve that?
sr. member
Activity: 1414
Merit: 487
YouTube.com/VoskCoin
July 11, 2017, 09:06:03 AM
I tried mining zec on slushpool. Nice site, but getting the workers to populate correctly with nvOC was frustrating, and the returns are not as high or as consistent as on other pools. However, I did earn .33 zec when I normally earn .18 zec on their pool with 6.5k sols.

What pools are you all using to mine ZEC currently
sr. member
Activity: 1414
Merit: 487
YouTube.com/VoskCoin
July 11, 2017, 08:36:18 AM
Is it possible to run nicehash equihash algo in this version ? If so how exactly do I implement it?

Also, with all things the same, my miners are performing a hundred sols faster going from 15 to 17. Furthermore, @fullzero, as you suggested, this software did fix my failed fan setting issue, at least so far, and I did not activate the slow USB command as these Lexars seem to be fast USB!