
Topic: How to reboot a rig if it stops mining? - page 2. (Read 11967 times)

newbie
Activity: 14
Merit: 0
September 03, 2017, 07:35:08 PM
#16
As of v9.6, Claymore has a -minspeed option: if the ETH hashrate stays below the given speed for 5 minutes, the miner is restarted (or reboot.bat is executed if -r 1 is set), so it can reboot the system for you.
legendary
Activity: 1078
Merit: 1011
September 01, 2017, 08:29:44 AM
#15
Most of these issues stem from rig instability. Instead of trying to find elaborate workarounds to reboot your rig when the instability causes it to lock up or crash, you may be better off tracking down the reason it is unstable in the first place. Most of the time this comes down to heat and power, and occasionally bad drivers or some other cause. Heat and power problems can be direct: the cards running too hot, or a borderline power issue such as a weak PSU or loose or undersized wires or connectors. They can also be indirect, as a result of too much overclocking or too much undervolting.

Anyway, I think if you can track down the culprit your issues will go away, and any loss of hashrate (for instance, if you need to back down on the overclock) would be more than offset by your increased stability and uptime. I, and I am sure many others, have rigs that go for weeks or months without needing a reboot, and even then it is usually because of planned maintenance (software upgrade, scheduled cleaning, etc.) rather than something unexpected.
hero member
Activity: 1274
Merit: 556
September 01, 2017, 07:59:11 AM
#14
You can also use speedfan.
I've got it running on one of my rigs. It's set to monitor the temps of the GPUs, and if either one stays below 55C for at least 5 minutes, the rig is rebooted.
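For anyone who would rather script this than configure SpeedFan, the rule above (reboot once a GPU has stayed below 55C for at least 5 minutes) can be sketched roughly as follows. This is only the timing logic, in Python; the class name LowTempWatchdog and its fields are made up for illustration, and reading the actual temperature and issuing the reboot are left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LowTempWatchdog:
    """Signals a reboot once readings have stayed below a threshold long enough."""

    threshold_c: float = 55.0    # threshold from the post; adjust per rig
    hold_seconds: float = 300.0  # "at least 5 minutes"
    _below_since: Optional[float] = None  # when the current cold streak began

    def update(self, temp_c: float, now: float) -> bool:
        """Feed one reading taken at time `now` (seconds); True means reboot."""
        if temp_c >= self.threshold_c:
            # The card is warm again, so any cold streak is over.
            self._below_since = None
            return False
        if self._below_since is None:
            # First cold reading: remember when the streak started.
            self._below_since = now
        return now - self._below_since >= self.hold_seconds
```

One instance per GPU would be fed a reading every few seconds; a True return is the point at which something like shutdown /r would be called.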
newbie
Activity: 11
Merit: 0
September 01, 2017, 06:13:23 AM
#13
hero member
Activity: 552
Merit: 500
Anyone? It seems to have gotten worse in the last few updates..
hero member
Activity: 552
Merit: 500
I'm also having a lot of issues lately. The rig ran fine for days, and lately it's been getting the OpenGL GPU hang issue.

I tried a few things.

I created a reboot script that calls another script, which then launches the miner, but for some reason while relaunching the miner it gets stuck on creating a DAG for one.

I'm slowly adjusting the cards, but they ran fine for days. Here is a bit of my config, which I'm also tweaking. It's becoming pretty annoying. I wish someone had a program to watch this and kill it properly, wait, then relaunch. I see the script above, but I'm not going that route yet.

reboot.bat file

Code:
timeout /t 5 > NUL
rem The empty "" is the window title; without it, start treats the quoted path as the title
start "" "c:\Claymore\reboot-miner.bat"
exit 0

reboot-miner.bat

Code:
taskkill /IM EthDcrMiner64.exe /F
timeout /t 15 > NUL
start "" "c:\Claymore\start-classic.bat"
exit 0

The /F flag makes sure the miner is killed forcibly, and the script then waits 15 seconds before relaunching.

The issue is that this does kill the miner, but 9 out of 10 times the relaunch gets stuck at creating the DAG. I'm assuming some GPU memory or the driver hasn't reset yet.

Also, the reboot.bat command prompt doesn't close even though exit 0 is added to it. Any thoughts?
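The kill, wait, relaunch sequence in the two batch files could also be written as one small Python function, sketched here under assumptions: restart_miner and its parameters are hypothetical names, the 15 second default simply copies the timeout above, and nothing here is known to cure the DAG hang.

```python
import subprocess
import time
from typing import Callable, Sequence


def restart_miner(
    image_name: str,
    start_cmd: Sequence[str],
    wait_seconds: float = 15.0,
    run: Callable[[Sequence[str]], object] = subprocess.run,
    launch: Callable[[Sequence[str]], object] = subprocess.Popen,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Force-kill the miner by image name, pause, then start it again."""
    # Same call the batch file makes: taskkill /IM <name> /F
    run(["taskkill", "/IM", image_name, "/F"])
    # Pause before relaunching; whether 15 seconds is long enough for the
    # driver to recover is a guess carried over from the batch file.
    sleep(wait_seconds)
    # Start the miner without waiting for it to finish.
    launch(list(start_cmd))
```

A call might look like restart_miner("EthDcrMiner64.exe", ["cmd", "/c", r"c:\Claymore\start-classic.bat"]). The run, launch and sleep parameters exist only so the sequence can be dry-run without touching a real miner.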

Here are some config vars I'm using

Code:
EthDcrMiner64.exe -wd 0 -mport 0 -r 1 -dcri 26 -tt 70 -eres 0

Testing now with -wd 0, to at least not restart the miner and get stuck, until I can figure out how to properly reset everything and not have it hang on DAG creation at restart.

-r 1 calls the reboot.bat file.
-dcri is a value I also play with.

Temps are fine.

I'm also going to test -ethi 6 (the default is 8).

Does anyone have any other thoughts or suggestions? I'm using the latest miner. I'm sure Claymore is aware of this, and it would be good to fix the hanging miner restart.

Cheers!
legendary
Activity: 4354
Merit: 9201
'The right to privacy matters'
March 29, 2017, 06:33:07 PM
#10
I'm using modded RX480 8GB cards with Samsung memory, and out of 24 cards, 2 on different rigs cause problems. Yes, one option is to revert to the stock BIOS, but I'm looking for another solution since this happens maybe once per day. Win 10 Pro.

Claymore's Dual Eth+Dcr miner shows 28.4MH/s for all 5 cards in the rig, and randomly, as I said, once per day, one card in the setup starts to drop to 21MH/s and eventually, after a few minutes, to 0MH/s. Then the miner's watchdog tries to restart mining, which goes well, but it is stuck on the list of GPUs and doesn't start mining. The solution is to manually close the miner and just start it again.

EDIT: To make it clear, the rig doesn't freeze. Only the miner stops mining. So I can manually restart the miner WITHOUT rebooting the rig, but rebooting was one option to restart the miner as well.

Now I'm searching for a solution which would reboot the whole rig (or just kill the miner process and start it again) if the GPU temperature falls below, say, 60°C ... something opposite to the overheating protection :)

Any ideas?

Yeah, set Claymore to -ethi 6 and -dcri 20. See if it holds.

This would be in your bat file.

If you look at page 1, post one, Claymore has many options. The best thing is to get it stable.

Auto reboot is the last choice.

If you read all the settings he has, you can isolate a single card and do a lot of other stuff.

That one card is an underperforming card.
sr. member
Activity: 689
Merit: 253
March 29, 2017, 06:25:40 PM
#9
I use simple rig resetter for this; it's on this forum and on simplemining.net.
newbie
Activity: 53
Merit: 0
March 29, 2017, 05:59:10 PM
#8
Thank you all, I'll first try with -r 1 and reboot.bat, if it doesn't work then I'll try the PowerShell script!

Hopefully the simpler solution works well :)
legendary
Activity: 1726
Merit: 1018
March 29, 2017, 04:42:39 PM
#7
I have a DIY solution for this that you can use. Basically, I made a PowerShell script that checks GPU usage percentage and, based on the results, reboots the computer, although you can modify it to restart the mining program instead. I also have something I use to restart the mining program instead of rebooting the computer, since sometimes that is enough.

The difficulty with restarting the mining app is shutting it down. I use something called closeprog.exe to shut down the mining application. I set a scheduled task that runs a script when the video driver crashes. That script uses closeprog.exe to kill the mining app and then reruns it. I can't remember where I got closeprog.exe; I have had it for years and used it for various scripted things. You can probably figure out how to use taskkill to do that also.

To monitor GPU usage and reboot when a specific GPU drops out I use openhardwaremonitor. You can get that here: http://openhardwaremonitor.org/
It basically exposes the GPU sensor stats to Windows Management Instrumentation (WMI), which PowerShell can work with. The PowerShell script runs from a scheduled task every 10 minutes. It cycles through all of the GPUs looking for low usage results. If it gets 9 low results in a row it reboots the whole rig. The reason I do 9 results is that if a driver crash causes the other scheduled task to restart the mining application while this script is testing GPU usage, I can sometimes wind up with some low readings while things are being restarted by the other task. Also, GPU usage dips normally during certain work restarts from the pool. So I want to be sure I am really seeing a consistently low result before I reboot. The PS script is this:

Code:
$Log = "LogFile.log"
$Date = Get-Date
$TestValue = 0

#Test if openhardwaremonitor is running and if not, start it
$ProcessName = "openhardwaremonitor"

if((Get-Process $ProcessName -ErrorAction SilentlyContinue) -eq $Null)
    { Start-Process -FilePath ".\OpenHardwareMonitor\OpenHardwareMonitor.exe" -WindowStyle Minimized }
else
    { echo "Process is already running" }

#if the computer just started it will get zeros while the miner is still getting the dag file ready so we wait
Start-Sleep 120

#Run nine tests; each test reads the load of every GPU and counts every reading under 10%
ForEach($Test In 1..9)
     {
      #if we have bad timing on a driver crash (and recovery) or work restarts we may get low results so we wait between tests
      if($Test -gt 1) { Start-Sleep 20 }

      $GPULoads = Get-WmiObject -namespace root\openhardwaremonitor -class sensor | Where-Object {$_.SensorType -Match "load" -and $_.Identifier -like "*gpu*"}

      ForEach($GPU In $GPULoads)
           {$GPULoadValue = $GPU.value
            if($GPULoadValue -lt 10)
              {Write-Host $GPULoadValue "seems low"
               #"$Date - Low result obtained $GPULoadValue" >> $Log
               $TestValue = $TestValue + 1
              }
            else
              {Write-Host $GPULoadValue "seems fine"}
           }
     }

#TestValue counts every low reading across the nine tests; more than 8 triggers a restart
if($TestValue -gt 8)
     {"$Date - Obtained $TestValue low results - Seems dead - restarting" >> $Log
      Restart-Computer -force
     }
else
     {"$Date - Obtained $TestValue low results - Seems ok" >> $Log}

Each of the nine GPU tests checks all GPUs, and any reading below 10% increments the TestValue variable. The idea is that a TestValue of nine at the end means each of the nine tests found at least one GPU reporting a usage of under 10%. Since there are also pauses between tests, that works pretty well for me in terms of only doing a reboot when a GPU has well and truly crashed. The openhardwaremonitor app has to be in a subdirectory called OpenHardwareMonitor underneath where this PowerShell script runs (or you need to change the path). First it checks if that app is running, and if it isn't, it starts it. That way you never have to bother with making sure the openhardwaremonitor app is running. The script also has a long pause at the beginning (Start-Sleep 120, i.e. two minutes) so that if the computer did get rebooted and the script runs right after the reboot, it waits for the miner apps to get started (because you have those start automatically at boot, right?).

To run a PowerShell script from a scheduled task I use a batch file to run the PS script (convoluted, I know, but it works). The batch file that runs the PS script is in the same directory as the PS script and it looks like this:
powershell.exe .\GPU_Monitor.ps1

EDIT: I was just looking over my log file for this and I see that it actually increments the TestValue variable for every single low reading, so if you have 6 GPUs (I don't) it could hit 9 low results quite easily if the test ran at an inopportune moment (like during a video driver crash). You can change the value in the line towards the end to set how many low results are needed to initiate a reboot. The line that determines that is
Code:
if($TestValue -gt 8)
Just change 8 to whatever you like. You can also comment out the restart by putting a # in front of it. I ran this for a couple of days with the restart command commented out to be sure it was really only going to restart when I wanted it to. When I was satisfied that it wasn't going to cause a lot of unnecessary reboots, I removed the comment on that line so it could restart the rig. But looking at my log for the past few days, it looks like the script could use some refinement.
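One possible refinement, sketched in Python rather than PowerShell and not part of the script above: count how many of the nine tests saw at least one low GPU, instead of counting every low reading. The function names and the sample numbers are illustrative assumptions.

```python
from typing import Sequence


def count_low_passes(passes: Sequence[Sequence[float]], threshold: float = 10.0) -> int:
    """Count the passes in which at least one GPU load reading is below threshold."""
    return sum(1 for loads in passes if any(load < threshold for load in loads))


def should_reboot(passes: Sequence[Sequence[float]], threshold: float = 10.0) -> bool:
    """Reboot only if every single pass saw at least one low GPU."""
    return len(passes) > 0 and count_low_passes(passes, threshold) == len(passes)


# Hypothetical 6-GPU rig: a driver crash zeroes all cards for two passes,
# then everything recovers for the remaining seven.
crash_then_recover = [[0, 0, 0, 0, 0, 0]] * 2 + [[99, 98, 99, 97, 99, 99]] * 7
raw_low_readings = sum(load < 10 for loads in crash_then_recover for load in loads)
print(raw_low_readings)                      # per-reading count, as in the script above
print(count_low_passes(crash_then_recover))  # per-pass count
print(should_reboot(crash_then_recover))
```

With per-pass counting, a short driver crash on a rig with many GPUs no longer pushes the counter past the reboot limit by itself, while a card that stays dead through all nine passes still does.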
newbie
Activity: 1
Merit: 0
March 29, 2017, 02:48:48 PM
#6
I believe there is a command inside Claymore for this problem.
Code:
-r   Restart miner mode. "-r 0" (default) - restart miner if something wrong with GPU. "-r -1" - disable automatic restarting. -r >20 - restart miner if something
   wrong with GPU or by timer. For example, "-r 60" - restart miner every hour or when some GPU failed.
   "-r 1" closes miner and execute "reboot.bat" file ("reboot.bash" or "reboot.sh" for Linux version) in the miner directory (if exists) if some GPU failed.
   So you can create "reboot.bat" file and perform some actions, for example, reboot system if you put this line there: "shutdown /r /t 5 /f".
newbie
Activity: 53
Merit: 0
March 29, 2017, 02:35:11 PM
#5
But the miner doesn't quit; if it exited there would be no problem. The internal watchdog restarts it without killing the process. It just stops everything and tries to start again in the same process, which results in it not starting again and being stuck on this screen:

https://i.imgur.com/JCOpGQy.jpg

It just remains stuck on that and doesn't continue mining till I close it manually by clicking the X button and starting it again...
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
March 29, 2017, 02:00:16 PM
#4
But the thing is, the miner doesn't freeze the rig; I can TeamViewer into it and just restart the miner. So no need for IP power switches... The thing is, I only notice that the mining stopped 10 hours after it stopped. I need something to do the resetting for me automatically...

Run the miner in a loop. On Windows, create a bat file like mining.bat:

:start
miner.exe....
goto start


That way when miner.exe stops, it will start again.


If the miner stops but the process doesn't end, you can use a second batch file (e.g. miner_restart.bat) with something like this:

:start
timeout /t 600
taskkill /t /f /im miner.exe
goto start


This will kill miner.exe every 600 seconds, and it will then restart because of the first bat.
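The two batch files can be folded into one supervisor, roughly like the Python sketch below. supervise and its parameters are hypothetical names; max_runtime=600 mirrors the 600 second timer above, and proc.kill() only ends the launched process itself, whereas taskkill with the tree flag also ends its children.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def supervise(
    cmd: Sequence[str],
    max_runtime: Optional[float] = 600.0,
    max_launches: Optional[int] = None,
    launch: Callable[[Sequence[str]], "subprocess.Popen"] = subprocess.Popen,
    pause: float = 5.0,
) -> int:
    """Relaunch cmd whenever it exits; kill it once it has run for max_runtime seconds."""
    launches = 0
    # max_launches=None loops forever, like the goto start loop.
    while max_launches is None or launches < max_launches:
        proc = launch(list(cmd))
        launches += 1
        try:
            # Returns early if the miner exits by itself.
            proc.wait(timeout=max_runtime)
        except subprocess.TimeoutExpired:
            # Still running after max_runtime: kill it so it gets relaunched.
            proc.kill()
            proc.wait()
        # Short pause so a miner that dies instantly does not spin the CPU.
        time.sleep(pause)
    return launches
```

As with the batch version, the timer kills a healthy miner too, so a long max_runtime (or None to disable the timer) may be preferable when the miner normally exits on failure.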
newbie
Activity: 53
Merit: 0
March 29, 2017, 11:55:37 AM
#3
But the thing is, the miner doesn't freeze the rig; I can TeamViewer into it and just restart the miner. So no need for IP power switches... The thing is, I only notice that the mining stopped 10 hours after it stopped. I need something to do the resetting for me automatically...
newbie
Activity: 31
Merit: 0
March 29, 2017, 11:52:13 AM
#2
Dakky,

Get yourself an IP power switch and set your BIOSes to start on power loss. Then it's just a matter of logging into the power switch and power cycling.

We actually know each other from another forum :)

If you want to see what it looks like, poke me.
newbie
Activity: 53
Merit: 0
March 29, 2017, 11:45:04 AM
#1
I'm using modded RX480 8GB cards with Samsung memory, and out of 24 cards, 2 on different rigs cause problems. Yes, one option is to revert to the stock BIOS, but I'm looking for another solution since this happens maybe once per day. Win 10 Pro.

Claymore's Dual Eth+Dcr miner shows 28.4MH/s for all 5 cards in the rig, and randomly, as I said, once per day, one card in the setup starts to drop to 21MH/s and eventually, after a few minutes, to 0MH/s. Then the miner's watchdog tries to restart mining, which goes well, but it is stuck on the list of GPUs and doesn't start mining. The solution is to manually close the miner and just start it again.

EDIT: To make it clear, the rig doesn't freeze. Only the miner stops mining. So I can manually restart the miner WITHOUT rebooting the rig, but rebooting was one option to restart the miner as well.

Now I'm searching for a solution which would reboot the whole rig (or just kill the miner process and start it again) if the GPU temperature falls below, say, 60°C ... something opposite to the overheating protection :)

Any ideas?