Update r455 (experimental) available:
- grep errors are suppressed
- Fixed a small bug that resets the failover counters on smartcoin restart
Sensitivity is still too high for me.
I have to mine on the manual profile. With failover, the miners are killed every 2-3 hours => I'm losing more with failover.
It looks like everybody has a different threshold. Can we set a number in the settings, or turn it off?
Or use a different algorithm?
For example:
If it detects a locked GPU, check again in 5 minutes, and again in another 5 minutes.
If all 3 checks are positive, reboot the PC.
Also, an option to choose between just killing smartcoin and rebooting the PC would be nice,
because just killing smartcoin only stops the GPUs that are not locked,
while a reboot will unlock the locked GPU and autostart mining again.
I can't reproduce anything locally...
Can you give me more information on what's happening? You mention both the failover system and the lockup detection, but I'm not sure which one you are talking about. They are two separate, independent systems (i.e. the lockup system runs no matter which profile is selected, and the failover system only runs when the failover profile is active). Are you sure you don't have a locked-up GPU? I couldn't imagine a miner running with absolutely no change in its output for 5 minutes straight.
Also, the new failover will trip a fail condition if stales go over 10% - do you know if this may be the case also?
Is there anything in the logs that may also give a clue?
Perhaps watching the miner output directly (screen -r miner) will give a clue as well - if the miner output sits there with absolutely no change, that pretty much means a locked GPU.
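For reference, the 10% stale condition mentioned above could be sketched like this in Python. This is only an illustration, not smartcoin's actual code: the function names are hypothetical, and it assumes the stale percentage is simply rejected shares divided by all submitted shares, which the thread does not confirm.

```python
STALE_LIMIT_PERCENT = 10.0  # threshold quoted in the post above


def stale_percent(accepted: int, rejected: int) -> float:
    """Rejected shares as a percentage of all submitted shares (assumed formula)."""
    total = accepted + rejected
    if total == 0:
        return 0.0  # nothing submitted yet, so nothing can be stale
    return 100.0 * rejected / total


def failover_tripped(accepted: int, rejected: int,
                     limit: float = STALE_LIMIT_PERCENT) -> bool:
    """True when the stale percentage goes over the limit."""
    return stale_percent(accepted, rejected) > limit
```

Under this assumption, a miner with 996 accepted and 4 rejected shares sits at 0.4% and would not trip the condition, while 85 accepted and 15 rejected (15%) would.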
manual profile = smartcoin off, running my own script
failover was working fine
lockup detection is killing the miners every 2-3 hours (too sensitive)
rejection rate 0.4%
nothing in the logs: just the generic . killing miners....overclocking...
GPU not locked. It was locked only once; now I'm making sure it is not.
"I couldn't imagine a miner running with absolutely no change in its output for 5 minutes straight."
A better example of what I mean:
(now, don't LOL yourself to death, this is the first pseudo-code of my life)
start
A=0
detect lockup routine (whatever formula you are using)
if lockup detected A=A+1
If lockup NOT detected go to start
wait 5 min
detect lockup routine
if lockup detected A=A+1
If lockup NOT detected go to start
wait 5 min
detect lockup routine
if lockup detected A=A+1
If lockup NOT detected go to start
if A=3 then reboot PC (lockup is 3x positive in 15 min = the GPU is probably really locked)
end
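The pseudo-code above can be sketched as runnable Python. This is a minimal illustration under assumptions, not anything smartcoin ships: `confirmed_lockup` and its parameters are hypothetical names, `detect` stands in for whatever lockup check is in use, and the reboot itself is left to the caller.

```python
import time
from typing import Callable


def confirmed_lockup(detect: Callable[[], bool],
                     checks: int = 3,
                     wait_seconds: float = 300.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if `detect` reports a lockup `checks` times in a row."""
    for attempt in range(checks):
        if not detect():
            # One clean check clears the suspicion ("go to start").
            return False
        if attempt < checks - 1:
            # Wait before re-checking; no wait is needed after the last check.
            sleep(wait_seconds)
    return True
```

A caller would loop on this and reboot (or just kill smartcoin) only when it returns True. The `sleep` parameter exists so the waits can be replaced in a test.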
Testing just once: there will be a higher chance of false positives or negatives, even if your formula is super smart.
The only downside: if the GPU really is locked, it will stay locked for 15 minutes before the PC reboots.
(5 min is just an example, but in case of failover, GPUs down, higher rejection, etc., it takes 2-3 min for the system to stabilize, so failover would not interfere with lockup detection.)
Or maybe an even better, easier version:
if failover starts, turn off lockup detection for 5 min
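The "turn off lockup detection for 5 min" idea could look roughly like this in Python. It is a sketch only: `LockupGuard` and its methods are hypothetical names, and how smartcoin would signal the start of a failover is an assumption.

```python
import time
from typing import Callable, Optional


class LockupGuard:
    """Suspends lockup detection for a grace period after a failover starts."""

    def __init__(self, grace_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.suspended_until: Optional[float] = None

    def failover_started(self) -> None:
        # Start (or restart) the grace period from now.
        self.suspended_until = self.clock() + self.grace_seconds

    def detection_enabled(self) -> bool:
        # Enabled if no failover was seen, or the grace period has elapsed.
        if self.suspended_until is None:
            return True
        return self.clock() >= self.suspended_until
```

The main loop would call `failover_started()` whenever failover kicks in and skip the lockup check while `detection_enabled()` returns False.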
Actually, the lockup detection is super simple. In pseudo-code (keep in mind each iteration is 5-8 seconds):
if(miner_output_last_iteration == miner_output_this_iteration)
    counter=counter+1
else
    counter=0
endif
if(counter>50)
    # card is locked
    exit
endif
Also consider that the "miner_output" is what is viewable in that miner's screen tab, including the hash rate, the accepted/rejected counters, and even the messages (accepted/rejected messages, new work messages, connection messages, etc.), so 5 minutes of absolutely no change in output is a very, very long time. Also realize that in the case of a failover or profile change, the miner output does change as far as the lockup detection is concerned, so the counter is reset, which already gives things 5 minutes to stabilize. Did you watch the miner screen instances to verify that they are not hung at all? How are you verifying that your GPU isn't locked? If it's taking 2-3 hours, it's definitely not a sensitivity issue.
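The counter logic described above can be written as a small runnable Python sketch. The class and method names are hypothetical, and capturing the miner's screen output is left out; only the compare-and-count behaviour from the pseudo-code is shown.

```python
from typing import Optional


class LockupDetector:
    """Flags a lockup after more than `limit` consecutive unchanged outputs."""

    def __init__(self, limit: int = 50) -> None:
        self.limit = limit
        self.counter = 0
        self.last_output: Optional[str] = None

    def update(self, miner_output: str) -> bool:
        """Feed one iteration's output; return True if the card looks locked."""
        if self.last_output is not None and miner_output == self.last_output:
            self.counter += 1
        else:
            # Any change at all (hash rate, a new message, ...) resets the count.
            self.counter = 0
        self.last_output = miner_output
        return self.counter > self.limit
```

With a limit of 50 and iterations of 5-8 seconds, the output has to stay identical for roughly 4 to 7 minutes before `update` returns True, and a single changed character starts the count over.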
Also, I'm still confused as to what you are saying is happening. You mention manual profile, and failover - which one? I guess what I need to know is:
- What profile are you running when this happens?
- Are you saying that when a true failover condition happens, it's being detected as a lockup condition 5 minutes later? (In fact, I can see where this can happen, as the failed instance continues to run so it can see when it "comes back up".) If this is the case, I think I see a solution.