
Topic: Smartcoin Linux mining administration. [MULTI-MACHINE SUPPORT NOW IN!] - page 15. (Read 105059 times)

full member
Activity: 238
Merit: 100
Update r468e now available!
- There is a new setting under edit settings: "Miner output format string". As discussed a couple of posts back, it lets you define how the miner output is displayed on the screen. Use the tags <#hashrate#>, <#accepted#>, <#rejected#>, <#rejected_percent#> where you want these displayed.
- Each miner instance now has a rejection percentage calculation


I'd still love to hear from anyone who modifies the "Miner output format string" setting about how it worked for them (jaebird, try making a narrower format string :) )
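
For anyone wondering how such a format string could be filled in, here is a minimal sketch. It is not smartcoin's actual code: the function name is made up, and the tag names are simply the ones listed in the release note above.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the smartcoin source): substitute the
# documented tags in a format string with per-miner values.
# $1 = format string, $2 = hashrate, $3 = accepted, $4 = rejected,
# $5 = rejected percent. Values must not contain "/" or "&" (sed specials).
render_miner_line() {
    printf '%s\n' "$1" | sed \
        -e "s/<#hashrate#>/$2/g" \
        -e "s/<#accepted#>/$3/g" \
        -e "s/<#rejected#>/$4/g" \
        -e "s/<#rejected_percent#>/$5/g"
}

# Sample values only.
render_miner_line '[H:<#hashrate#>] [A:<#accepted#>] [R:<#rejected#>] [<#rejected_percent#>%]' 202.97 120 1 0.83
# prints: [H:202.97] [A:120] [R:1] [0.83%]
```

Tags that are left out of the format string are simply never substituted, so a narrower string just drops fields.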
newbie
Activity: 56
Merit: 0
you don't even have to abbreviate


MHash/sec =MHash/s

[<#temp#> °C]
[<#accepted#> OK]
[<#rejected#> Bad]

so it will look like this:

Code:
Smartcoin r457e Tue Jul 19 01:15:56 EDT 2011
----------------------------------------------
Host: localhost
G0: 69.00°C  [load: 99.00%] [fan0 100.00%]
G1: 70.00°C  [load: 99.00%] [fan1 100.00%]
G2: 70.00°C  [load: 99.00%] [fan2 100.00%]
G3: 71.00°C  [load: 99.00%] [fan3 100.00%]
CP  55.00°C  [load: 00.78%] [fanC 059.00%]

Profile: Failover
--------BTCGuild--------
G0: 202.97 Mhash/s [2 OK] [0 Bad] [RPC+LP]
G1: 202.93 Mhash/s [1 OK] [0 Bad] [RPC+LP]
G2: 202.96 Mhash/s [3 OK] [0 Bad] [RPC+LP]
G3: 202.99 Mhash/s [1 OK] [0 Bad] [RPC+LP]
CP: 020.80 MHash/s [0 OK] [0 Bad]

Total :
832.65 MHash/s [7 OK] [0 Bad] [0% Bad]


newbie
Activity: 56
Merit: 0
I was about to describe it better.
But you think and code faster than I can write.
Thanks
full member
Activity: 238
Merit: 100
Update r457(experimental) now available
- Improved logging on lockup detection (it will put a screen capture of the instance which caused the failover right into the log)
- you can now have a custom lockup script execute automatically on a lockup condition. Here you can send yourself an email, reboot your machine, restart smartcoin, etc. Just add a custom script, "lockup.sh" to your smartcoin install directory.

Lockups are pretty easy to test (if you want to test your lockup script). Just purposely overclock your card to the point it locks up pretty quickly. After 5 minutes or so, the lockup should be detected and you should see your lockup script get executed.
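
A minimal sketch of what such a lockup.sh might look like. It assumes smartcoin runs the script without arguments (the thread does not say what, if anything, is passed in), and the log path is a made-up default. The actions that restart anything are left commented out.

```shell
#!/bin/sh
# lockup.sh - example only. Assumption: smartcoin calls this script with no
# arguments. The log location below is a hypothetical choice.
LOCKUP_LOG="${LOCKUP_LOG:-/tmp/smartcoin-lockup.log}"

log_lockup() {
    # Append a timestamped line so lockups can be reviewed later.
    printf '%s lockup detected on %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" >> "$LOCKUP_LOG"
}

log_lockup

# Enable ONE of these once the logging above is confirmed to work:
# echo "GPU lockup" | mail -s "smartcoin lockup" you@example.com  # needs a configured mailer
# sudo reboot                                                      # needs passwordless sudo
```

Starting with logging only makes it easy to confirm that the hook fires before letting it reboot the box.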
full member
Activity: 238
Merit: 100
Yes, I will make hooks to custom scripts to run on failover conditions and lockup conditions.  I like the approach of giving the user the ability to do exactly as they please on these conditions, instead of smartcoin assuming.  

I have also been thinking of ways to make the screen output smaller. Using abbreviations isn't a bad idea, and would be quite easy to do. Perhaps I'll make a template file or user setting that can be edited by the user to look something like:
Code:
[<#hashrate#> MHash/sec] [<#accepted#> Accepted] [<#rejected#> Rejected] [<#rejection_percent#>% Rejected]
so that if you prefer, you can change the output template to something like:
Code:
[H:<#hashrate#>] [A:<#accepted#>] [R:<#rejected#>] [%<#rejection_percent#>]
for a much narrower output.  I think I'll add this to the list of planned features!
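
The screen-size idea could be combined with the two templates above. The sketch below picks the narrow template when the terminal is 60 columns or less (the width from jaebird's 60x22 request). The variable and function names are invented for illustration, and this is not how smartcoin does it.

```shell
#!/bin/sh
# Hypothetical sketch: choose a format string by terminal width.
# The two templates are copied from the post above.
WIDE_FORMAT='[<#hashrate#> MHash/sec] [<#accepted#> Accepted] [<#rejected#> Rejected] [<#rejection_percent#>% Rejected]'
NARROW_FORMAT='[H:<#hashrate#>] [A:<#accepted#>] [R:<#rejected#>] [%<#rejection_percent#>]'

pick_format() {
    # $1 = terminal width in columns
    if [ "$1" -le 60 ]; then
        printf '%s\n' "$NARROW_FORMAT"
    else
        printf '%s\n' "$WIDE_FORMAT"
    fi
}

# tput cols reports the current width; fall back to 80 if it is unavailable
# or returns something that is not a number.
cols=$(tput cols 2>/dev/null || echo 80)
case "$cols" in
    ''|*[!0-9]*) cols=80 ;;
esac
pick_format "$cols"
```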
full member
Activity: 238
Merit: 100
Update r456(Experimental) available
- Fixes false lockup detection on failovers.  When a failover event happens, all previously failed profiles continue to run alongside the new ones that fall into the list.  On the failed profiles, the miner would every now and then output a message such as "Could not connect, will retry...". After a while, the entire screen would be filled with this message, and even though the message is pushed out again at regular intervals, the screen appears never to change, which triggers the lockup detection.  This fix enables lockup detection only on instances that are not part of a down profile (lockup detection will still work, however, as there will still be a working failover profile running at the bottom of the list)

Thanks plantucha for finding this bug!
member
Activity: 79
Merit: 10
@jondecker76,

I've been mining now for quite some time with this setup and it is working great. I like where you are going with the card lockup detection. Is the plan for it to call a script on lockup to allow us to decide what we want to do, i.e. reboot or exit smartcoin?

Also, I use openvpn and connectbot on my Android phone to monitor/admin miner "boxes" (I say boxes figuratively since there are no PC cases involved ;) ). I'm wondering if you can detect a screen size of 60x22 or smaller and abbreviate the output so it does not line wrap. For example experimental = exp, Accepted = Acc, Rejected = Rej. My screen size and my eyes make 60x22 a good size to see text, but the line wraps mess up the nice formatting.

Thanks again.

jaebird
full member
Activity: 238
Merit: 100
Actually, the lockup detection is super simple, in pseudo: (keep in mind each iteration is 5-8 seconds)
Code:
if(miner_output_last_iteration == miner_output_this_iteration)
     counter=counter+1
else
     counter=0
endif
if(counter>50)
     # card is locked
     exit
endif
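
The same counter logic as a small runnable shell sketch. The names are illustrative and not taken from the smartcoin source, and how the miner's screen text gets captured each iteration is not shown in this thread, so it is passed in as an argument here.

```shell
#!/bin/sh
# Sketch of the counter logic above; names are hypothetical.
LOCKUP_LIMIT=50     # 50 iterations of 5-8 seconds each is roughly 4 to 7 minutes
counter=0
last_output=""

# Call once per iteration with the current miner screen text.
# Returns 0 (true) once the text has been identical for more than
# LOCKUP_LIMIT consecutive iterations, otherwise 1.
check_lockup() {
    if [ "$1" = "$last_output" ]; then
        counter=$((counter + 1))
    else
        counter=0
    fi
    last_output=$1
    [ "$counter" -gt "$LOCKUP_LIMIT" ]
}
```

Call it directly (not inside `$(...)`), since a subshell would lose the counter between iterations.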


Also consider that the "miner_output" is what is viewable in that miner's screen tab (including the hash rate, accepted/rejected counters, and even messages such as accepted/rejected messages, new work messages, connection messages, etc.), so 5 minutes of absolutely no change in output is a very, very long time.  Also realize that in the case of a failover or profile change, the miner output does change as far as the lockup detection goes, so the counter is reset, which already gives 5 minutes for things to stabilize. Did you watch the miner screen instances to verify that they are not hung at all?  How are you verifying that your GPU isn't locked?  If it's taking 2-3 hours, it's definitely not a sensitivity issue.


Also, I'm still confused as to what you are saying is happening. You mention manual profile, and failover - which one?  I guess what I need to know is:
- What profile are you running when this happens
- Are you saying that when a true failover condition happens, it's being detected as a lockup condition 5 minutes later? (In fact, I can see where this can happen, as the failed instance continues to run so it can see when it "comes back up".) If this is the case, I think I see a solution :)
brand new
Activity: 0
Merit: 0
I am running r455(experimental), updated through the internal update tool.
After running smartcoin for several hours I receive this error:
Quote
Maximum number of clients reached
I am not sure what happened. This happens on two of my rigs at different times.
newbie
Activity: 56
Merit: 0
manual profile = smartcoin off, running my own script
failover was working fine
lockup detection is killing the miners every 2-3 hours (too sensitive)
rejection rate 0.4%
nothing in the logs, just the generic "killing miners... overclocking..."
GPU not locked. It happened only once; now I'm making sure it is not locked.


 I couldn't imagine a miner running with absolutely no change in its output for 5 minutes straight. 

A better example of what I mean:

(now, don't LOL yourself to death, this is the first pseudo-code of my life)

start
A=0
detect lockup routine                                 (whatever formula you are using)
if lockup detected A=A+1
if lockup NOT detected go to start
wait 5 min
detect lockup routine
if lockup detected A=A+1
if lockup NOT detected go to start
wait 5 min
detect lockup routine
if lockup detected A=A+1
if lockup NOT detected go to start
if A=3 then reboot PC                          (lockup is 3x positive in 15 min = the GPU is probably really locked)
end
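
The pseudo-code above could be written as a small shell function along these lines. This is only a sketch of the idea: the detection command and the reboot action are placeholders supplied by the caller, and none of it is smartcoin code.

```shell
#!/bin/sh
# Sketch of the "three positive checks in a row" idea.
# Usage: confirm_lockup <checks> <wait_seconds> <detect_cmd> [args...]
# detect_cmd is any command that exits 0 when it thinks a GPU is locked.
# Returns 0 only if detect_cmd reports a lockup <checks> times in a row.
confirm_lockup() {
    checks=$1
    wait_seconds=$2
    shift 2
    positives=0
    while [ "$positives" -lt "$checks" ]; do
        # Any negative check aborts; the caller simply starts over later.
        "$@" || return 1
        positives=$((positives + 1))
        # Wait between checks, but not after the final one.
        if [ "$positives" -lt "$checks" ]; then
            sleep "$wait_seconds"
        fi
    done
    return 0
}

# Example (my_detect is hypothetical, reboot left commented out on purpose):
# confirm_lockup 3 300 my_detect && sudo reboot
```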


Testing just once: there will be a higher chance of false positives or negatives, even if your formula is super smart.
The only downside is that if the GPU really is locked, it will stay locked for 15 minutes before the PC reboots.

(5 min is just an example, but in case of failover, GPUs down, higher rejection, etc., it takes 2-3 min for the system to stabilize, so failover will not interfere with lockup detection.)

Or now maybe an even better, easier version:
 if failover starts, turn off lockup detection for 5 min









full member
Activity: 238
Merit: 100
I can't reproduce anything locally...
Can you give me more information on what's happening?  You mention both the failover system and the lockup detection, but I'm not sure which one you are talking about.  They are 2 separate, independent systems (i.e. the lockup system runs no matter which profile is selected, and the failover system only runs when the failover profile is active).  Are you sure you don't have a locked up GPU?  I couldn't imagine a miner running with absolutely no change in its output for 5 minutes straight.
Also, the new failover will trip a fail condition if stales go over 10% - do you know if this may be the case also?
Is there anything in the logs that may also give a clue?
Perhaps watching the miner output directly (screen -r miner) will give a clue as well - if the miner output sits there with absolutely no change, that pretty much means a locked GPU
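
To check by hand whether the 10% stale threshold mentioned above could be involved, something like the following works. It assumes the percentage is rejected shares divided by all submitted shares; the thread does not spell out the exact formula smartcoin uses, and the function names are made up.

```shell
#!/bin/sh
# Illustrative only. Assumption: percent = rejected / (accepted + rejected).
# $1 = accepted shares, $2 = rejected shares; prints the percentage with
# two decimals.
rejected_percent() {
    awk -v a="$1" -v r="$2" 'BEGIN {
        total = a + r
        if (total == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * r / total
    }'
}

# Returns 0 (true) when the rejection percentage is above the limit.
# $3 = limit in percent, defaulting to the 10% mentioned above.
over_stale_limit() {
    awk -v p="$(rejected_percent "$1" "$2")" -v limit="${3:-10}" \
        'BEGIN { exit !(p > limit) }'
}
```

With 249 accepted and 1 rejected, `rejected_percent 249 1` prints 0.40, well under the limit.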
full member
Activity: 238
Merit: 100
Yeah, at some point I will look into optimization, but not until all the planned features are in (optimization is almost always the last step in the development cycle). Unfortunately, it's not a clean and easy thing to do in bash (it'll get messy with a bunch of global variables), but either way, we aren't even coming close to it becoming a bottleneck at the moment. After the latest changes are tested and stable, and I get multi-machine support in, I'll start looking into making optimizations.
newbie
Activity: 56
Merit: 0
Update r455(experimental) available:
- grep errors are suppressed
- Fixed a small bug that resets the failover counters on smartcoin restart



The sensitivity is still too high for me.
I have to mine on a manual profile; with failover, the miners get killed every 2-3 hours => losing more with failover

It looks like everybody has a different threshold. Can we set some number in settings or turn it off?
Or use a different algorithm?
For example:
if it detects a locked GPU, check again in 5 minutes and again in another 5 minutes;
if all 3 checks are positive, reboot the PC.

An option to choose between just killing smartcoin and rebooting the PC would also be nice,
because just killing smartcoin only stops the GPUs that are not locked from working,
while a reboot will unlock the locked GPU and autostart mining again.

full member
Activity: 238
Merit: 100
I am running r455(experimental), updated through the internal update tool.
After running smartcoin for several hours I receive this error:
Quote
Maximum number of clients reached
I am not sure what happened. This happens on two of my rigs at different times.

That's an Xlib error. X only allows for a maximum of 255 clients.
It can be caused by just about anything, from custom temperature monitoring scripts to certain browsers and even the GNOME screensaver.  Do a Google search for "Maximum number of clients reached"; it's pretty common and there is a ton of information on how to troubleshoot the problem.
Personally, I would start with the simple things, such as disabling the screensaver, then do a reboot and see what happens from there.

Let me know what you find out!
full member
Activity: 168
Merit: 100
I'll have a steak sandwich and a... steak sandwich
I checked the queries being executed by RunSQL and in my case it's a total of 35 queries per refresh (failover profile with 3 workers), 14 of which are queries fetching donation_start and donation_time from the settings table. Like I mentioned earlier, I have very little experience with shell scripting, but this sounds like it should be easy to optimize. It appears DonationActive is called multiple times per refresh so couldn't you just call it once at the start of smartcoin_ops.sh and store the return value in a global variable?
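
A rough sketch of the caching idea, to show what "call it once and store the result" could look like in shell. The DonationActive below is only a stand-in that counts its calls, and it is assumed to report its answer through its exit status; the real function in smartcoin may behave differently. The other names are invented.

```shell
#!/bin/sh
# Stand-in for the real DonationActive (assumption: it reports its answer via
# exit status). This one only counts how often it is called.
donation_queries=0
DonationActive() {
    donation_queries=$((donation_queries + 1))
    return 1        # placeholder answer: donation not active
}

donation_state=""

# Call once at the start of each refresh to forget the previous answer.
reset_donation_cache() {
    donation_state=""
}

# Sets the global donation_state to "yes" or "no", calling DonationActive
# only if no answer has been cached since the last reset.
get_donation_state() {
    if [ -z "$donation_state" ]; then
        if DonationActive; then
            donation_state=yes
        else
            donation_state=no
        fi
    fi
}
```

One caveat: the cache lives in a global variable, so it only helps when get_donation_state is called directly. Inside `$(...)` or a pipeline it runs in a subshell and the cached value is lost.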

You could also JOIN the device table into the query at the end of GenCurrentProfile and include device.name in the result. That would mean you could get rid of the "SELECT name FROM device WHERE pk_device=$device;" query in ShowStatus. That's one less query per worker per refresh.

I'm sure there are other optimizations that could be done. This is just based on a 20 minute browse of the source code.

Regardless, the error checking you added to RunSQL seems sane. I would probably add a short delay (0.1 seconds perhaps) in the loop so that if the database is locked, it has time to be released before the next attempt is made.
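
Something like this is what I mean by a delay in the retry loop. It is a generic helper and not the actual RunSQL: the command to retry is passed in, so no database client or flags are assumed, and the function name is made up.

```shell
#!/bin/sh
# Generic retry-with-delay helper (not smartcoin's RunSQL).
# Usage: retry_with_delay <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_with_delay() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max_attempts" ] && return 1
        attempt=$((attempt + 1))
        # Give a locked database time to be released before trying again.
        # Fractional values such as 0.1 need GNU or BusyBox sleep.
        sleep "$delay"
    done
}

# Example (run_query is hypothetical):
# retry_with_delay 10 0.1 run_query "SELECT name FROM device WHERE pk_device=1;"
```

Capping the number of attempts keeps a permanently locked database from hanging the refresh loop forever.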

Just my 2 cents.

PS. I hope you had a great time at Cedar Point. I wish there was something of that scale here in Sweden :)
Did you get a chance to look at this? Doing just the two optimizations I mentioned above would in my case almost halve the number of queries being executed every refresh.
full member
Activity: 238
Merit: 100
at last not a newbie ;)

JD, i must confirm, that autodonation is functioning like it should

gj = good job

Thanks for the confirmation!
full member
Activity: 238
Merit: 100
I can't recreate the issue here locally - I'm a bit stumped..  Is there anything in the ~/.smartcoin/smartcoin.log files that may give a hint of the problem?
I do have one other puzzling report (in my PM) where the permissions of /var/run/screen get messed up and they have to chmod 777 /var/run/screen to work properly (though this makes entirely no sense at all to me) - perhaps this is the same issue?  (though I'm not certain it is, as your problem is only automatic profile related)


Would you be able to allow me temporary ssh access to see if I can figure it out? (of course, only if your machine is secure and there is no wallet.dat etc. lying around). If so, shoot me a PM and I'll take a peek.



EDIT:
https://bugs.launchpad.net/ubuntu/+source/screen/+bug/574773
perhaps running this will clear things up:
Code:
sudo /etc/init.d/screen-cleanup start


newbie
Activity: 41
Merit: 0
at last not a newbie ;)

JD, i must confirm, that autodonation is functioning like it should

gj = good job
member
Activity: 79
Merit: 10
Jon, thanks for the quick reply!
Here are the answers to your questions:
  • I only have one miner configured: phoenix. It is configured as the default miner, and the path and command seem fine.
  • There are no files in /tmp/ named Smartcoin* when the problem occurs. Nothing gets created there when "automatic" profile is selected, however files get created there when I use any of my custom profiles
  • When I do a "screen -ls" I do not see any screens for miners when the automatic profile is selected. However, when I choose one of my custom profiles, I do see the 'miner' screen listed
I'm not sure what's up, but the automatic profile seems broken.
I've confirmed these results on two separate linuxcoin 0.2b1 installs.

At first I thought smartcoin was broken entirely, but when I created my own profiles it worked like a charm :)

Keep up the great work,
full member
Activity: 238
Merit: 100
Update r455(experimental) available:
- grep errors are suppressed
- Fixed a small bug that resets the failover counters on smartcoin restart
