
Topic: [DEVICE] SimpleRigResetter - Auto reboot crashed rigs - Extension Boards !!! - page 19. (Read 53231 times)

legendary
Activity: 2660
Merit: 1096
Simplemining.net Admin
Hi T - what is the ETA on the new firmware with the updated features?

The 250sec max WD delay is not working well on my R cards but no issues for the RX cards.

I distributed the SRRs according to GPU types, but one SRR unit is managing RX and R cards - I have disabled that SRR for now because it's resetting the R cards at random. Hope I can enable it soon so that the whole farm can achieve 100% uptime.

Thanks

Hi. It wasn't a problem with a long startup.
In the end it was my mistake: the package needed for sending keepalive packets was installed only in the RX version.
SimpleMiningOS v1075 fixes that!
legendary
Activity: 2660
Merit: 1096
Simplemining.net Admin

T - can you consider the following features:

1. A counter/countdown displayed for circles/rigs that have begun the reset/power-cycle process. This way, on the dashboard we can see which process has begun and which has taken place. This is key for troubleshooting. Without it, I don't even know if SRR is working or running, and just assume it is. Over at the smOS web dashboard, it would have been nice to have the power/reset control available, but as you explained, smOS doesn't communicate with SRR.

2. For me, when SRR needs to be activated on a rig, it means the miner has failed to restart on its own, so the problem may lie outside the miner, e.g. memory leaks/corruption, network issues etc. Instead of a reset, I would almost certainly prefer a power cycle, meaning power down and power up x minutes later.

3. When running Claymore miners, I also have his monitoring utility running in the background. This way I keep tabs on what's going on at the miner app level. Claymore can also run a bat file when the miner hangs due to OpenCL errors - not sure how you plan to make this utility somewhat compatible with SRR, for Claymore miners only. Maybe this is not necessary, but his utility has a lot of key info, like how many times the miner was reset, how many times the pool failed over etc.

4. I have re-arranged my SRR units and slot numbering again to make sure I didn't miss anything, so that I can correctly activate all SRRs with a wd delay value above 0; but I think I really need the "2500 secs" watchdog delay because my non-RX cards are STILL being triggered to reset unnecessarily (or too early). I can't figure out why this doesn't affect my RX480 rigs -- they respond well to SRR even at 250 secs. I can only conclude this may be down to how my network environment is laid out. So looking forward to v2 of SRR and the enhancements.


1. I'll see what I can do, it's a good feature.
As for controlling SRR, I will be getting into it. I have 2 concepts for doing that and I will implement one or even both of them:
a) sending a command, for example to shut down rig number 1, through another working rig.
b) it is probably possible to make SRR communicate with my SimpleMiningOS dashboard. I will look into this matter and see if I can implement it, but I don't know the ETA.

2. In my case, about 80% of problems are solved by a fast restart, and only about 20% need a cold reset (shutdown, wait 5 minutes, power on).
I will check if I can implement some kind of feature to preset this to your own needs.

3. If a GPU error occurs AND you have set -w1 -r1 in the command, then it will run reboot.sh, which contains special commands that will FORCE REBOOT your rig, and this works pretty well in most cases.
In some cases this will freeze the rig, and then SRR kicks in because it is not getting keepalive messages.
In other words, software on the rig side can solve most problems and does that, but if it fails and the rig doesn't send a keepalive message within the specified number of seconds, then SRR starts rebooting as the second line of defence.
I think we don't need to make it compatible, as each of those features works on a different level (first a software reboot, and if that doesn't help then an SRR hardware fast reboot, and if that fails 4 times then a hardware cold reboot).
Isn't that the best idea?
I will try to make an SRR restart counter, but I will also add a restart counter to the SMOS dashboard which will look at rig uptime: if the next report shows less uptime than the last one, it means a reboot.
I was thinking about it and I will do this.

4. 2500 seconds will be in the next release. Already working on that.
Also, I might know why the R OS takes longer to boot.
It boots the graphical environment and only THEN starts running the SRR agent. That takes a lot more time.
In the RX OS there is no graphical interface, so the boot process is much faster.
I think I can speed up this agent script under the R OS so you won't have this issue. Thanks for reporting.
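The uptime-based restart counter described in point 3 can be sketched in a few lines of bash. This is only an illustration of the idea, not the dashboard's actual code: the function name count_reboots and the sample uptime values are made up, and it assumes each report carries the rig's uptime in seconds.

```shell
#!/bin/bash
# Count reboots from a series of uptime reports (in seconds), oldest first.
# A report showing less uptime than the previous one implies a reboot in between.
# count_reboots is a hypothetical helper, not part of SMOS or SRR.
count_reboots() {
    local prev="" reboots=0 up
    for up in "$@"; do
        if [ -n "$prev" ] && [ "$up" -lt "$prev" ]; then
            reboots=$(( reboots + 1 ))
        fi
        prev="$up"
    done
    echo "$reboots"
}

# Sample data: uptime drops twice (720 -> 60 and 360 -> 30)
count_reboots 120 420 720 60 360 30
```

One limitation of this approach: if a rig reboots and then stays up longer than its previous uptime before the next report arrives, the drop is never observed and the reboot goes uncounted.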
legendary
Activity: 2660
Merit: 1096
Simplemining.net Admin
A new version of SRR Tool + Firmware will be out very soon.

Things that will be changed:
- instead of RED/GREEN circles there will be a LIST, also RED/GREEN, with the possibility to write custom names next to each rig. This info will be saved in the SRR Tool directory (it won't be on the device itself).
- max watchdog delay increased from 250 seconds to 2500 seconds (some Windows rigs on HDD need more time to boot up and 250 isn't enough for them)
- access to SRR via website (more and more functions will come with every release) - for now the most crucial ones like on/off/reboot/long reboot and status view.

Logic of auto rebooting rigs:
Currently: after the "watchdog delay" times out, SRR will do a fast reboot, wait the number of seconds specified in "watchdog delay" again, and so on...
New logic: the first 4 times it will fast reset the rig; then, if it still doesn't receive any keepalive packets, it will cold reset it an unlimited number of times (the name also changed from "long reset" to "cold reset" as it is more obvious).

The question is:
What is the most desired behaviour for auto rebooting rigs?
Is 4 fast resets and then infinite cold resets a good choice?
Of course, if the rig is alive again after the 2nd reset, then on the next failure it will start from the beginning: 4 fast resets and then infinite cold resets.
Or some other configuration? If you have another idea, tell me why you think so.
Or confirm that you think it's OK.

Also, if you have other requests and ideas that I forgot to write down, please tell me or remind me.
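To make the proposed escalation easier to discuss, here is a small bash model of it. It is not the firmware, just a sketch of the rule as stated above: FAST_LIMIT and reset_type are hypothetical names, and the count of consecutive timeouts is assumed to start again from zero as soon as a keepalive packet arrives.

```shell
#!/bin/bash
# Model of the proposed escalation: pick the reset type for the Nth
# consecutive watchdog timeout (counted from 1) without a keepalive.
# FAST_LIMIT=4 follows the proposal; both names are hypothetical.
FAST_LIMIT=4

reset_type() {
    local timeouts="$1"
    if [ "$timeouts" -le "$FAST_LIMIT" ]; then
        echo fast
    else
        echo cold
    fi
}

# Simulate a rig that stays dead for six watchdog timeouts in a row
for n in 1 2 3 4 5 6; do
    echo "timeout $n: $(reset_type "$n") reset"
done
```

Changing FAST_LIMIT is all it takes to try out a different answer to the question, for example 1 for people who would rather go to a cold reset almost immediately.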


legendary
Activity: 2660
Merit: 1096
Simplemining.net Admin
After full 24++ hours -- my farm is all green with SRR!

I know of 1 or 2 problematic rigs that would turn red after a day or two -- and I could see that SRR dealt with it!

I have 33 rigs and 30 are connected to 4 units of SRR in a single network environment.

My settings :
0 seconds / disabled for watchdog max delay
3mins long reset.
Save and write the above info to SRR.
Set the rig details in smOS SRR menu.

After further clarifications from Tytanick -- I now understand how this gadget works.

Very simple solution made into a product. Easier said than done though.

Glad that Tytanick made the effort to make this product a reality.

More improvements to come - I hope the features discussed in the thread will be in next release


Hi.
You can check which rigs rebooted by going to the dashboard and hovering over "ONLINE" - you will see the uptime.

Also, if you have 0 seconds max delay in the resetter then it's disabled.
So if you want to use it, you need to write something like 250 there. You know that, right?
legendary
Activity: 1834
Merit: 1080
---- winter*juvia -----
Any chance of an API or letting us know some "hack" to control the rigs through some script we have running (other than killing SRR-Agent-Linux-v2.sh on each machine)?

I can quickly see where I will be monitoring my rigs at a pool and will need to reboot rigs based on what a script sees at the pool.  The GUI is nice for lots of people, but some way to talk to the SRR through code takes it to the next level for me!  Any chance of an API or simply some hints at how to hack talking to the SRR?  For example, some way to talk to the SRR and tell it to reboot machine #5 would be ideal.

Also, what is the login/password for the SRR?  If I can get on the SRR I am sure I could hack something together.

Overall, it looks to be a great product!

Yes, sure.
You can hack it. There is already an API, which SRR Tool and the local agent scripts use for communicating.
OK, so here is the reset script:

Code:
#!/bin/bash

# Required packages: socat, xxd

# Inputs
serial="000002"   # appended to 485053 below to form the SRR MAC address
port="3"          # rig port on the SRR, counted from 1

# Trim stray whitespace, then convert the port to a zero-based two-digit hex value
serial=$(echo "$serial" | xargs)
port=$(echo "$port" | xargs)
port=$(printf %02X $(( ${port} - 1 )))

firstByte="FF"
byteCount="0008"
action="53"       # 0x53 = fast reset
mac="485053$serial"

# Checksum: sum of every byte after the first one, modulo 0x100
checksum=$(printf %02X $(( (0x${byteCount:0:2} + 0x${byteCount:2:2} + 0x$action + 0x${mac:0:2} + 0x${mac:2:2} + 0x${mac:4:2} + 0x${mac:6:2} + 0x${mac:8:2} + 0x${mac:10:2} + 0x$port) % 0x100 )))
packet="$firstByte$byteCount$action$mac$port$checksum"
echo "Sending packet with the following contents: $packet"

# Convert the hex string to raw bytes and broadcast it over UDP port 1051
echo -n "$packet" | xxd -r -p | socat - UDP-DATAGRAM:255.255.255.255:1051,broadcast


If you want to turn ON a rig, change action to 51
If you want to turn OFF a rig, change action to 52

Here is a list you can easily use with this script:
0x51 - turn on rig
0x52 - turn off rig
0x53 - fast reset rig
0x55 - keepalive
0x58 - long reset

The rest, like reading and writing settings, is a little bit more difficult - for that it's better to use the Windows Tool.
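For anyone who wants to call this from a monitoring script, the same packet layout can be wrapped in a function that takes the action code, serial and port as arguments. This is a sketch built only from the script and action list above: srr_packet and srr_send are hypothetical names, and nothing here has been checked against the actual firmware.

```shell
#!/bin/bash
# Build an SRR command packet as a hex string.
# Usage: srr_packet <action-hex> <serial> <port>
# Layout is copied from the reset script above; srr_packet is a made-up name.
srr_packet() {
    local action="$1" serial="$2" port_hex
    port_hex=$(printf %02X $(( $3 - 1 )))   # ports are sent zero-based
    local body="0008${action}485053${serial}${port_hex}"
    local sum=0 i
    # Checksum: add up every byte of the body and keep the low 8 bits
    for (( i = 0; i < ${#body}; i += 2 )); do
        sum=$(( sum + 0x${body:i:2} ))
    done
    printf 'FF%s%02X\n' "$body" $(( sum % 256 ))
}

# Hypothetical wrapper that broadcasts the packet (needs xxd and socat)
srr_send() {
    srr_packet "$@" | tr -d '\n' | xxd -r -p |
        socat - UDP-DATAGRAM:255.255.255.255:1051,broadcast
}

# Fast reset (0x53) for port 3 on unit 000002; prints FF000853485053000002024A
srr_packet 53 000002 3
```

With this in place, something like `srr_send 58 000002 5` would request a long reset of port 5. Printing the packet with srr_packet first is a cheap way to check the bytes before anything is broadcast on the network.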


Beautiful, this is exactly what I needed.  With 51, 52 & 55 you can do just about everything so that is all I need.  Thank you very much Tytanick!

Hey dance191 -- please do share your app once completed. Thanks!
legendary
Activity: 2660
Merit: 1096
Simplemining.net Admin
Tytanick....

On the smOS dashboard, is it possible to show SRR activity?

For example:

When SRR is triggered, maybe the row for that rig could have some indicator/flashing to show that SRR is in progress.

I have my dashboard displayed 24x7 for monitoring purposes and it would be good to see that SRR is working.

Thanks

That would be difficult to do in SM OS, as SRR isn't communicating with it.
To test whether it's working, it's best to unplug rig 1 and see if it reboots after the specified time, then rig 2 etc.
Also, there would need to be a hard hang of the rig, because if only the GPU hangs and the system sees that, it triggers a software Linux reboot.
Anyway, try unplugging the RJ45 from the rig you want to check and see if it resets.

Can the SRR turn on a rig from a cold start automatically?
Right now, no.
I'm still thinking about what the best option is.
For now it's normal resetting in a loop.
hero member
Activity: 672
Merit: 500
Been playing around with the SRR, trying to put together a good review.

One question: will the SRR turn on a rig from a complete power-down state?
If so, how? I can't make it turn on a rig other than by doing it manually from the SRR config tool.
legendary
Activity: 2660
Merit: 1096
Simplemining.net Admin
as for the max 250 time  you have a problem
 
but at min  0 time no problems

test 240 see if problems like that of 250 show up

test 90 see if problems show up

test 5 see if problems show up
I spoke with my SRR developer; the next firmware release will allow a max of 2500 seconds for the watchdog.
We have a limit of 255 for the value, so we need to multiply it by 10.
So the setting will be 10, 20, 30, 40 seconds ... up to 2500 seconds.
When the time comes and I release the first firmware upgrade, I will write special instructions with a changelog and what you need to know before and after the upgrade.
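The x10 scaling can be illustrated with a small bash sketch. It is a guess at how the conversion might look, not the firmware's code: wd_value and wd_seconds are hypothetical names, rounding up to the next 10-second step is an assumption, and 0 is taken to mean "disabled" as mentioned elsewhere in the thread.

```shell
#!/bin/bash
# Convert a desired watchdog delay in seconds into the stored value,
# assuming the firmware multiplies the stored value by 10 (0 = disabled).
# Both function names are hypothetical.
wd_value() {
    local seconds="$1"
    local value=$(( (seconds + 9) / 10 ))   # round up to the next 10 s step
    if [ "$value" -gt 250 ]; then
        value=250                           # cap at the announced 2500 s maximum
    fi
    echo "$value"
}

# Convert a stored value back into seconds
wd_seconds() {
    echo $(( $1 * 10 ))
}

wd_value 600      # prints 60
wd_seconds 250    # prints 2500
```

Rounding up rather than down means a requested delay is never shortened, which matters for rigs that are already being reset too early.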