
Topic: [OS] nvOC easy-to-use Linux Nvidia Mining - page 30. (Read 418244 times)

newbie
Activity: 44
Merit: 0
m1@m1-desktop:~$ sudo less /var/log/dmesg
(Nothing has been logged yet.)

That is all I get. I am using 19.2.
member
Activity: 224
Merit: 13
That is a really nice (and big) rig. Did you find anything in syslog (as per my previous post)?
newbie
Activity: 44
Merit: 0
The 850W PSU only runs the motherboard and one card and its riser. I have three PSUs per rig: two 1600W (6 cards and their risers each) and the 850W as described. I am nowhere near the amount of power the power supplies are capable of, so I should be good there.

This is one of the rigs - https://imgur.com/a/WRfTksJ
member
Activity: 224
Merit: 13
That is a good point, Thay. I was very surprised to learn how much some GPUs pull through the PCI slot and also that it varies by GPU and manufacturer. If we assume 50 W per card (which may even be low) and another 80 W for the mobo, 12 cards give us 12 x 50 W + 80 W = 680 W. That is what I would consider to be the max usable for an 850 W PSU (80% of rated output).
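For anyone who wants to redo that arithmetic with their own numbers, here is a minimal sketch. The per-card and motherboard wattages are the estimates from the post, not measurements, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Rough PSU budget check. slot_watts and board_watts are estimates,
# not measured values; both function names are hypothetical.

# Total estimated load: one slot/riser draw per card plus the motherboard.
estimated_load() {
  local cards=$1 slot_watts=$2 board_watts=$3
  echo $(( cards * slot_watts + board_watts ))
}

# Usable capacity at a given percentage of the rated output.
usable_capacity() {
  local rated_watts=$1 percent=$2
  echo $(( rated_watts * percent / 100 ))
}

load=$(estimated_load 12 50 80)      # 12 cards, 50 W each, 80 W for the board
limit=$(usable_capacity 850 80)      # 850 W PSU at 80% of rated output
echo "estimated load: ${load} W, usable capacity: ${limit} W"
if (( load >= limit )); then
  echo "no headroom left on this PSU"
fi
```

With the figures above, the load and the limit both come out at 680 W, so there is no margin left if the real slot draw is higher than the estimate.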

EDIT: IMO, trying to get a 12 GPU rig to be rock stable is an exercise in frustration. I have read dozens of messages on this thread (and others) of folks having all kinds of oddball problems with rigs that use more than 8 GPUs. This includes not only nvOC, but also Windows, SMOS, and HiveOS. It can be done, but it comes with an inherent lack of stability. If I were you, I would split it up into 2 rigs and get on with mining.
newbie
Activity: 64
Merit: 0
Not sure this will be relevant, but your system of 12 cards using one 850 W PSU for the risers is how I killed two PSUs. I only had 7 cards running and it was too much for the SATA channels to handle. One PSU will be good for 6 SATA devices, so I think you are browning out the main PSU. Once I dropped back on card count, no more problems. In the end, I gave up on the server PSUs and went with dual 1000 W PSUs.

thay
member
Activity: 224
Merit: 13
Check the logs (in /var/log). What does syslog say?
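A rough sketch of what to look for, assuming an Ubuntu-style layout where kernel messages end up in /var/log/syslog. The NVIDIA driver normally tags its errors with "NVRM: Xid", so filtering on that (plus the "GPU is lost" text from the watchdog output) narrows things down. The function name is made up for this example.

```shell
#!/usr/bin/env bash
# Print log lines that usually accompany a GPU dropping out.
# Reads the file given as $1, or /var/log/syslog if none is given
# (the default path is an assumption about the image).
gpu_log_errors() {
  local logfile=${1:-/var/log/syslog}
  grep -iE 'NVRM: Xid|fallen off the bus|GPU is lost' "$logfile"
}

# Typical use after a freeze and reboot:
#   gpu_log_errors /var/log/syslog | tail -n 20
```

If nothing matches, that does not rule out a hardware problem; a hard freeze can happen before anything is written to disk.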
newbie
Activity: 44
Merit: 0
I have two 1600W server PSUs for the cards (6 on each) and one 850W EVGA ATX with one card / riser.

All using 6-8pin, risers are split once (1 cable per 2 risers)

Any chance it's a memory issue? I am running 4GB, my memory says 3.2GB / 4GB in use (85%).

Since disabling watchdog, it shouldn't be a server issue. The miner would just keep trying to reconnect.
newbie
Activity: 44
Merit: 0
Well... Disabling watchdog didn't work. Still crashed.
newbie
Activity: 4
Merit: 0
Hi,

I noticed that there are a few double periods in the 0miner file:

Line 883 onwards:
if [ $COIN == "PASC" ]
then
  HCD='/home/m1/pasc/sgminer'
  ADDR="$PASC_ADDRESS..$PASC_WORKER"


Is this right? Or should I reduce it to a single period?






If all of the double periods follow that example where they are between the address and worker name, then yes. There should only be a single period between them.

Thanks for the clarification. Any way to make sure this gets included in the next update? I was wondering why my worker didn't show up.
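For reference, a small sketch of the single-period form and a way to find any other double periods. The variable names mirror the quoted 0miner snippet, the values are placeholders, and both helper functions are hypothetical.

```shell
#!/usr/bin/env bash
# Join a wallet address and a worker name with exactly one period.
join_worker() {
  local address=$1 worker=$2
  echo "${address}.${worker}"
}

PASC_ADDRESS="example_address"   # placeholder value
PASC_WORKER="rig1"               # placeholder value
ADDR=$(join_worker "$PASC_ADDRESS" "$PASC_WORKER")
echo "$ADDR"

# List every line of a file that still puts a double period
# directly in front of a variable, e.g. ..$PASC_WORKER
find_double_periods() {
  grep -n '\.\.\$' "$1"
}

# Typical use (the path is an assumption):
#   find_double_periods /home/m1/0miner
```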
newbie
Activity: 44
Merit: 0
So it looks like the server is temporarily disconnecting and crashing the miners. This is going to happen, so I am going to disable watchdog and just let it reconnect on its own.

I was on MPH and switched to miningspeed. Same issue. Maybe my rigs just don't like to be restarted and like to work all the time. lol




Latest Errors -

LOG FILE: (Showing the last 10 recorded entries)
CUDA: Device: 8 Thread exited with code: 46
CUDA: Device: 7 Thread exited with code: 46
CUDA: Device: 11 Thread exited with code: 46
CUDA: Device: 12 User selected solver: 0
CUDA: Device: 3 User selected solver: 0
CUDA: Device: 4 User selected solver: 0
CUDA: Device: 12 Thread exited with code: 46
CUDA: Device: 3 Thread exited with code: 46
CUDA: Device: 4 Thread exited with code: 46
CRITICAL: Tue May  1 15:17:20 MST 2018 - GPU Utilization is too low: restarting 3main...



LOG FILE: (Showing the last 10 recorded entries)
+-------------------------------------------------+
INFO: Server: mining.miningspeed.com:3062
INFO: Solver Auto.
INFO: Devices: All.
INFO: Temperature limit: 90
INFO: Api: Disabled
---------------------------------------------------
ERROR: Cannot connect to the server. 1
CRITICAL: Tue May  1 04:00:44 MST 2018 - GPU Utilization is too low: restarting 3main...
WARNING: Tue May  1 16:03:20 MST 2018 - Internet is down, checking...




full member
Activity: 686
Merit: 140
Linux FOREVER! Resistance is futile!!!
I have two 20-inch fans on a thermostat that starts them when the room temperature goes above 15C and stops them at 10C, and one of them was causing the problem at startup.
Anyway, I suggest you try out the dstm zm miner; it has been much better in my experience.
What are your PSUs, how much do you draw from them, and how are the risers powered?

If both rigs restart 3main at the same time when mining some coins and all is good with other coins, try changing the pool.
Recently I have been getting lots of pool disconnects from MPH, and all rigs restart 3main at almost the same time.
member
Activity: 224
Merit: 13
The watchdog and the temp control are 2 different scripts so even if you disable the watchdog, the temp control will still do its thing. If you want to expand the time between checks for the watchdog, change the interval of the main loop. At the bottom of the script, you will see this line:
Code:
sleep 10

Change this to a larger value like 15 or 20. NOTE that increasing this value on a rig with a lot of GPUs will dramatically increase the amount of time before the watchdog bounces the miner in the event that a problem is detected on a single GPU.
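Another way to give short-lived errors time to clear, without stretching the whole loop, is to act only after several bad checks in a row. The following is a hypothetical sketch of that idea, not code from 5watchdog; THRESHOLD, MAX_STRIKES and restart_due are all made-up names and values.

```shell
#!/usr/bin/env bash
THRESHOLD=70     # hypothetical: utilization (%) below this counts as a bad check
MAX_STRIKES=3    # hypothetical: consecutive bad checks before acting
strikes=0

# Feed one utilization reading; returns 0 (true) when a restart is due.
restart_due() {
  local utilization=$1
  if (( utilization < THRESHOLD )); then
    strikes=$(( strikes + 1 ))
  else
    strikes=0    # any healthy reading resets the count
  fi
  (( strikes >= MAX_STRIKES ))
}

# Example: a single dip recovers, three bad readings in a row do not.
for reading in 100 59 100 0 0 0; do
  if restart_due "$reading"; then
    echo "reading ${reading}: restart due"
  else
    echo "reading ${reading}: ok (strikes=${strikes})"
  fi
done
```

With a 10 second loop, three strikes would delay the restart by roughly 30 seconds while still ignoring a one-off dip.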
newbie
Activity: 44
Merit: 0
Hmmmm interesting about the electrical issue. I have two 8" hyperfans used as exhaust fans that run 100% 24/7. But could be a possibility.

Last night I recompiled the miners again, set the max fan limit to 90 from another one of your posts and set the power restore to 80 and changed EWBF to 3_3 from 3_4.

I will try this for a day and see if anything happens. It's just so strange that on Hush, it worked completely fine.

I had this kinda happen before and it was the mining server "disconnecting". Switched pools and all was good.

Any way to make watchdog wait an extended period of time for errors to clear themselves before trying to restart 3main?

Also, if watchdog is disabled, do the miners themselves keep a max temp limit? I notice that EWBF reports a max temp of 90 when it starts. While I don't want temps that high, if it keeps the miner going and safe then I will consider it.
full member
Activity: 686
Merit: 140
Linux FOREVER! Resistance is futile!!!
If both rigs crash and freeze at the same time, it can be an electrical problem.
I had almost the same issue a while back: some of my rigs were all crashing at the same time.
I found out that when one of the room ventilation fans turned on, it put high-frequency noise on the electrical supply, and 3-4 rigs would get the lost GPU error at the same time and reboot.
After a month of pulling my hair out trying to find the problem, I changed that fan and the problem was solved.


Open 5watchdog
Change:
Code:
        echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}
To:
Code:
        echo "$(date) - Lost GPU $GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

That way you can see the number of the GPU that is lost and check whether it is always the same GPU.
If it is always the same GPU, remove it from the rig and check; it may be a faulty GPU, riser or power cable.
If the problem jumps to another GPU after you remove it, then it could be a power problem.
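To see whether it is always the same card, the "GPU is lost" lines can be counted per PCI bus ID. This is only a sketch: it assumes the message format shown in the watchdog log earlier in the thread, and lost_gpu_counts is a made-up helper name.

```shell
#!/usr/bin/env bash
# Count "GPU is lost" messages per PCI bus ID in a log file.
lost_gpu_counts() {
  grep -oE 'GPU [0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9]: GPU is lost' "$1" \
    | awk '{ print $2 }' \
    | sed 's/:$//' \
    | sort | uniq -c | sort -rn
}

# Typical use:
#   lost_gpu_counts /home/m1/nvoc_logs/watchdog-screenlog.0
```

One bus ID with a high count points at that card, its riser or its cable; counts spread across many bus IDs fit the power explanation better.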
newbie
Activity: 44
Merit: 0
Error on rig 2 - Two different rigs, crashing within a minute of each other. Tell me that isn't weird.

m1@m1-desktop:~$ tail -f  /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011

LOG FILE: (Showing the last 10 recorded entries)
| 12  |    120W     |  3.42 Sol/W  |
+-----+-------------+--------------+
INFO 09:34:50: GPU3 Accepted share 186ms [A:454, R:1]
INFO 09:34:51: GPU7 Accepted share 187ms [A:477, R:1]
CRITICAL: Sun Apr 29 09:35:17 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:35:29 MST 2018 - Lost GPU so restarting system. Found GPU's:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost.  Reboot the system to recover this GPU


Mon Apr 30 22:35:30 MST 2018 - reboot in 10 seconds
newbie
Activity: 44
Merit: 0
Hey guys,

Anyone else having an issue mining ETH (and ZCL now) and the rig crashing / freezing?

Seems that something happens on the ETH miner, watchdog thinks it lost a GPU and tries to restart 3main. But on both of my rigs, when this happens the system freezes and I have to manually restart.

Errors I have seen -

GPU Utilization is low: restarting 3main...

Thread exited with code: 29

Is there a way to disable 3main restarting without disabling watchdog? The miner would start mining again once connected, but this freezing is my issue.

I'm mining hush temporarily with zero issues. Was on ZCL before the fork without issues as well, same settings and whatnot. And was mining ETH for about 3 weeks with no issues. Now it won't stop crashing after being up 12-24 hours each time. It's driving me nuts.

Two separate rigs, both 13 card (1070 and 1070 Ti's)

The only time I have ever had a rig freeze was ultimately due to OC being too high. Lowering the GPU OC by just a bit (5 or 10) fixed my issue.

I am not sure I totally understand your question, but if you want to disable the 3main restart, the only way to do this is to not run the watchdog at all. To do that, change this in 1bash:

MINER_WATCHDOG="YES"

to

MINER_WATCHDOG="NO"
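If you would rather script that change than edit by hand, something like this should work with GNU sed. It is a sketch that assumes the setting sits at the start of a line exactly as shown above; set_watchdog is a made-up helper, and the 1bash path in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# Set MINER_WATCHDOG in a 1bash-style settings file, keeping a .bak copy.
# Assumes the line looks like MINER_WATCHDOG="YES" with no leading spaces.
set_watchdog() {
  local file=$1 value=$2
  sed -i.bak -E "s/^(MINER_WATCHDOG=)\"[A-Z]+\"/\1\"${value}\"/" "$file"
}

# Typical use (path is an assumption):
#   set_watchdog /home/m1/1bash NO
```

The .bak copy makes it easy to put the original setting back if the rig behaves worse without the watchdog.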



Shouldn't be an overclock issue. Running Hush right now for two days straight no issue.

Running 50 core and 200 mem at 80% TDP.

Def do not want to turn off watchdog. I've run with these settings fine for multiple coins and long periods of time. ETH was running for over a month with zero issues, and now all of a sudden, with the same settings and the same miner, different rigs crash every 12ish hours within minutes of each other. Even switched servers and same issue. Something else is at play here.

Like I mentioned, on my personal desktop that also mines, I see ethminer restart randomly at exactly the time the rigs go down. It doesn't make much sense.

I had a similar issue to this a few months ago and was told it was an issue within nvOC: when the miner switched to "donation" mode it would freeze, and the solution was to switch miners. I've tried ETHMINER, GENOIL and CLAYMORE. All the same issue.

Oh, the miner you are referring to was an older version of the DSTM miner. The dev only had one pool configured for his donations and a network issue in Europe hosed a bunch of us for hours early one morning. That was fixed in a newer DSTM miner version, several versions ago. That was not a freeze. It was just the miner trying over and over to connect to something that it could not. When the watchdog saw that the GPUs were idle, it would restart the miner a few times and ultimately the box. This went on for hours and hours and even destroyed the boot USB drive for some folks running an older nvOC version.

Have you checked to see what the system logs (/var/log) say? I am assuming when you say "freeze" that the entire host becomes unresponsive and has to be hard rebooted.

Yeah, it completely freezes and becomes unresponsive. Here is the error after turning on the logs.

m1@m1-desktop:~$ tail -f  /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011

LOG FILE: (Showing the last 10 recorded entries)
|  7  |    112W     |  3.87 Sol/W  |
|  8  |    116W     |  4.26 Sol/W  |
|  9  |     98W     |  3.82 Sol/W  |
| 10  |    120W     |  3.65 Sol/W  |
| 11  |    121W     |  3.68 Sol/W  |
| 12  |    118W     |  3.58 Sol/W  |
+-----+-------------+--------------+
CRITICAL: Sun Apr 29 09:37:52 MST 2018 - GPU Utilization is too low: restarting 3main...
WARNING: Mon Apr 30 02:25:19 MST 2018 - Internet is down, checking...
WARNING: Mon Apr 30 09:38:05 MST 2018 - Internet is down, checking...
 
Mon Apr 30 22:34:23 MST 2018 - No mining issues detected.
GPU UTILIZATION:  100 96 100 100 100 99 97 100 100 98 100 99 99
      GPU_COUNT:  13
 
Mon Apr 30 22:34:43 MST 2018 - GPU 2 under threshold found - GPU UTILIZATION:   59
Mon Apr 30 22:34:43 MST 2018 - GPU 3 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:44 MST 2018 - GPU 4 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:44 MST 2018 - GPU 5 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:45 MST 2018 - GPU 6 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:45 MST 2018 - GPU 7 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:46 MST 2018 - GPU 8 under threshold found - GPU UTILIZATION:   10
Mon Apr 30 22:34:46 MST 2018 - GPU 9 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:46 MST 2018 - GPU 10 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:46 MST 2018 - GPU 11 under threshold found - GPU UTILIZATION:   0
Mon Apr 30 22:34:47 MST 2018 - GPU 12 under threshold found - GPU UTILIZATION:   0
Connection to google.com 443 port [tcp/https] succeeded!
Connection to google.com 443 port [tcp/https] succeeded!
WARNING: Mon Apr 30 22:34:47 MST 2018 - Found no miner, jumping to 3main restart


WARNING: Mon Apr 30 22:34:47 MST 2018 - Problem found: See diagnostics below:
Percent of GPUs bellow threshold: 84 %
name, pstate, temperature.gpu, fan.speed [%], utilization.gpu [%], power.draw [W], power.limit [W]
GeForce GTX 1070 Ti, P0, 63, 60 %, 27 %, 40.54 W, 120.00 W
GeForce GTX 1070 Ti, P0, 65, 75 %, 0 %, 32.57 W, 120.00 W
GeForce GTX 1070 Ti, P0, 68, 65 %, 0 %, 42.96 W, 120.00 W
GeForce GTX 1080 Ti, P0, 66, 60 %, 0 %, 60.21 W, 175.00 W
GeForce GTX 1070 Ti, P0, 60, 60 %, 0 %, 31.63 W, 120.00 W
GeForce GTX 1070 Ti, P0, 59, 60 %, 0 %, 27.96 W, 120.00 W
GeForce GTX 1070 Ti, P0, 65, 80 %, 0 %, 35.29 W, 120.00 W
GeForce GTX 1070 Ti, P0, 67, 90 %, 0 %, 41.77 W, 120.00 W
GeForce GTX 1080, P0, 61, 60 %, 0 %, 44.37 W, 120.00 W
GeForce GTX 1070, P0, 67, 80 %, 0 %, 36.41 W, 120.00 W
GeForce GTX 1070 Ti, P0, 67, 65 %, 0 %, 42.16 W, 120.00 W
GeForce GTX 1070 Ti, P0, 62, 60 %, 0 %, 34.22 W, 120.00 W
GeForce GTX 1070, P0, 60, 60 %, 0 %, 37.31 W, 120.00 W
+-----+-------------+--------------+
|  0  |    121W     |  3.50 Sol/W  |
|  1  |    122W     |  3.59 Sol/W  |
|  2  |    120W     |  3.69 Sol/W  |
|  3  |    138W     |  4.42 Sol/W  |
|  4  |    117W     |  3.83 Sol/W  |
|  5  |    118W     |  3.81 Sol/W  |
|  6  |    113W     |  3.81 Sol/W  |
|  7  |    125W     |  3.55 Sol/W  |
|  8  |    115W     |  4.23 Sol/W  |
|  9  |    120W     |  3.32 Sol/W  |
| 10  |    122W     |  3.49 Sol/W  |
| 11  |    118W     |  3.69 Sol/W  |
| 12  |    121W     |  3.40 Sol/W  |
+-----+-------------+--------------+
CRITICAL: Mon Apr 30 22:34:47 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:37:07 MST 2018 - Back 'on watch' after miner restart
GPU UTILIZATION:  95 98 99 100 99 96 99 100 100 100 99 100 100
      GPU_COUNT:  13
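The "84 %" in that log can be reproduced from the readings, which helps when judging how close a rig was to the restart condition. This is a sketch only: the threshold of 70 is a guess, GPUs 0 and 1 are assumed healthy because they were not flagged, and percent_below is a made-up function, not the watchdog's own code.

```shell
#!/usr/bin/env bash
# Percentage of readings below a threshold, using integer division.
percent_below() {
  local threshold=$1; shift
  local below=0 total=0 u
  for u in "$@"; do
    total=$(( total + 1 ))
    if (( u < threshold )); then
      below=$(( below + 1 ))
    fi
  done
  echo $(( below * 100 / total ))
}

# GPUs 0-1: assumed healthy values; GPUs 2-12: values from the log above.
percent_below 70 100 96 59 0 0 0 0 0 10 0 0 0 0
```

Eleven of thirteen cards under the threshold gives 84 with integer division, which matches the logged figure.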
newbie
Activity: 44
Merit: 0
One of my rigs info. The other won't let me in for some reason. I believe it's an access issue, which I figured out in the past but can't remember now.

ID,VENDOR,MODEL,PSTATE,TEMP,FAN,UTILIZATION,POWER,POWERLIMIT,MAXPOWER,GPUCLOCK,MEMCLOCK
--------------------------------------------------------------------------------
0, PNY, GeForce GTX 1070 Ti, P2, 65, 60, 100, 114.30, 120.00, 217.00, 1607, 3874
1, PNY, GeForce GTX 1070 Ti, P2, 72, 70, 100, 118.06, 120.00, 217.00, 1556, 3874
2, PNY, GeForce GTX 1070 Ti, P2, 69, 60, 99, 121.58, 120.00, 217.00, 1569, 3874
3, EVGA, GeForce GTX 1080 Ti, P2, 68, 60, 100, 158.63, 175.00, 300.00, 1544, 5078
4, EVGA, GeForce GTX 1070 Ti, P2, 62, 60, 99, 116.47, 120.00, 217.00, 1594, 3874
5, EVGA, GeForce GTX 1070 Ti, P2, 63, 60, 93, 115.74, 120.00, 217.00, 1594, 3874
6, PNY, GeForce GTX 1070 Ti, P2, 70, 75, 100, 119.80, 120.00, 217.00, 1544, 3874
7, PNY, GeForce GTX 1070 Ti, P2, 73, 80, 99, 119.72, 120.00, 217.00, 1620, 3874
8, EVGA, GeForce GTX 1080, P2, 74, 75, 100, 174.96, 175.00, 217.00, 1898, 4590
9, PNY, GeForce GTX 1070, P2, 72, 80, 100, 122.47, 120.00, 170.00, 1645, 3874
10, PNY, GeForce GTX 1070 Ti, P2, 70, 60, 97, 118.81, 120.00, 217.00, 1544, 3874
11, PNY, GeForce GTX 1070 Ti, P2, 66, 60, 99, 92.56, 120.00, 217.00, 1556, 3874
12, EVGA, GeForce GTX 1070, P2, 62, 60, 100, 121.06, 120.00, 170.00, 1759, 3874
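To answer the "how much do you draw" question from a report like this, the POWER column can be summed with awk. A minimal sketch, assuming the report is saved to a file in exactly the comma-separated layout shown above; total_power is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sum the POWER column (8th field) of the comma-separated rig report.
# Only rows whose first field is a numeric GPU ID are counted, so the
# header and the dashed separator line are skipped.
total_power() {
  awk -F', *' '$1 ~ /^[0-9]+$/ { sum += $8 } END { printf "%.2f\n", sum }' "$1"
}

# Typical use (file name is hypothetical):
#   total_power rig_report.txt
```

For the 13 rows posted above, the column adds up to about 1614 W at the GPUs, before PSU losses and anything drawn by the motherboard.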
newbie
Activity: 44
Merit: 0
Ahhhhhh This is driving me nuts...

On Hush, ZERO issues. Mined for over a week. Switched to ZenCash yesterday (figured Equihash works with Hush, so why not). Freezes after 18 hours.

I seriously do not understand. These rigs went months running fine and now are nothing but problematic. Any ideas would be greatly appreciated.

Two rigs, go down at the same time. There has to be some answer as to why.

Can't seem to figure how to attach the photo to the thread.. But here is one of the rigs.

https://imgur.com/a/WRfTksJ
fk1
full member
Activity: 216
Merit: 100
Thank you very much, I will try today after work.
full member
Activity: 686
Merit: 140
Linux FOREVER! Resistance is futile!!!
I am currently trying to get the best of nvOC into rxOC. I know this is the nvOC thread, but what I am doing is already very unusual, and in the rxOC thread it feels like no one reads, so I have slight hope somebody has a key tip for me here.

I tried to implement the WTM switch on my own in my rxOC rig. I failed because too many things need to change in the watchdog. Then I had the idea to keep my rxOC image and simply copy everything from the nvOC home/m1 folder into the rxOC home/m1 folder, editing a couple of necessary things. Now when I insert one card, everything works fine and WTM switches to a profitable coin on rxOC!

Unfortunately, when inserting a second card I get this message:

https://ibb.co/ga8obx

I also got this once with one card. After a reboot everything worked fine again. I already switched off the integrated GPU in the BIOS. Maybe somebody can point me in the right direction, please?

btw: still no hashing on XMR on my original nvOC rig. Is this working?


You can't use nvOC files in rxOC.
Copy rxOC 1bash, 3main, 2unix, watchdog and temp control to pastebin and send me the links
I will check and see if I can do anything for rxOC.


oneBash: https://pastebin.com/CXGuffeU
2unix: https://pastebin.com/MPzG6tK5

That's all from rxOC. I guess one needs to separate all the 0miner and 3main stuff from the oneBash file, otherwise things will not work the same as with nvOC and the WTM switch. I tried but failed.


Here is the WTM auto switch for rxOC.
I don't have rxOC, so I cannot test it.
Check and let me know how it goes.

