Topic: [OS] nvOC easy-to-use Linux Nvidia Mining - page 30. (Read 418244 times)

Quote from: urnzwy on May 03, 2018, 09:52:03 AM

Not sure this will be relevant, but your system of 12 cards using one 850W PSU for the risers is how I killed two PSUs. I only had 7 cards running and it was too much for the SATA channels to handle. One PSU will be good for 6 SATA devices, so I think you are browning out the main PSU. Once I dropped back on card count, no more problems. In the end, I gave up on the server PSUs and went with dual 1000W PSUs.

thay

The 850W PSU only runs the motherboard and one card and its riser. I have three PSUs per rig: two 1600W (6 cards and their risers each) and the 850W as described. I am nowhere near the amount of power the power supplies are capable of, so I should be good there.

This is one of the rigs - https://imgur.com/a/WRfTksJ

That is a really nice (and big) rig. Did you find anything in syslog (as per my previous post)?

m1@m1-desktop:~$ sudo less /var/log/dmesg
(Nothing has been logged yet.)

That is all I get. I am using 19.2.

Stubo

Not sure this will be relevant, but your system of 12 cards using one 850W PSU for the risers is how I killed two PSUs. I only had 7 cards running and it was too much for the SATA channels to handle. One PSU will be good for 6 SATA devices, so I think you are browning out the main PSU. Once I dropped back on card count, no more problems. In the end, I gave up on the server PSUs and went with dual 1000W PSUs.

thay

The 850W PSU only runs the motherboard and one card and its riser. I have three PSUs per rig: two 1600W (6 cards and their risers each) and the 850W as described. I am nowhere near the amount of power the power supplies are capable of, so I should be good there.

This is one of the rigs - https://imgur.com/a/WRfTksJ

That is a really nice (and big) rig. Did you find anything in syslog (as per my previous post)?

urnzwy

Not sure this will be relevant, but your system of 12 cards using one 850W PSU for the risers is how I killed two PSUs. I only had 7 cards running and it was too much for the SATA channels to handle. One PSU will be good for 6 SATA devices, so I think you are browning out the main PSU. Once I dropped back on card count, no more problems. In the end, I gave up on the server PSUs and went with dual 1000W PSUs.

thay

The 850W PSU only runs the motherboard and one card and its riser. I have three PSUs per rig: two 1600W (6 cards and their risers each) and the 850W as described. I am nowhere near the amount of power the power supplies are capable of, so I should be good there.

This is one of the rigs - https://imgur.com/a/WRfTksJ

Stubo

Quote from: urnzwy on May 02, 2018, 06:21:00 PM

Not sure this will be relevant, but your system of 12 cards using one 850W PSU for the risers is how I killed two PSUs. I only had 7 cards running and it was too much for the SATA channels to handle. One PSU will be good for 6 SATA devices, so I think you are browning out the main PSU. Once I dropped back on card count, no more problems. In the end, I gave up on the server PSUs and went with dual 1000W PSUs.

thay

That is a good point, Thay. I was very surprised to learn how much some GPUs pull through the PCI slot, and also that it varies by GPU and manufacturer. If we assume 50W per card (which may even be low) across 12 cards and another 80W for the mobo, that gives us 12 x 50W + 80W = 680W. That is what I would consider to be the max usable for an 850W PSU (80% of rated output).
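For anyone who wants to redo that estimate with their own numbers, here is a minimal shell sketch of the same arithmetic. The function names `slot_load` and `usable_watts` are made up for illustration, and the 50W, 80W and 80% figures are the rough assumptions from the paragraph above, not measurements.

```shell
#!/usr/bin/env bash
# Estimate the load on a PSU that feeds the motherboard and every PCIe slot/riser.
# All wattages are rough assumptions, not measured values.

# slot_load CARDS WATTS_PER_CARD MOBO_WATTS -> estimated draw in watts
slot_load() {
    echo $(( $1 * $2 + $3 ))
}

# usable_watts RATED_WATTS PERCENT -> output considered usable at that derating
usable_watts() {
    echo $(( $1 * $2 / 100 ))
}

load=$(slot_load 12 50 80)      # 12 cards, 50 W each through the slot, 80 W mobo
limit=$(usable_watts 850 80)    # 80% of an 850 W PSU
echo "estimated load: ${load} W, usable: ${limit} W"
if [ "$load" -ge "$limit" ]; then
    echo "no headroom left on this PSU"
fi
```

With these inputs both numbers come out at 680 W, so the script reports no headroom. Feeding only one card from that PSU (`slot_load 1 50 80`) leaves plenty.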

EDIT: IMO, trying to get a 12 GPU rig to be rock stable is an exercise in frustration. I have read dozens of messages on this thread (and others) of folks having all kinds of oddball problems with rigs that use more than 8 GPUs. This includes not only nvOC, but also Windows, SMOS, and HiveOS. It can be done, but it comes with an inherent lack of stability. If I were you, I would split it up into 2 rigs and get on with mining.

thaelin

Not sure this will be relevant, but your system of 12 cards using one 850W PSU for the risers is how I killed two PSUs. I only had 7 cards running and it was too much for the SATA channels to handle. One PSU will be good for 6 SATA devices, so I think you are browning out the main PSU. Once I dropped back on card count, no more problems. In the end, I gave up on the server PSUs and went with dual 1000W PSUs.

thay

Stubo

I have two 1600W server PSUs for the cards (6 on each) and one 850W EVGA ATX with one card / riser.

All using 6-8pin, risers are split once (1 cable per 2 risers)

Any chance it's a memory issue? I am running 4GB, my memory says 3.2GB / 4GB in use (85%).

Since disabling watchdog, it shouldn't be a server issue. The miner would just keep trying to reconnect.

Check the logs (in /var/log). What does syslog say?
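One way to narrow that down, as a rough sketch: filter the log for the tags the NVIDIA kernel driver normally writes (`NVRM`, `Xid`). The helper name `gpu_log_errors` is hypothetical, and `/var/log/syslog` is assumed to be the stock Ubuntu location.

```shell
#!/usr/bin/env bash
# gpu_log_errors FILE -> print only the lines tagged by the NVIDIA kernel driver.
# "NVRM" prefixes the driver's kernel messages and "Xid" marks its GPU error reports.
gpu_log_errors() {
    grep -E 'NVRM|Xid' "$1"
}

# Typical use on the rig (path is an assumption):
#   gpu_log_errors /var/log/syslog | tail -n 20
```

If nothing shows up around the time of a crash, the problem may be happening below the driver (power, riser, motherboard), which the log alone will not prove.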

urnzwy

Quote from: papampi on May 01, 2018, 01:57:03 PM

Quote from: urnzwy on May 01, 2018, 10:06:40 AM

Quote from: urnzwy on May 01, 2018, 06:14:37 PM

Error on rig 2 - Two different rigs, crashing within a minute of each other. Tell me that isn't weird.

tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
m1@m1-desktop:~$ tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011

LOG FILE: (Showing the last 10 recorded entries)
| 12 | 120W | 3.42 Sol/W |
+-----+-------------+--------------+
INFO 09:34:50: GPU3 Accepted share 186ms [A:454, R:1]
INFO 09:34:51: GPU7 Accepted share 187ms [A:477, R:1]
CRITICAL: Sun Apr 29 09:35:17 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:35:29 MST 2018 - Lost GPU so restarting system. Found GPU's:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost. Reboot the system to recover this GPU

Mon Apr 30 22:35:30 MST 2018 - reboot in 10 seconds

If both rigs crash and freeze at the same time, it can be an electrical problem.
I had almost the same issue a while back, with some of my rigs all crashing at the same time.
I found out that when one of the room venting fans turned on, it put high-frequency noise on the electrical supply, and 3-4 rigs would get the lost GPU error at the same time and reboot.
After a month of pulling my hair out trying to find the problem, I changed that fan and the problem was solved.

Open 5watchdog
Change:

Code:

        echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

To:

Code:

        echo "$(date) - Lost GPU $GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

That way you can see the number of the GPU that is lost, and check whether it is always the same GPU.
If it's always the same GPU, remove it from the rig and check; it may be a faulty GPU, riser or power cable.
If, after removing the GPU, the problem jumps to another GPU, then it could be a power problem.
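Once the log has a few of these events in it, a quick tally per GPU can show whether one card keeps dropping. This is only a sketch: the helper name `lost_gpu_tally` is made up, and it assumes the driver message keeps the format shown in the log excerpt above (`... for GPU 0000:0F:00.0: GPU is lost`).

```shell
#!/usr/bin/env bash
# lost_gpu_tally FILE -> count "GPU is lost" events per PCI bus ID, most frequent first.
lost_gpu_tally() {
    grep 'GPU is lost' "$1" \
        | grep -o -E 'GPU [0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]' \
        | sort | uniq -c | sort -rn
}

# Example (log path taken from the tail command earlier in the thread):
#   lost_gpu_tally /home/m1/nvoc_logs/watchdog-screenlog.0
```

A single bus ID at the top of the list points at one card, riser or cable; counts spread across many bus IDs fit the power explanation better.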

Hmmmm interesting about the electrical issue. I have two 8" hyperfans used as exhaust fans that run 100% 24/7. But could be a possibility.

Last night I recompiled the miners again, set the max fan limit to 90 from another one of your posts and set the power restore to 80 and changed EWBF to 3_3 from 3_4.

I will try this for a day and see if anything happens. It's just so strange that on Hush, it worked completely fine.

I had this kinda happen before and it was the mining server "disconnecting". Switched pools and all was good.

Any way to make watchdog wait an extended period of time for errors to clear themselves before trying to restart 3main?

Also, if watchdog is disabled, do the miners themselves keep a max temp limit? I notice when starting EWBF it says max temp 90*. While I don't want temps that high, if it keeps the miner going and safe then I will consider it.

I have two 20-inch fans on a thermostat that starts them when the room temp goes above 15C and stops them at 10C, and one of them causes the problem at startup.
Anyway, I suggest you try out the dstm zm miner; it is much better in my experience.
What are your PSUs, how much do you draw from them, and how are the risers powered?

If both rigs restart 3main at the same time when mining some coins and all is good with another coin, try changing the pool.
Recently I have been getting lots of pool disconnects from MPH, and all rigs restart 3main at almost the same time.

I have two 1600W server PSUs for the cards (6 on each) and one 850W EVGA ATX with one card / riser.

All using 6-8pin, risers are split once (1 cable per 2 risers)

Any chance it's a memory issue? I am running 4GB, my memory says 3.2GB / 4GB in use (85%).

Since disabling watchdog, it shouldn't be a server issue. The miner would just keep trying to reconnect.

urnzwy

So it looks like the server is temporarily disconnecting and crashing the miners. This is going to happen. So I am going to disable watchdog and just let it re-connect on its own.

I was on MPH and switched to miningspeed. Same issue. Maybe my rigs just don't like to be restarted and like to work all the time. lol

Latest Errors -

LOG FILE: (Showing the last 10 recorded entries)
CUDA: Device: 8 Thread exited with code: 46
CUDA: Device: 7 Thread exited with code: 46
CUDA: Device: 11 Thread exited with code: 46
CUDA: Device: 12 User selected solver: 0
CUDA: Device: 3 User selected solver: 0
CUDA: Device: 4 User selected solver: 0
CUDA: Device: 12 Thread exited with code: 46
CUDA: Device: 3 Thread exited with code: 46
CUDA: Device: 4 Thread exited with code: 46
CRITICAL: Tue May 1 15:17:20 MST 2018 - GPU Utilization is too low: restarting 3main...

LOG FILE: (Showing the last 10 recorded entries)
+-------------------------------------------------+
INFO: Server: mining.miningspeed.com:3062
INFO: Solver Auto.
INFO: Devices: All.
INFO: Temperature limit: 90
INFO: Api: Disabled
---------------------------------------------------
ERROR: Cannot connect to the server. 1
CRITICAL: Tue May 1 04:00:44 MST 2018 - GPU Utilization is too low: restarting 3main...
WARNING: Tue May 1 16:03:20 MST 2018 - Internet is down, checking...

Well... Disabling watchdog didn't work. Still crashed.

ha5hi5h

Quote from: Stubo on April 30, 2018, 04:08:33 AM

Quote from: ha5hi5h on April 29, 2018, 11:47:33 PM

Hi,

I noticed that there are a few double periods in the 0miner file:

Line 883 onwards:
if [ $COIN == "PASC" ]
then
HCD='/home/m1/pasc/sgminer'
ADDR="$PASC_ADDRESS..$PASC_WORKER"

Is this right? Or should I reduce it to a single period?

If all of the double periods follow that example where they are between the address and worker name, then yes. There should only be a single period between them.
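To check the whole file for this pattern rather than eyeballing it, something like the following could work. It is a sketch only: the function names are made up, and it only touches a `..` that sits directly between a variable ending in `_ADDRESS` and the next `$` variable, as in the PASC example above.

```shell
#!/usr/bin/env bash
# list_double_dots FILE -> show numbered lines where ".." joins *_ADDRESS to a $ variable
list_double_dots() {
    grep -n -E '_ADDRESS\.\.\$' "$1"
}

# fix_double_dots FILE -> collapse that ".." to a single "." in place, keeping FILE.bak
fix_double_dots() {
    sed -i.bak -E 's/(_ADDRESS)\.\.(\$)/\1.\2/g' "$1"
}

# Example (the location of 0miner is an assumption):
#   list_double_dots /home/m1/0miner
#   fix_double_dots  /home/m1/0miner
```

Listing first and fixing second makes it easy to confirm that every hit really is an address/worker pair before anything is rewritten.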

Thanks for the clarification - any way to make sure this gets included in the next update? I was wondering why my worker didn't show up.

urnzwy

So it looks like the server is temporarily disconnecting and crashing the miners. This is going to happen. So I am going to disable watchdog and just let it re-connect on its own.

I was on MPH and switched to miningspeed. Same issue. Maybe my rigs just don't like to be restarted and like to work all the time. lol

Latest Errors -

LOG FILE: (Showing the last 10 recorded entries)
CUDA: Device: 8 Thread exited with code: 46
CUDA: Device: 7 Thread exited with code: 46
CUDA: Device: 11 Thread exited with code: 46
CUDA: Device: 12 User selected solver: 0
CUDA: Device: 3 User selected solver: 0
CUDA: Device: 4 User selected solver: 0
CUDA: Device: 12 Thread exited with code: 46
CUDA: Device: 3 Thread exited with code: 46
CUDA: Device: 4 Thread exited with code: 46
CRITICAL: Tue May 1 15:17:20 MST 2018 - GPU Utilization is too low: restarting 3main...

LOG FILE: (Showing the last 10 recorded entries)
+-------------------------------------------------+
INFO: Server: mining.miningspeed.com:3062
INFO: Solver Auto.
INFO: Devices: All.
INFO: Temperature limit: 90
INFO: Api: Disabled
---------------------------------------------------
ERROR: Cannot connect to the server. 1
CRITICAL: Tue May 1 04:00:44 MST 2018 - GPU Utilization is too low: restarting 3main...
WARNING: Tue May 1 16:03:20 MST 2018 - Internet is down, checking...

papampi

Linux FOREVER! Resistance is futile!!!

Quote from: urnzwy on May 01, 2018, 10:06:40 AM

Quote from: urnzwy on May 01, 2018, 10:06:40 AM

Error on rig 2 - Two different rigs, crashing within a minute of each other. Tell me that isn't weird.

tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
m1@m1-desktop:~$ tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011

LOG FILE: (Showing the last 10 recorded entries)
| 12 | 120W | 3.42 Sol/W |
+-----+-------------+--------------+
INFO 09:34:50: GPU3 Accepted share 186ms [A:454, R:1]
INFO 09:34:51: GPU7 Accepted share 187ms [A:477, R:1]
CRITICAL: Sun Apr 29 09:35:17 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:35:29 MST 2018 - Lost GPU so restarting system. Found GPU's:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost. Reboot the system to recover this GPU

Mon Apr 30 22:35:30 MST 2018 - reboot in 10 seconds

If both rigs crash and freeze at the same time, it can be an electrical problem.
I had almost the same issue a while back, with some of my rigs all crashing at the same time.
I found out that when one of the room venting fans turned on, it put high-frequency noise on the electrical supply, and 3-4 rigs would get the lost GPU error at the same time and reboot.
After a month of pulling my hair out trying to find the problem, I changed that fan and the problem was solved.

Open 5watchdog
Change:

Code:

        echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

To:

Code:

        echo "$(date) - Lost GPU $GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

That way you can see the number of the GPU that is lost, and check whether it is always the same GPU.
If it's always the same GPU, remove it from the rig and check; it may be a faulty GPU, riser or power cable.
If, after removing the GPU, the problem jumps to another GPU, then it could be a power problem.

Hmmmm interesting about the electrical issue. I have two 8" hyperfans used as exhaust fans that run 100% 24/7. But could be a possibility.

Last night I recompiled the miners again, set the max fan limit to 90 from another one of your posts and set the power restore to 80 and changed EWBF to 3_3 from 3_4.

I will try this for a day and see if anything happens. It's just so strange that on Hush, it worked completely fine.

I had this kinda happen before and it was the mining server "disconnecting". Switched pools and all was good.

Any way to make watchdog wait an extended period of time for errors to clear themselves before trying to restart 3main?

Also, if watchdog is disabled, do the miners themselves keep a max temp limit? I notice when starting EWBF it says max temp 90*. While I don't want temps that high, if it keeps the miner going and safe then I will consider it.

I have two 20-inch fans on a thermostat that starts them when the room temp goes above 15C and stops them at 10C, and one of them causes the problem at startup.
Anyway, I suggest you try out the dstm zm miner; it is much better in my experience.
What are your PSUs, how much do you draw from them, and how are the risers powered?

If both rigs restart 3main at the same time when mining some coins and all is good with another coin, try changing the pool.
Recently I have been getting lots of pool disconnects from MPH, and all rigs restart 3main at almost the same time.

Stubo

Error on rig 2 - Two different rigs, crashing within a minute of each other. Tell me that isn't weird.

tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
m1@m1-desktop:~$ tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011

LOG FILE: (Showing the last 10 recorded entries)
| 12 | 120W | 3.42 Sol/W |
+-----+-------------+--------------+
INFO 09:34:50: GPU3 Accepted share 186ms [A:454, R:1]
INFO 09:34:51: GPU7 Accepted share 187ms [A:477, R:1]
CRITICAL: Sun Apr 29 09:35:17 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:35:29 MST 2018 - Lost GPU so restarting system. Found GPU's:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost. Reboot the system to recover this GPU

Mon Apr 30 22:35:30 MST 2018 - reboot in 10 seconds

If both rigs crash and freeze at the same time, it can be an electrical problem.
I had almost the same issue a while back, with some of my rigs all crashing at the same time.
I found out that when one of the room venting fans turned on, it put high-frequency noise on the electrical supply, and 3-4 rigs would get the lost GPU error at the same time and reboot.
After a month of pulling my hair out trying to find the problem, I changed that fan and the problem was solved.

Open 5watchdog
Change:

Code:

        echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

To:

Code:

        echo "$(date) - Lost GPU $GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

That way you can see the number of the GPU that is lost, and check whether it is always the same GPU.
If it's always the same GPU, remove it from the rig and check; it may be a faulty GPU, riser or power cable.
If, after removing the GPU, the problem jumps to another GPU, then it could be a power problem.

Hmmmm interesting about the electrical issue. I have two 8" hyperfans used as exhaust fans that run 100% 24/7. But could be a possibility.

Last night I recompiled the miners again, set the max fan limit to 90 from another one of your posts and set the power restore to 80 and changed EWBF to 3_3 from 3_4.

I will try this for a day and see if anything happens. It's just so strange that on Hush, it worked completely fine.

I had this kinda happen before and it was the mining server "disconnecting". Switched pools and all was good.

Any way to make watchdog wait an extended period of time for errors to clear themselves before trying to restart 3main?

Also, if watchdog is disabled, do the miners themselves keep a max temp limit? I notice when starting EWBF it says max temp 90*. While I don't want temps that high, if it keeps the miner going and safe then I will consider it.

The watchdog and the temp control are 2 different scripts so even if you disable the watchdog, the temp control will still do its thing. If you want to expand the time between checks for the watchdog, change the interval of the main loop. At the bottom of the script, you will see this line:

Code:

sleep 10

Change this to a larger value like 15 or 20. NOTE that increasing this value on a rig with a lot of GPUs will dramatically increase the amount of time before the watchdog bounces the miner in the event that a problem is detected on a single GPU.
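A different way to give errors time to clear, sketched here as an assumption rather than how 5watchdog actually works, is to require several low readings in a row before restarting. `THRESHOLD`, `MAX_STRIKES` and `record_reading` are hypothetical names, not taken from the nvOC scripts.

```shell
#!/usr/bin/env bash
# Hypothetical grace-period check; none of these names come from the nvOC scripts.
THRESHOLD=80     # utilization (%) below which a reading counts as "low"
MAX_STRIKES=3    # consecutive low readings tolerated before acting
strikes=0

# record_reading UTIL -> returns 0 when it is time to restart, 1 to keep waiting
record_reading() {
    if [ "$1" -lt "$THRESHOLD" ]; then
        strikes=$(( strikes + 1 ))
    else
        strikes=0            # one healthy reading clears the count
    fi
    [ "$strikes" -ge "$MAX_STRIKES" ]
}

# Loop sketch: with a 10 second sleep, 3 strikes is roughly 30 seconds of grace.
#   while true; do
#       util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sort -n | head -n 1)
#       if record_reading "$util"; then echo "restarting miner"; strikes=0; fi
#       sleep 10
#   done
```

The trade-off is the same as with a longer sleep: a genuinely hung GPU sits idle for the whole grace period before anything reacts.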

urnzwy

Error on rig 2 - Two different rigs, crashing within a minute of each other. Tell me that isn't weird.

tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
m1@m1-desktop:~$ tail -f /home/m1/nvoc_logs/watchdog-screenlog.0
Watchdog for nvOC v0019-2.0 - Community Release
Version: v0019-2.0.011

LOG FILE: (Showing the last 10 recorded entries)
| 12 | 120W | 3.42 Sol/W |
+-----+-------------+--------------+
INFO 09:34:50: GPU3 Accepted share 186ms [A:454, R:1]
INFO 09:34:51: GPU7 Accepted share 187ms [A:477, R:1]
CRITICAL: Sun Apr 29 09:35:17 MST 2018 - GPU Utilization is too low: restarting 3main...
Mon Apr 30 22:35:29 MST 2018 - Lost GPU so restarting system. Found GPU's:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost. Reboot the system to recover this GPU

Mon Apr 30 22:35:30 MST 2018 - reboot in 10 seconds

If both rigs crash and freeze at the same time, it can be an electrical problem.
I had almost the same issue a while back, with some of my rigs all crashing at the same time.
I found out that when one of the room venting fans turned on, it put high-frequency noise on the electrical supply, and 3-4 rigs would get the lost GPU error at the same time and reboot.
After a month of pulling my hair out trying to find the problem, I changed that fan and the problem was solved.

Open 5watchdog
Change:

Code:

        echo "$(date) - Lost GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

To:

Code:

        echo "$(date) - Lost GPU $GPU so restarting system. Found GPU's:" | tee -a ${LOG_FILE}

That way you can see the number of the GPU that is lost, and check whether it is always the same GPU.
If it's always the same GPU, remove it from the rig and check; it may be a faulty GPU, riser or power cable.
If, after removing the GPU, the problem jumps to another GPU, then it could be a power problem.

Hmmmm interesting about the electrical issue. I have two 8" hyperfans used as exhaust fans that run 100% 24/7. But could be a possibility.

Last night I recompiled the miners again, set the max fan limit to 90 from another one of your posts and set the power restore to 80 and changed EWBF to 3_3 from 3_4.

I will try this for a day and see if anything happens. It's just so strange that on Hush, it worked completely fine.

I had this kinda happen before and it was the mining server "disconnecting". Switched pools and all was good.

Any way to make watchdog wait an extended period of time for errors to clear themselves before trying to restart 3main?

Also, if watchdog is disabled, do the miners themselves keep a max temp limit? I notice when starting EWBF it says max temp 90*. While I don't want temps that high, if it keeps the miner going and safe then I will consider it.

papampi
