I may have stumbled on a bug.
I have two GPUs, with the following data:
❯ nvidia-smi --query-gpu=timestamp,name,pstate,utilization.gpu,clocks.sm,clocks.mem,clocks.gr --format=csv
timestamp, name, pstate, utilization.gpu [%], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
2017/07/13 01:45:49.892, GeForce GTX 970, P0, 100 %, 1521 MHz, 3564 MHz, 1521 MHz
2017/07/13 01:45:49.892, GeForce GTX 1060 6GB, P2, 100 %, 1936 MHz, 3802 MHz, 1936 MHz
However, while mining all day, the GPUs sometimes crash and the miner is left running at 0 hash/s.
I'm running the miner with the following command:
❯ while sleep 1;do ./miner --server us1-zcash.flypool.org --user .$(hostname)_gpu --pass x --port 3333 --templimit 75 --tempunits C --api 0.0.0.0:42000 --pec --cuda_devices 0 1 --eexit 3;echo "\n\nDEAD\n\n";sleep 1m;done
and this is the miner's output:
+-------------------------------------------------+
| EWBF's Zcash CUDA miner. 0.3.4b |
+-------------------------------------------------+
INFO: Current pool: us1-zcash.flypool.org:3333
INFO: Selected pools: 1
INFO: Solver: Auto.
INFO: Devices: User defined.
INFO: Temperature limit: 75
INFO: Api: Listen on 0.0.0.0:42000
---------------------------------------------------
INFO: Target: 00083126e978d4fd...
CUDA: Device: 0 GeForce GTX 1060 6GB, 6072 MB i:64
CUDA: Device: 1 GeForce GTX 970, 4037 MB i:64
INFO: Detected new work: e17f3483aec6a8a7733b
CUDA: Device: 1 Selected solver: 0
CUDA: Device: 0 Selected solver: 0
INFO 01:37:23: GPU0 Accepted share 161ms [A:1, R:0]
INFO 01:37:24: GPU0 Accepted share 161ms [A:2, R:0]
Temp: GPU0: 65C GPU1: 59C
GPU0: 305 Sol/s GPU1: 312 Sol/s
Total speed: 617 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 112W | 2.72 Sol/W |
| 1 | 177W | 1.76 Sol/W |
+-----+-------------+--------------+
... some time later
Temp: GPU0: 66C GPU1: 64C
GPU0: 304 Sol/s GPU1: 308 Sol/s
Total speed: 612 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 113W | 2.69 Sol/W |
| 1 | 186W | 1.66 Sol/W |
+-----+-------------+--------------+
INFO 01:24:44: GPU0 Accepted share 161ms [A:72, R:0]
INFO 01:24:58: GPU1 Accepted share 161ms [A:65, R:0]
INFO 01:25:05: GPU0 Accepted share 161ms [A:73, R:0]
ERROR: Looks like GPU1 are stopped. Restart attempt.
Temp: GPU0: 66C GPU1: 50C
GPU0: 305 Sol/s GPU1: 301 Sol/s
Total speed: 606 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 112W | 2.72 Sol/W |
| 1 | 66W | 4.56 Sol/W |
+-----+-------------+--------------+
ERROR: Looks like GPU1 are stuck he not respond.
CUDA: Device: 0 Thread exited with code: 6
ERROR: Looks like GPU1 are stopped. Restart attempt.
INFO: GPU1 are restarted.
ERROR: Looks like GPU0 are stopped. Restart attempt.
INFO: GPU0 are restarted.
INFO: Detected new work: f9ada95410732a348394
Temp: GPU0: 51C GPU1: 43C
GPU0: 0 Sol/s GPU1: 0 Sol/s
Total speed: 0 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 27W | 0.00 Sol/W |
| 1 | 63W | 0.00 Sol/W |
+-----+-------------+--------------+
Temp: GPU0: 47C GPU1: 42C
GPU0: 0 Sol/s GPU1: 0 Sol/s
Total speed: 0 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 23W | 0.00 Sol/W |
| 1 | 57W | 0.00 Sol/W |
+-----+-------------+--------------+
Temp: GPU0: 47C GPU1: 42C
GPU0: 0 Sol/s GPU1: 0 Sol/s
Total speed: 0 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 23W | 0.00 Sol/W |
| 1 | 57W | 0.00 Sol/W |
+-----+-------------+--------------+
Temp: GPU0: 48C GPU1: 42C
GPU0: 0 Sol/s GPU1: 0 Sol/s
Total speed: 0 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 23W | 0.00 Sol/W |
| 1 | 57W | 0.00 Sol/W |
+-----+-------------+--------------+
ERROR: Looks like GPU1 are stopped. Restart attempt.
ERROR: Looks like GPU1 are stuck he not respond.
Temp: GPU0: 49C GPU1: 42C
GPU0: 0 Sol/s GPU1: 0 Sol/s
Total speed: 0 Sol/s
+-----+-------------+--------------+
| GPU | Power usage | Efficiency |
+-----+-------------+--------------+
| 0 | 23W | 0.00 Sol/W |
| 1 | 56W | 0.00 Sol/W |
+-----+-------------+--------------+
and it stayed like that for as long as I watched (at least 30 minutes).
So I'm wondering why, even with --eexit 3, the miner didn't exit and restart properly. My workaround was to kill it from another script:
❯ while sleep 10;do curl --silent "http://localhost:42000/getstat" |grep '"speed_sps":0' && (pkill miner && echo "killed" && sleep 5m);done
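A slightly more defensive version of that watchdog, as a rough sketch only: it kills the miner after several consecutive zero readings instead of a single one, so a brief dip right after startup does not trigger a kill. The function names `any_gpu_stalled` and `watchdog` and the threshold of 3 are my own choices, not part of the miner. It assumes, like the grep above, that the /getstat response contains `"speed_sps":<integer>` fields.

```shell
#!/bin/sh
# Hypothetical helper: reads the /getstat response on stdin and succeeds
# if at least one "speed_sps" field is exactly 0.
# Assumption: the field appears as "speed_sps":<integer> in the response.
any_gpu_stalled() {
    grep -Eq '"speed_sps":[[:space:]]*0([^0-9]|$)'
}

# Hypothetical watchdog: polls the API every 10 seconds and kills the miner
# only after $1 (default 3) consecutive polls that show a stalled GPU.
watchdog() {
    max_zero=${1:-3}
    zero_count=0
    while sleep 10; do
        if curl --silent --max-time 5 "http://localhost:42000/getstat" | any_gpu_stalled; then
            zero_count=$((zero_count + 1))
        else
            # Healthy reading, or the API is unreachable: start counting again.
            zero_count=0
        fi
        if [ "$zero_count" -ge "$max_zero" ]; then
            pkill miner && echo "killed"
            zero_count=0
            sleep 300   # give the outer restart loop time to bring the miner back
        fi
    done
}

# To start it, call for example:  watchdog 3
```

With the default threshold, a stalled miner is killed roughly 30 seconds after the first zero reading. If the API is unreachable, the counter is reset, so this sketch does not catch a miner whose API has hung as well.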
Can anyone help me? And how can I help improve the miner?
Big thx!