
Topic: tracking failing riser in linux (using dmesg & nvidia-smi output)?

Here's my problem: I want to identify the failing card or riser behind (id=00e8) and (id=00e5). I learned how to identify a broken card by playing with the fans (duh), but I can't interpret syslog or dmesg to pinpoint the faulty one.

Any ideas?

dmesg dump
https://wrzutnik.net/quattro/image.php?dm=XA7Z

Code:
[54616.169199] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.169205] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.169208] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.169209] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[54616.172459] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.172465] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.172467] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.172470] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[54616.181447] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.181452] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.181455] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.181457] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[54616.186888] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.186894] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.186897] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.186899] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[54616.191988] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[54616.191994] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e8(Receiver ID)
[54616.191997] pcieport 0000:00:1d.0:   device [8086:a298] error status/mask=00000001/00002000
[54616.191999] pcieport 0000:00:1d.0:    [ 0] Receiver Error         (First)
[54616.192004] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.192009] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.192010] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.192012] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[54616.192015] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.192022] pcieport 0000:00:1c.5: can't find device of ID00e5
[54616.192365] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.192370] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.192373] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.192374] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[54616.193515] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[54616.193520] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e8(Receiver ID)
[54616.193522] pcieport 0000:00:1d.0:   device [8086:a298] error status/mask=00000001/00002000
[54616.193528] pcieport 0000:00:1d.0:    [ 0] Receiver Error         (First)
[54616.193793] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[54616.193798] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e8(Receiver ID)
[54616.193800] pcieport 0000:00:1d.0:   device [8086:a298] error status/mask=00000001/00002000
[54616.193801] pcieport 0000:00:1d.0:    [ 0] Receiver Error         (First)
[54616.194941] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.194947] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.194949] pcieport 0000:00:1c.5:   device [8086:a295] error status/mask=00000001/00002000
[54616.194951] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
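
A note on reading the id field: the AER id is a packed PCI address (bus in the top 8 bits, device in the next 5, function in the low 3), so 00e5 and 00e8 can be decoded by hand. Below is a minimal Python sketch of that decoding, plus a tally of errors per port. The function names and the regex are my own, and the pattern only matches lines worded like the dump above. Because the lines say "Receiver ID", the decoded address is the port that saw the bad signal, so the suspect would be whatever riser and card hang off that port, not the port number itself.

```python
import re
from collections import Counter

def decode_aer_id(aer_id: str) -> str:
    """Turn a 16-bit AER id such as '00e5' into 'bus:dev.fn'."""
    value = int(aer_id, 16)
    bus = (value >> 8) & 0xFF   # top 8 bits
    dev = (value >> 3) & 0x1F   # next 5 bits
    fn = value & 0x7            # low 3 bits
    return f"{bus:02x}:{dev:02x}.{fn}"

# Matches only the wording seen in the dmesg dump in this post.
AER_LINE = re.compile(
    r"pcieport (\S+): AER: Corrected error received: id=([0-9a-fA-F]{4})"
)

def tally(lines):
    """Count corrected errors per (reporting port, decoded id)."""
    counts = Counter()
    for line in lines:
        m = AER_LINE.search(line)
        if m:
            counts[(m.group(1), decode_aer_id(m.group(2)))] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (assumed): dmesg | python3 aer_tally.py
    for (port, source), n in tally(sys.stdin).most_common():
        print(f"{n:8d}  reported by {port}  source {source}")
```

With the ids from this dump, 00e5 decodes to 00:1c.5 and 00e8 to 00:1d.0, i.e. the two root ports that are doing the reporting.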


nvidia-smi
https://wrzutnik.net/quattro/image.php?dm=QBXE

Code:
Tue Jun 26 15:31:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 90%   57C    P2   192W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 90%   56C    P2   186W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 90%   58C    P2   186W / 190W |   1010MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 90%   53C    P2   186W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 90%   55C    P2   188W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:07:00.0 Off |                  N/A |
| 90%   60C    P2   185W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 90%   56C    P2   193W / 190W |   1010MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 90%   52C    P2   186W / 190W |   1010MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   8  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 90%   49C    P2   192W / 190W |   1010MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   9  GeForce GTX 108...  Off  | 00000000:0B:00.0 Off |                  N/A |
| 90%   55C    P2   193W / 190W |   1010MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|  10  GeForce GTX 108...  Off  | 00000000:0C:00.0 Off |                  N/A |
| 90%   56C    P2   189W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|  11  GeForce GTX 108...  Off  | 00000000:0D:00.0 Off |                  N/A |
| 90%   54C    P2   192W / 190W |   1010MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
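
To tie the Bus-Id column above to the ports named in dmesg, one option is to ask sysfs which ports sit above each GPU. The sketch below assumes the usual Linux layout, where /sys/bus/pci/devices/<address> is a symlink into a path that lists every upstream port. The function names are made up for illustration, and I have not run this on the rig in question.

```python
import os
import re

# A PCI address as sysfs spells it, e.g. 0000:00:1c.5
BDF = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")

def normalize_bus_id(smi_bus_id: str) -> str:
    """nvidia-smi prints an 8-digit domain and uppercase hex
    ('00000000:0A:00.0'); sysfs uses 4 digits and lowercase."""
    domain, bus, devfn = smi_bus_id.split(":")
    return f"{int(domain, 16):04x}:{bus.lower()}:{devfn.lower()}"

def port_chain(sysfs_bdf: str, sysfs_root: str = "/sys/bus/pci/devices"):
    """Return the PCI addresses on the path to a device, root port first
    and the device itself last."""
    real = os.path.realpath(os.path.join(sysfs_root, sysfs_bdf))
    return [part for part in real.split(os.sep) if BDF.fullmatch(part)]

if __name__ == "__main__":
    # Bus-Id values copied from the nvidia-smi table in this post.
    for smi_id in ["00000000:01:00.0", "00000000:0A:00.0"]:
        bdf = normalize_bus_id(smi_id)
        print(bdf, "<-", " / ".join(port_chain(bdf)))
```

Any GPU whose chain starts with 0000:00:1c.5 or 0000:00:1d.0 would be a candidate. If several GPUs share one root port through a switch, the chain shows the intermediate ports as well, and the corrected errors alone may not say which of those cards is at fault. `lspci -tv` shows the same tree in one shot.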




context:
I'm not sure this is actually a problem, other than the 100GB of syslog events per hour (I managed to suppress that). Claymore is rock solid, but ETH is no longer fun to mine ;) ccminer always dies within 5 to 10 minutes, so I'm assuming it issues some instructions that trip up these broken risers or cards...