I know how to identify a broken card by playing with the fans (duh), but I can't interpret syslog or dmesg to pinpoint the faulty one.
Any ideas?
dmesg dump
https://wrzutnik.net/quattro/image.php?dm=XA7Z
[54616.169199] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.169205] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.169208] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.169209] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
[54616.172459] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.172465] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.172467] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.172470] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
[54616.181447] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.181452] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.181455] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.181457] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
[54616.186888] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.186894] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.186897] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.186899] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
[54616.191988] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[54616.191994] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e8(Receiver ID)
[54616.191997] pcieport 0000:00:1d.0: device [8086:a298] error status/mask=00000001/00002000
[54616.191999] pcieport 0000:00:1d.0: [ 0] Receiver Error (First)
[54616.192004] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.192009] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.192010] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.192012] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
[54616.192015] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.192022] pcieport 0000:00:1c.5: can't find device of ID00e5
[54616.192365] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.192370] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.192373] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.192374] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
[54616.193515] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[54616.193520] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e8(Receiver ID)
[54616.193522] pcieport 0000:00:1d.0: device [8086:a298] error status/mask=00000001/00002000
[54616.193528] pcieport 0000:00:1d.0: [ 0] Receiver Error (First)
[54616.193793] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[54616.193798] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e8(Receiver ID)
[54616.193800] pcieport 0000:00:1d.0: device [8086:a298] error status/mask=00000001/00002000
[54616.193801] pcieport 0000:00:1d.0: [ 0] Receiver Error (First)
[54616.194941] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[54616.194947] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[54616.194949] pcieport 0000:00:1c.5: device [8086:a295] error status/mask=00000001/00002000
[54616.194951] pcieport 0000:00:1c.5: [ 0] Receiver Error (First)
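One way to read the dump above: the 4-hex-digit `id=` field is a PCI requester ID, i.e. the bus number in the high byte and device/function packed into the low byte. A minimal Python sketch that decodes it and tallies events per port is below. It assumes the kernel prints the messages in exactly the format shown here (newer kernels may word them differently), and the names `decode_id` and `count_aer` are made up for illustration.

```python
import re
from collections import Counter

# Matches the "Corrected error received" lines as they appear in this dump.
AER_LINE = re.compile(
    r"pcieport (\S+): AER: Corrected error received: id=([0-9a-fA-F]{4})"
)

def decode_id(aer_id):
    """Turn a 16-bit requester ID such as '00e5' into 'bus:dev.fn'."""
    value = int(aer_id, 16)
    bus = value >> 8            # high byte: bus number
    dev = (value >> 3) & 0x1F   # next 5 bits: device number
    fn = value & 0x7            # low 3 bits: function number
    return f"{bus:02x}:{dev:02x}.{fn}"

def count_aer(dmesg_text):
    """Count corrected-error events per (reporting port, decoded source)."""
    counts = Counter()
    for match in AER_LINE.finditer(dmesg_text):
        port, aer_id = match.groups()
        counts[(port, decode_id(aer_id))] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (assumed): dmesg | python3 aer_count.py
    for (port, source), n in count_aer(sys.stdin.read()).most_common():
        print(f"{n:8d}  port {port}  source {source}")
```

Here `00e5` decodes to `00:1c.5` and `00e8` to `00:1d.0`, so in both cases the error is being reported by the root port's own receiver. That points at the links hanging off those two ports (riser, cable, or card) rather than at a GPU address directly.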
nvidia-smi
https://wrzutnik.net/quattro/image.php?dm=QBXE
Tue Jun 26 15:31:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 90% 57C P2 192W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 90% 56C P2 186W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 90% 58C P2 186W / 190W | 1010MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 90% 53C P2 186W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 90% 55C P2 188W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:07:00.0 Off | N/A |
| 90% 60C P2 185W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 90% 56C P2 193W / 190W | 1010MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 90% 52C P2 186W / 190W | 1010MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 8 GeForce GTX 108... Off | 00000000:0A:00.0 Off | N/A |
| 90% 49C P2 192W / 190W | 1010MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 9 GeForce GTX 108... Off | 00000000:0B:00.0 Off | N/A |
| 90% 55C P2 193W / 190W | 1010MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 10 GeForce GTX 108... Off | 00000000:0C:00.0 Off | N/A |
| 90% 56C P2 189W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 11 GeForce GTX 108... Off | 00000000:0D:00.0 Off | N/A |
| 90% 54C P2 192W / 190W | 1010MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
context:
Not sure if this is actually a problem, other than 100 GB of syslog events per hour (I managed to suppress that). claymore is rock solid, but ETH is no longer fun to mine. ccminer always dies within 5 to 10 minutes, so I am assuming it is issuing some instructions that trip up these broken risers or cards...
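To go from a complaining root port to a card, one option is to look at what sits below that port in sysfs and compare it with the Bus-Id column of nvidia-smi. The sketch below is only a rough idea, assuming the usual Linux layout where `/sys/bus/pci/devices/<port>` contains one subdirectory per device directly behind the port. The function names are hypothetical, and if a riser or switch sits in between, the first level below the port may be a bridge rather than the GPU itself.

```python
import os
import re

# Full PCI address as used for sysfs directory names, e.g. 0000:03:00.0
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

def devices_below(port, sysfs_root="/sys/bus/pci/devices"):
    """List PCI addresses found directly under the given port's sysfs directory."""
    port_dir = os.path.join(sysfs_root, port)
    return sorted(name for name in os.listdir(port_dir) if PCI_ADDR.match(name))

def to_nvidia_bus_id(addr):
    """nvidia-smi prints an 8-digit domain and upper-case hex, e.g. 00000000:0A:00.0."""
    return ("0000" + addr).upper()

if __name__ == "__main__":
    # The two ports that show up in the dmesg dump.
    for port in ("0000:00:1c.5", "0000:00:1d.0"):
        for addr in devices_below(port):
            print(port, "->", addr, "(nvidia-smi Bus-Id:", to_nvidia_bus_id(addr) + ")")
```

Whatever Bus-Id this prints for `0000:00:1c.5` and `0000:00:1d.0` would be the first cards (or risers) to swap or reseat.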