Massive Antminer failure | Bitcointalksearch.org

ccgllc

copper member

Activity: 658

Merit: 101

Math doesn't care what you believe.

Quote from: packetsmurf on April 29, 2018, 02:01:23 PM

It sounds like the switch went bad. DHCP should not have given out duplicate IPs to that many devices. If the devices that are down are all connected to the same switch, then what is the common point of failure? The switch.
Especially if you can't ping past the switch, like the FW.

That was my first thought, although warped by the fact that a FEW miners were continuing to work just fine. Swapping parts now to try and isolate. It's almost feeling like I have a double failure somewhere.

And now another anomaly: even when I get the network to the point where I can ping the machines, somewhat under half of them keep rebooting themselves. (FYI: it's a nice cool day, no temperature problems.) I've even disabled all Awesome Miner rules that would do that, but I'm still seeing a few reboots. For instance, I'll see 66 of my 76 14TH machines finally make it up, only to have 30 of them reboot.

Update: Well, one GS316 was definitely flaky. Swapped my left side one for my right side one and the right side came up, and the miners are stable. Either I have (3) bad switches, all of which died at 9am this morning (and yes, they are on UPS power), or something else is going on.

Update 2: Progress, all but 21 devices are up.

Update 3: All but one L3 up and running. All (26) L3s appear to be normal: no red light, network lights flashing, green light flashing. So the one that is down is pretty near impossible to find after this stressful day. I can live with it until I get some sleep and booze.

The primary problem was a bad network switch between the firewall and the edge switches. I'm upgrading both of those "core switches" (if you can call them that, since they are unmanaged) to Netgear Pro series. Discovery was complicated by the bad switch being partially functional on an intermittent basis. It was also hampered by a backup switch (an old 8-port TP-Link I had left over) not working reliably either.

packetsmurf

newbie

Activity: 8

Merit: 5

It sounds like the switch went bad. DHCP should not have given out duplicate IPs to that many devices. If the devices that are down are all connected to the same switch, then what is the common point of failure? The switch.
Especially if you can't ping past the switch, like the FW.
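
If it helps, the "common point of failure" idea can be checked mechanically. This is only a rough sketch: the device names, switch labels, and reachability flags are made up for illustration, and you would fill them in from your own inventory and ping results.

```python
from collections import Counter

def down_ratio_by_switch(devices):
    """Return {switch: fraction of its devices that are unreachable}.

    devices is an iterable of (name, switch, reachable) tuples.
    """
    total = Counter()
    down = Counter()
    for _name, switch, reachable in devices:
        total[switch] += 1
        if not reachable:
            down[switch] += 1
    return {switch: down[switch] / total[switch] for switch in total}

# Hypothetical inventory: (device name, upstream switch, answered ping?)
devices = [
    ("s9-01", "right-gs316", False),
    ("s9-02", "right-gs316", False),
    ("s9-03", "right-gs316", True),
    ("l3-01", "left-gs316", True),
]

for switch, ratio in sorted(down_ratio_by_switch(devices).items()):
    print(f"{switch}: {ratio:.0%} of devices down")
```

A switch with most of its devices down is the first suspect. If the down devices are spread evenly across every switch, look further upstream.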

ccgllc

copper member

Activity: 658

Merit: 101

Math doesn't care what you believe.

Quote from: swanny88 on April 29, 2018, 01:51:19 PM

Sounds like something is going wrong with your backbone. Do you have all your switches daisy chained? Do you have a DHCP server setup or are all the miners working off of DHCP?

Agree. Left and right side switches meet at the firewall. Everything is working off of DHCP... but that brings up a good point. Possibly some of my "Down" machines have simply been given new addresses! Off to check...
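
For anyone wanting to do the same check, here is a minimal sketch of comparing the addresses I expect against the current DHCP leases, keyed by MAC. The MACs and IPs are invented, and how you export the lease table from your firewall is up to you; this only shows the comparison.

```python
def normalize(mac):
    """Lower-case a MAC address and use ':' separators so both tables compare equal."""
    return mac.strip().lower().replace("-", ":")

def find_moved_hosts(known, leases):
    """Return {mac: (old_ip, new_ip)} for hosts whose leased IP differs from the known one.

    known and leases are both dicts of MAC -> IP address.
    Hosts with no current lease are skipped, since they may really be down.
    """
    current = {normalize(mac): ip for mac, ip in leases.items()}
    moved = {}
    for mac, old_ip in known.items():
        key = normalize(mac)
        new_ip = current.get(key)
        if new_ip is not None and new_ip != old_ip:
            moved[key] = (old_ip, new_ip)
    return moved

# Hypothetical data: what the monitoring tool expects vs. what DHCP handed out
known = {
    "AA:BB:CC:00:00:01": "192.168.1.50",
    "AA:BB:CC:00:00:02": "192.168.1.51",
}
leases = {
    "aa:bb:cc:00:00:01": "192.168.1.50",
    "aa:bb:cc:00:00:02": "192.168.1.120",
}

for mac, (old_ip, new_ip) in find_moved_hosts(known, leases).items():
    print(f"{mac}: {old_ip} -> {new_ip}")
```

Anything this prints is a machine that is up but living at a new address, not one that is actually down.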

swanny88

newbie

Activity: 103

Merit: 0

Sounds like something is going wrong with your backbone. Do you have all your switches daisy chained? Do you have a DHCP server setup or are all the miners working off of DHCP?

ccgllc

copper member

Activity: 658

Merit: 101

Math doesn't care what you believe.

Around 9am CST6CDT the vast majority of my antminer farm went offline. At the moment I have about 150 S9s and a few L3s offline.

The miners physically look fine, with flashing green lights, no red lights (mostly), network activity, etc. However, the majority cannot be pinged, and a few that can be pinged are not starting up.

Current network topology: PFSense firewall -> right side Netgear GS316 -> Netgear GS316 -> Antminer S9s
-> left side Netgear 10-100 switch on right side -> Netgear 7 port gigabit switch on left side -> Netgear GS316 -> Antminer S9s & L3s
-> Avalon 821s & 841s

My normal topology is to have the right side Netgear GS316 also support the left side, but I split the network this morning, using a redundant feed I had back to the PFSense firewall, in an attempt to isolate the problem.

Of my (76) 14TH S9s, (24) are active; the rest cannot be pinged. Likewise, (9) of my (25) 13.5TH S9s are reachable, mostly on the right side.

The left side has (11) of (13) 13TH S9s working, (2) T9s, and (2) of (8) "problem children" S9s. All of the Avalons are fine, but of course, they are clustered behind a few PIs, so have a lower network port count. (22) of my (26) L3s cannot be pinged.

So both the left side and the right side are having problems, and they are independent of each other network-wise back to the PFSense firewall box. Occasionally I'm seeing Antminers go blinking red, but a quick power cycle clears that.

I'm at my wits' end without a clue. I'd be fine if I lost a switch. But my problem children appear to be spread across several switches, and in fact, several physical networks. The LAN side IP addressing is shared at the firewall, but I can't see how that would be a problem.

Although growing (with the latest batch of 34 mixed Antminers being added early this week), the network has been otherwise stable until this morning.

Somebody please! Give me some ideas of what I am overlooking...

Topic: Massive Antminer failure (Read 276 times)