Fan control still needs work.
Got the alert from litecoinpool that the apollo had dropped off this morning. I could not get to the web front end, so I ignored it, figuring it had crashed.
The L3+ that is on the same power supply never dropped so I knew the power and internet were good.
Got in to work and the fan was not spinning AND the unit was about 1 billion degrees (give or take a bit). It was actually hot enough to darken the thermal ink label on the controller.
Once it cooled down it seems to have come back no problem.
I now have the fan set to 25% instead of auto.
In the next revision can you have the fan fail to max speed in case of any issues?
What mode were you running in? Eco, Balanced or Turbo?
Eco
Looking at the content of the current sdcard image and the implementation of the fan control, there does appear to be a failure scenario that could explain what you saw happen.
The startup process of the Apollo first starts the network, followed by apollo.service and bfgminer.service. apollo.service is the UI itself and it has been configured to restart on failure. bfgminer.service, on the other hand, runs the hardware monitoring and fan control process (apollo_hwmon) and then, after a 20 second delay, starts bfgminer itself. The problem here is that bfgminer.service contains several processes and the init system hasn't been configured to monitor any of them and restart them on failure. For some reason, possibly for implementation simplicity, the fan control is a separate process instead of being part of the bfgminer driver.
I suspect that setup results in a situation where, if the apollo_hwmon process dies for some reason, nothing is there to notice and restart it. Since the process is handling the pwm signal generation for the fan, this results in the fan either shutting down or getting stuck at 100%, depending on what level the pwm output was at when the process died. With a 25% manual setting the signal would be high for about a quarter of each cycle, so I'd guess there's a 25% chance of getting stuck at 100% rpm and 75% for a fan shutdown (0% rpm).
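To make the 25%/75% guess a bit more concrete, here is a rough simulation. It assumes things I can't verify without the apollo_hwmon source: that the pwm is generated in software, that the output pin simply holds its last level when the process dies, that a high level means full speed, and that the process dies at a random point in the pwm cycle. The function names are made up for illustration.

```python
import random


def level_at(t, period=1.0, duty=0.25):
    """Return True if an idealised software PWM output is high at time t."""
    return (t % period) < duty * period


def stuck_high_fraction(duty, trials=100_000, seed=0):
    """Estimate how often a process dying at a random moment leaves the pin high.

    Assumption: the pin keeps whatever level it had when the process died.
    """
    rng = random.Random(seed)
    high = sum(
        level_at(rng.uniform(0.0, 1000.0), duty=duty) for _ in range(trials)
    )
    return high / trials


if __name__ == "__main__":
    # Fraction of simulated crashes that leave the fan stuck at full speed.
    print(stuck_high_fraction(0.25))
```

Under those assumptions the fraction of crashes that leave the pin high comes out close to the duty cycle itself, which is where the roughly one in four estimate comes from.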
The easiest fix would be to move the apollo_hwmon process to a separate systemd service configured with the same restart on failure monitoring that the UI has. Having the separate service would also allow chaining the bfgminer process/service as a dependency on the fans having started, instead of blindly assuming that after 20 seconds everything should be ok. Longer term, the better approach would most likely be to have the fan control in the bfgminer process (provided by the Apollo driver), as the unit isn't producing much heat if the bfgminer process itself isn't running.
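A rough sketch of what that separate service could look like is below. The unit name, file locations and the path to the apollo_hwmon binary are all placeholders I made up, not what is on the image, and the binary may need arguments or environment that I can't see. Only the directives themselves (Restart=, RestartSec=, Requires=, After=) are standard systemd.

```ini
# --- File 1: /etc/systemd/system/apollo-hwmon.service (hypothetical name and path) ---
[Unit]
Description=Apollo hardware monitor and fan control (sketch)

[Service]
# Hypothetical binary location; the real path on the image may differ.
ExecStart=/opt/apollo/bin/apollo_hwmon
# Restart the fan control if it exits uncleanly or is killed by a signal.
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target

# --- File 2: /etc/systemd/system/bfgminer.service.d/fan-dependency.conf (hypothetical drop-in) ---
[Unit]
# Only start the miner once the fan control service has been started.
Requires=apollo-hwmon.service
After=apollo-hwmon.service
```

Note that with the default service type, After= only guarantees the fan control process has been launched, not that it has finished initialising the fan, so this is an improvement over the fixed 20 second delay rather than a full guarantee.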
Disclaimer: this is what I can see based on what's visible in the image itself. Some of it could be wrong since, for example, the source code of the apollo_hwmon binary isn't available.