[Antminer S9] Miner stops hashing - all stats freeze except LSTime

philipma1957

legendary

Activity: 4354

Merit: 9201

'The right to privacy matters'

Quote from: Artemis3 on January 29, 2019, 11:22:31 AM

And perhaps the pool's AB implementation varies as well? Maybe one likes BraiinsOS's (cgminer's?) more than Bitmain's (bmminer's), and with another pool it's just the opposite?

could be.

I will need to spring for Awesome Miner as I will be growing units.

I must say my M10s = godlike; they just run and run and run and run and run.
I am up to 6 of them.

Artemis3

legendary

Activity: 2030

Merit: 1573

CLEAN non GPL infringing code made in Rust lang

Quote from: Biffa on January 29, 2019, 04:57:54 AM

Also, as AB is something that needs to be coded at both the miner and the pool side, could it be pool related? I moved one to a different pool after it kept happening and it's been stable for days now.

And perhaps the pool's AB implementation varies as well? Maybe one likes BraiinsOS's (cgminer's?) more than Bitmain's (bmminer's), and with another pool it's just the opposite?

Biffa

legendary

Activity: 3234

Merit: 1221

Quote from: fanatic26_ on January 28, 2019, 03:30:45 PM

I can say with certainty that this issue is power supply agnostic. I have 6000+ non Bitmain PSUs powering S9s and T9s and the issue happened with them as well.

Well that scuppers that theory

All my S9s are AB enabled, and I'm seeing it sporadically in 3 of the machines. Although I have to say, even before AB I had to reboot S9s sporadically as well, obviously for different reasons.

Sometimes it happens 5 minutes after rebooting, sometimes it lasts for days.

Let's face it, these machines are all individual in the way they behave anyway. Maybe some machines are more sensitive to the changes that AB makes to how it runs, perhaps more sensitive to changes in voltage or to how the AB firmware tries to maintain hashrate at lower power levels.

Just had to reboot one 5 mins ago. It was up, the web interface was accessible, and miner status showed it online, but it was not updating the Elapsed, RT or AVG numbers. Interestingly, one board had stopped working, and of course there was no hashrate on the pool and nothing in the log. Rebooted it and it's hashing away again.

Also, as AB is something that needs to be coded at both the miner and the pool side, could it be pool related? I moved one to a different pool after it kept happening and it's been stable for days now.

philipma1957

legendary

Activity: 4354

Merit: 9201

'The right to privacy matters'

Quote from: Artemis3 on January 28, 2019, 09:15:44 PM

And turning off LPM and enhanced LPM modes doesn't work?

This would be my question.

I finally have begun expanding my mining.

I have 19 S9s.
I have 53 different units all told and will be going to 100 units.
Then 150 units.

I have the same freeze symptom. Has anyone tried leaving the boxes unchecked and allowing full speed?

Artemis3

legendary

Activity: 2030

Merit: 1573

CLEAN non GPL infringing code made in Rust lang

Quote from: tim-bc on January 28, 2019, 07:00:04 PM

So I see that this bug is exclusive to Bitmain's LPM and enhanced LPM firmwares. I wonder why this happens and whether they are aware of it or plan to fix it.

Some miners always seem to freeze at the same time, sometimes as little as 10-15 minutes after booting. The only solution for these seems to be to downgrade to non-LPM firmware (or an alternative firmware)...

And turning off LPM and enhanced LPM modes doesn't work?

tim-bc

full member

Activity: 538

Merit: 175

So I see that this bug is exclusive to Bitmain's LPM and enhanced LPM firmwares. I wonder why this happens and whether they are aware of it or plan to fix it.

Some miners always seem to freeze at the same time, sometimes as little as 10-15 minutes after booting. The only solution for these seems to be to downgrade to non-LPM firmware (or an alternative firmware)...

fanatic26_

full member

Activity: 294

Merit: 129

Quote from: Biffa on January 28, 2019, 01:27:54 PM

How cold is it where the miners are? I'm seeing this sporadically on miners in very cold conditions, and it may be related more to the Bitmain PSU's than the miners themselves it seems.

I can say with certainty that this issue is power supply agnostic. I have 6000+ non Bitmain PSUs powering S9s and T9s and the issue happened with them as well.

Biffa

legendary

Activity: 3234

Merit: 1221

How cold is it where the miners are? I'm seeing this sporadically on miners in very cold conditions, and it may be related more to the Bitmain PSU's than the miners themselves it seems.

tim-bc

full member

Activity: 538

Merit: 175

Quote from: kano on January 09, 2019, 07:44:08 PM

No - wipe them, put back the original firmware, update them to asicboost and then they should be OK.
Clearly there was something wrong with the firmware on them when you got them ...

I tried that; some of the miners were used and had an older Sept '17 firmware before putting the asicboost files on. I reflashed the whole filesystem to Nov 2017 autofreq (which fixed the issue) and then flashed the asicboost again (which caused the issue to reoccur).

It seems like it will freeze up at a consistent time on each individual miner, but the time that it freezes up varies between miners.

Also there are a lot of miners that were new and have only ever had the Nov 2017 autofreq firmware and they also present the same symptoms after being boosted.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

Quote from: tim-bc on January 09, 2019, 07:20:10 PM

...

Quote from: kano on January 09, 2019, 06:50:17 PM

... use original oem firmware ...

The asicboost "firmware" is straight from Bitmain. Besides, nowadays using non-asicboost firmware with the S9 is pointless and often unprofitable.

No - wipe them, put back the original firmware, update them to asicboost and then they should be OK.
Clearly there was something wrong with the firmware on them when you got them ...

tim-bc

full member

Activity: 538

Merit: 175

Quote from: fanatic26_ on January 09, 2019, 01:35:18 PM

I have the same issue trying to figure out which are asicboost and which aren't. Sometimes the firmware date changes to the Nov 2 date of the ASICboost patch, and sometimes it retains the underlying firmware's date, so you actually have to log in and look for the LPM checkbox.

Are you sure about the date thing? Sounds like the firmware might have been flashed incompletely or something. All of my asicboosted miners have Nov 2018 file dates. None of my non-asicboost miners have the LPM checkbox either.

Quote from: fanatic26_ on January 09, 2019, 01:35:18 PM

From my data I have not seen any correlation between machine state (e.g. a bad board) and the fact that it stops submitting data. The VAST majority of my machines are in perfect working order and all exhibit the same symptoms.

I wish I could say the same about these machines here. It turns out that was a red herring anyway as these machines here were apparently flashed with asicboost in order of descending hashrate (roughly).

Quote from: kano on January 09, 2019, 06:50:17 PM

... use original oem firmware ...

The asicboost "firmware" is straight from Bitmain. Besides, nowadays using non-asicboost firmware with the S9 is pointless and often unprofitable.

kano

legendary

Activity: 4634

Merit: 1851

Linux since 1997 RedHat 4

... use original oem firmware ...

fanatic26_

full member

Activity: 294

Merit: 129

Quote from: tim-bc on January 08, 2019, 11:48:52 AM

Looks like I was mistaken again. So I got some high-level stats and here is what I found...

I have the same issue trying to figure out which are asicboost and which aren't. Sometimes the firmware date changes to the Nov 2 date of the ASICboost patch, and sometimes it retains the underlying firmware's date, so you actually have to log in and look for the LPM checkbox.

The real pain with this issue is that the API and everything else still report the machine as hashing, so if you don't happen to notice in your dashboard that the hashrate is always identical, it can take a while to realize what's happening with the last submission issue.
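One way to catch the "hashrate is always identical" case without eyeballing the dashboard is to compare two consecutive polls of the same miner. This is only a sketch: it assumes each poll is already a dict of stats, and the key names `Elapsed` and `GHS 5s` are cgminer-style names that may differ on a given firmware.

```python
def stats_frozen(previous, current, keys=("Elapsed", "GHS 5s")):
    """Return True if every watched field is present in both polls and unchanged.

    `previous` and `current` are dicts from two polls taken some time apart.
    The default key names are an assumption; pass whatever your API returns.
    """
    return all(
        k in previous and k in current and previous[k] == current[k]
        for k in keys
    )
```

A healthy miner should always show a larger `Elapsed` on the second poll, so a True result here is a strong hint that the stats have stopped updating.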

From my data I have not seen any correlation between machine state (e.g. a bad board) and the fact that it stops submitting data. The VAST majority of my machines are in perfect working order and all exhibit the same symptoms.

tim-bc

full member

Activity: 538

Merit: 175

Quote from: ?? on ??

Wow, it's good to know that I'm not the only one with this issue. I do remember that the first S9 asicboost firmware from October was definitely bugged. I had to upgrade those again with the newer one. I too have a script in the works to read LST and reboot automatically.

The only thing is, I found some miners that hadn't been asicboosted yet... they had a firmware from Sept 2017 and had the same issue.

Before deploying any auto-reboot scripts I'm going to generate a report on all of the miners to try to get some stats on asicboost vs non-asicboost miners etc. It should be easy since I already have firmware and asicboost stats in the database.

Looks like I was mistaken again. So I got some high-level stats and here is what I found:

Farm Summary

6.1% of miners had a Last Share time of greater than 1 day, evenly distributed from 1-32 days

99.4% of affected miners were ASICBoosted S9, other 0.6% were ASICBoosted S9i. So the issue is definitely related to ASICBoost firmware.

Weird thing

42.3% of affected miners had LST greater than 7 days. Out of these miners, 99.7% had a (frozen stat) hashrate greater than 11 TH/s.

For the affected miners with LST less than 7 days, only 11.1% of these miners had a (frozen stat) hashrate greater than 11 TH/s. 88.5% of them had one bad hashboard and thus reported between 8-11 TH/s.
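For anyone who wants to run the same kind of breakdown on their own farm, a summary like this can be computed from a per-miner table roughly as follows. The record keys (`last_share_age_s`, `firmware`) are hypothetical, not the schema of the database mentioned above, and the one-day cutoff simply mirrors the "greater than 1 day" criterion.

```python
from collections import Counter

DAY_S = 86400  # seconds in one day


def summarize(miners, stale_after_s=DAY_S):
    """Return (fraction of miners affected, Counter of affected miners by firmware).

    `miners` is a list of dicts with the hypothetical keys
    'last_share_age_s' (seconds since last share) and 'firmware' (a label).
    """
    affected = [m for m in miners if m["last_share_age_s"] > stale_after_s]
    fraction = len(affected) / len(miners) if miners else 0.0
    by_firmware = Counter(m["firmware"] for m in affected)
    return fraction, by_firmware
```

Multiplying the returned fraction by 100 gives the farm-wide percentage, and dividing each firmware count by the number of affected miners gives the per-firmware split.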

fanatic26_

full member

Activity: 294

Merit: 129

For me this issue is 100% related to the ASICboost firmware. It was never an issue before the patches. As a matter of fact, the second ASICboost patch was a fix for the stalling found in the original release.

We have yet to figure out a way to keep them running, so we are developing something that can read the LST and reboot as needed. All of the API info freezes at the point the miner stops mining, yet it keeps reporting those final working values as current even though they are not.
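A minimal sketch of the "read the stats" half of such a tool might look like this in Python. It assumes the miner exposes the cgminer-style JSON API on TCP port 4028, that the connection is closed after one reply, and that the reply is valid JSON with an optional trailing NUL byte; none of that is confirmed here for every firmware build. Which field holds the last share time, and in what format, is left to the caller.

```python
import json
import socket


def decode_reply(raw):
    """Decode a raw API reply (bytes) into a dict, dropping trailing NUL bytes."""
    return json.loads(raw.rstrip(b"\x00").decode("utf-8", errors="replace"))


def query(host, command, port=4028, timeout=5.0):
    """Send one command (e.g. 'summary' or 'pools') and return the parsed reply.

    Port 4028 and the {"command": ...} request shape are assumptions based on
    the cgminer API; adjust them if your firmware differs.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(json.dumps({"command": command}).encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # peer closed the connection: reply is complete
                break
            chunks.append(chunk)
    return decode_reply(b"".join(chunks))
```

Something like `query("192.168.1.50", "pools")` (the address is a placeholder) would then return the pool section, from which the last share time can be read and compared against a threshold.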

tim-bc

full member

Activity: 538

Merit: 175

Hello all,

I've come across a widespread issue in a farm today that I could use your help with.

It started when I received notice that a large number of miners (all of them Antminer S9 base model, mostly 13.5 TH/s) appeared to be hashing fine when looking at web miner status or miner API, but on the pool API it showed no hashrate for these miners (no shares being submitted). When I look at these miners through either the status page or API, they appear to be hashing, but all of the stats are frozen. For example, the "elapsed" time does not increase, realtime hashrate does not change, nothing. Kernel log does not change and no error messages are present.

The eerie thing is that LSTime (last share time) still increases as normal. And it also looks like fan_num is always 0, and all fan speeds are zero.

When I reboot the miners, everything comes back up normally. Stats stay updated, fan speeds show up, shares get submitted (confirmed by pool). So what I am planning to do is make a script that checks all of the APIs so that, e.g., if hashrate > 1 TH/s but last share time > 10 min, it reboots the miner.
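The check itself could be sketched in Python like this. The `MinerSample` fields are made up for illustration (they are not bmminer API keys), the thresholds are just the 1 TH/s and 10 min from above, and the actual polling and reboot calls are left out.

```python
from dataclasses import dataclass


@dataclass
class MinerSample:
    """One poll of a miner. Field names are hypothetical, not API keys."""
    hashrate_ths: float      # hashrate the miner itself reports, in TH/s
    last_share_age_s: float  # seconds since the last share was submitted


def needs_reboot(sample, min_hashrate_ths=1.0, max_share_age_s=600.0):
    """True when the miner claims to be hashing but has not submitted a share recently."""
    return (
        sample.hashrate_ths > min_hashrate_ths
        and sample.last_share_age_s > max_share_age_s
    )
```

Requiring a non-trivial reported hashrate keeps the rule from firing on miners that are genuinely off or still starting up, which would need a different kind of attention than a reboot.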

The thing is that I have no clue why this is happening, and what might cause it to reoccur. Any thoughts?
