
Topic: Hacking Antminer S17's and T17's because.... are we up to these already???? (Read 469 times)

legendary
Activity: 4088
Merit: 7701
'The right to privacy matters'
Well this is an annoying step back: The boards ran, but would still drop out from time to time, always with the same chips. So chip replacement time which means I needed a new testing unit. Picked one up and asked for the T17+ firmware but instead got the T17 firmware for a 48 chip unit.

Warning: This is bad. When I plugged it into the boards it kept coming up finding zero chips, which wasn't making a whole lot of sense. Then when I plugged the board back into the Braiins controller it refused to recognize the boards, saying the checksums were wrong. Apparently the tester changes the firmware on the board, which is not good. More annoying, the code for the T17+ testing suite has errors in it that cause it to not work at all.

Back to the drawing board. Now I need to re-flash the chips which is a pain if you don't have the code. If the T17+ tester code doesn't work I'm going to have to send these into a repair center. Boo!

I'm starting to remember why I stopped fixing Antminers at the S9 level. So many versions, so many quirks, and not great assembly quality. Ah well...

However a warning for listeners: Make absolutely sure you have the right testing code running on your miner, don't trust the seller or anyone. The wrong test code can trash the boards.


Anything new on this?
legendary
Activity: 1988
Merit: 1561
CLEAN non GPL infringing code made in Rust lang
this is good news in the sense that it will be reported as a bug.

also I am not sure if it was the december the april or the june firmware.

they may have fixed it already.

Well, I'm not up to reporting it as a bug, but even with *every* "Do not tune" option off, it still wants to shut down boards in the name of "tuning". When I run the boards at say 400mhz they will run at much slower hashrates even than what is "expected" (the nominal hashrate). This is enough to trip the tuner that runs every 30 minutes.

By chance I just happened to read this thread, and I'm forwarding this info to the devs, thank you for your valuable feedback Smiley
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Honest answer: No.

Longer answer: I think the T series units are the "binned" chips and parts from the Antminer world. They put the best chips after testing into the S series units, and put the more "oddball" series chips in a lower performance box that is underclocked, undervolted, and call it the "T" series. Likewise I think boards that have mediocre solder jobs and the like are put into the "T" bin, matched, and sold as units.

Thus with proper firmware a T17+ could hash at a higher than badged rate. However those chips may fail in use under conditions that an S series box won't. This is why I've been merrily chasing chip failures on the T17's, sure you can replace a chip but another one is not quite so far behind.

Antminers are not quite the um.... highest quality things I have worked on. They blow up easily, solder isn't that great, and the difference between peak chip temp and the point where the chips fall off the board is not a whole lot of C. The cases are good, and the controllers are mostly good.

Another thing to watch is the earlier T17's and T15's used clips to hold the boards to the power supply rail while the later ones used real screws. Given the current draw on these boards I think the clips and loose screws could explain a number of burned top boards. Go for the screws.

If you're going to get one I'd recommend you do new (because everyone who buys one will run it with nuts firmware till it starts dropping boards, then sell it as "used") and run it with the supplied firmware (which is probably warrantied to work as long as the warranty lasts). Or run it with faster firmware, then sell it on Ebay when it starts tossing errors (just disclose it, pls).

Sorry, been busy for a bit here, will post some more observations next week including some pics of boards that had a bit of a.... shipping failure. :-)
newbie
Activity: 14
Merit: 0
Would you guys recommend Antminer T17 at this point in time? I want to get into ASIC (bitcoin) mining and I know a local reseller that is offering them for $1200 used. I have read they can be problematic but that price point is appealing. FYI I pay 6 cents per kWh.

Thanks!
member
Activity: 68
Merit: 40
S17's and such have a nice little extra feature: They have that copper top that has solder on it, remember?

Then take out the board and put the heat sink on. Remember, no pre-heat, put a bit of flux on the center of the heat sink, line up the sink perfectly (the flux will hold it in place) then low flow heat from the air tool to secure the heat sink again.

I'm a bit confused by this. If you put on a new chip and clean the solder off the bottom of the removed heat sink, you don't need solder paste and can just put the heat sink on with a little tacky flux, because that copper on top of the new chip already has solder on it?
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Well this is an annoying step back: The boards ran, but would still drop out from time to time, always with the same chips. So chip replacement time which means I needed a new testing unit. Picked one up and asked for the T17+ firmware but instead got the T17 firmware for a 48 chip unit.

Warning: This is bad. When I plugged it into the boards it kept coming up finding zero chips, which wasn't making a whole lot of sense. Then when I plugged the board back into the Braiins controller it refused to recognize the boards, saying the checksums were wrong. Apparently the tester changes the firmware on the board, which is not good. More annoying, the code for the T17+ testing suite has errors in it that cause it to not work at all.

Back to the drawing board. Now I need to re-flash the chips which is a pain if you don't have the code. If the T17+ tester code doesn't work I'm going to have to send these into a repair center. Boo!

I'm starting to remember why I stopped fixing Antminers at the S9 level. So many versions, so many quirks, and not great assembly quality. Ah well...

However a warning for listeners: Make absolutely sure you have the right testing code running on your miner, don't trust the seller or anyone. The wrong test code can trash the boards.
legendary
Activity: 4088
Merit: 7701
'The right to privacy matters'
Nice work. I suspected it was flawed from the getgo.

All in all I got about twenty-three 17-series variants.

a lot of s17pros. they have been very good to me.

but a few bad t17+ and or t17e units.

They need fixing.

We can figure a good way to nurse them along one at a time due to the higher earnings we will get coin-wise at the next jump.

I can see from my other S17, set to Braiins, that two boards must have some poor solder flow. Also I have a board that dropped a heat sink.

All in all I count 69 boards in 23 units.

I had two boards fully die out. I sold them.
I had four boards become weak; you just repaired one.

Which means 63 boards are good.  Still getting all gear fixed means a lot of earnings.

Crazy hot the last three days, 96+ each day.
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Meantime I've been watching the logs on this T17 and noticed that board 1 will occasionally throw this error:

Code:
syslog.old.6:Wed Jun 30 01:24:59 2021 daemon.err bosminer[5684]: Jun 30 05:24:59.640 INFO CHAIN/1: Discovered 14 chips

Looks like the chip after 14 is having an intermittent connection. Great, I love intermittent connections. So took the board out, put it on the preheater, took off the heat sink and here's a few pics:

On the preheater with the sink for chip 15 removed:


Close up of the chip. The side closest to the intake fans had crud on the pads and of course solder balls from a bad factory flow. Bad factory!


After a reflow: The pads no longer have solder balls as the solder is adhered to both the chip *and* the board. Preheat will do that...


And now the board is back in and running with 44 chips. I'm letting it tune as all three boards are now in good shape.
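
For anyone who wants to watch for this without reading syslog by hand, here is a rough sketch of the same check. It only assumes the log line format shown above; the function name and the expected count of 44 chips are my own assumptions (44 is what this board reports when healthy), and the "suspect" is simply the first chip the enumeration did not reach.

```python
import re

# Matches bosminer lines such as "INFO CHAIN/1: Discovered 14 chips".
CHIP_LINE = re.compile(r"CHAIN/(\d+): Discovered (\d+) chips")


def short_chains(log_lines, expected_chips=44):
    """Return (chain, discovered, suspect_chip) for chains that found too few chips.

    expected_chips is an assumption: 44 matches the board in this post.
    suspect_chip is the first chip the enumeration failed to reach.
    """
    problems = []
    for line in log_lines:
        match = CHIP_LINE.search(line)
        if match is None:
            continue
        chain, found = int(match.group(1)), int(match.group(2))
        if found < expected_chips:
            problems.append((chain, found, found + 1))
    return problems


sample = [
    "syslog.old.6:Wed Jun 30 01:24:59 2021 daemon.err bosminer[5684]: "
    "Jun 30 05:24:59.640 INFO CHAIN/1: Discovered 14 chips",
]
print(short_chains(sample))
```

On the sample line it flags chain 1 with 14 chips found and points at chip 15, which is the one that got the reflow here.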
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
I have six units with pc access on Braiins.

all s17 pros.

I will try doing this asap.

So freq 400 set to 17 volts was doing 37.5 th at 1434 watts

now have freq 400 set to 16.1 volts and it is doing 38.1 th at 1310 watts with large temp drop from 86 c to 81 c


next unit

freq at 480
volts at 17.1
temps are 86.6c
watts are 1671

going to only change the volts to 16.1



Hurray! I'm helping reduce global warming by making miners more efficient with this thread! When I get the Nobel prize I'll be sure to mention this forum...
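
To put a number on that, a quick sketch using only the figures quoted above (37.5 th at 1434 watts before, 38.1 th at 1310 watts after). The function name is just for illustration.

```python
def watts_per_th(watts, terahash):
    """Efficiency as watts per TH/s (lower is better)."""
    return watts / terahash


before = watts_per_th(1434, 37.5)  # 17 volts, freq 400
after = watts_per_th(1310, 38.1)   # 16.1 volts, freq 400
saving = (before - after) / before * 100

print(f"{before:.1f} W/TH -> {after:.1f} W/TH ({saving:.1f}% less power per TH)")
```

That works out to roughly 38.2 W/TH dropping to about 34.4 W/TH, so around 10% less power per terahash from the voltage change alone.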
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
If you mean the 'one Vcore for all' boards, I've mentioned that several times in the past. Using the PSU to do all regulation saves several % eff vs each board having final on-board Vcore regulators and is a large part of the higher power eff of modern miners.

I was thinking dropping the voltages manually results in better efficiency. I did wonder if they had additional regulators on the board itself to fine tune the voltage for each chip bank, but whatever they have on there is exceptionally small compared to a real power supply (I should buzz that out someday).

A higher base voltage to the boards is always great because it means less current; at the same time the little TO series FETs they put on boards were always a weak point. Using a nice big-o TO247 in the power supply allows you to have higher switching frequencies, better multi-phase buckers, and overall a lot fewer problems.

Quote
I assume what is happening is that, like Canaan does, the software is slightly tweaking the speed per-chip based on tracking some majik error-rate parameter. Canaan likened it to fiddling with a radio receiver s/n ratio.

Probably: We long ago noticed on the Titans (which were great, they had 8 nice power supplies on the board running in an imploder mode to power each corner of the die in a split phase interleave) that if you cut voltage on the die the efficiency went up right to the point where the chip would start throwing a lot of errors. Tarkin made his mark by writing some code that would step the voltage down on a die a notch, then watch the error rate. If the errors went over 1% it backed the voltage up a notch and called it a day.
Then you bump the clock rates up a tick and see how it works at a higher rate. Properly tuned, a pile of Titans are *STILL* mining at a profit 7 or so years later.
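
A minimal sketch of that step-down idea, not Tarkin's actual code (I don't have it). The 1% error limit is from the description above; the function names, the step size, and the fake error model are all assumptions for illustration.

```python
def tune_voltage(read_error_rate, start_v, step_v=0.01, min_v=0.0, max_error=0.01):
    """Step the voltage down one notch at a time while errors stay at or under max_error.

    read_error_rate(voltage) is a hypothetical callback that applies the voltage,
    lets the die run for a while, and returns the observed error rate (0.0 to 1.0).
    Returns the lowest voltage that stayed within the limit.
    """
    voltage = start_v
    while voltage - step_v >= min_v:
        candidate = round(voltage - step_v, 4)
        if read_error_rate(candidate) > max_error:
            break  # one notch too far: stay on the last good voltage
        voltage = candidate
    return voltage


# Made-up die model: happy at 0.70 V and above, lots of errors below that.
def fake_die(voltage):
    return 0.002 if voltage >= 0.70 else 0.05


print(tune_voltage(fake_die, start_v=0.80))
```

Staying on the last good voltage is the same thing as stepping down, seeing the errors, and backing up a notch. Once the voltage is settled you would run the same kind of loop upward on the clock rate.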

Lower voltage at the chip=less heat which means the chips can run at a higher clock rate. What's weird is a higher base voltage causes the chip to mine *slower*. Maybe there is some internal regulation in the chip that turns off dies if the voltage is too high or something; wish I could go into Braiins' mining code and have it display dies and cores like BFGMINER did. Maybe I can.....
legendary
Activity: 3598
Merit: 2490
Evil beware: We have waffles!

Anyway, just an update: Cutting the voltage really makes a nice difference in temps and power usage. Currently at 16 volts per board and seems a lot happier than 17. Now at 95% real to normal hash rate ratio (remember back at 18.25v I was at 75%) and running 75c with fans at 50% instead of the earlier 100%.

Wonder if anyone else figured this out. Ok, back to writing about how to fix the boards....
If you mean the 'one Vcore for all' boards, I've mentioned that several times in the past. Using the PSU to do all regulation saves several % eff vs each board having final on-board Vcore regulators and is a large part of the higher power eff of modern miners.

I will try doing this asap.

So freq 400 set to 17 volts was doing 37.5 th at 1434 watts
now have freq 400 set to 16.1 volts and it is doing 38.1 th at 1310 watts with large temp drop from 86 c to 81 c

next unit

freq at 480
volts at 17.1
temps are 86.6c
watts are 1671

going to only change the volts to 16.1
I assume what is happening is that, like Canaan does, the software is slightly tweaking the speed per-chip based on tracking some majik error-rate parameter. Canaan likened it to fiddling with a radio receiver s/n ratio.
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Ok, so how did I fix that board? It was reporting only 33 chips and was dead in the water.

The first clue is the chip count. The board was able to start at the bottom left and count its way up on chips. The count is in a serpentine pattern from the big heat sink on the bottom (1) then up, then over, then down, then over then up.... If you count carefully you see that chip 33 is not at the edge of the board on the big side heat sinks, but one chip in.

Given that the hottest spots on the board are the chips where the exhaust fans are (that's why they have bigger heat sinks than the ones by the intake) it's a pretty safe bet to assume that chip 34 is probably open. So we remove the heat sink on chip 34.

To pull a heat sink you want to heat the sink but not the chip under it. So you do not use pre-heating and instead set your air tool to about 350c, low air flow (so the air doesn't just rush off the sink) and heat the sink for about 60 to 90 seconds. When you tap the heat sink lightly with a pliers and it moves you're done.

When you pull a sink, mark its position on the heat sink (say the chip number). That way when you put it back you have exactly the same amount of solder on the chip top and the sink. I'd recommend against adding solder, too much and it will drop onto the chip with the usual results.

With the sink off, let the board cool down and take a look at the chip with your loupe or 10x magnifying goggles. You'll probably see crud on the pins on the side of the chip; clean that off with isopropyl alcohol and a wooden splinter. Don't use metal tools, you will scratch the board or damage the traces.

Once the chip is nice and clean, look at it. If it doesn't look burned or cracked you can try a reflow. Now I see lots of people pulling chips with a set of tweezers and an air tool blasting hot air straight down.

I think this sucks.

You have to heat the chip up *and* the board underneath. Blasting it with air like that will heat the chip up well beyond the solder melting point while it is trying to heat the pads underneath. Because the pads are cooler the solder will not flow as well and you will probably BBQ the chip. Not to mention blowing the smaller components off.

So what do you do?
First, you flux. After cleaning those pin landing pads and getting all the grunge out from the spaces between the pads you put on a *small* amount of flux on each side of the chip. Think "top of toothpick" amount of flux per side. Don't leave a lot, it's supposed to conduct heat, burn off any impurities and help the solder flow.

Next step: You warm the board up first with a pre-heater.

For a board like this I recommend a nice Aoyue 863 IR preheater. Not only is it big enough to hold the whole board, not only is it nice looking, it also has two external temp sensors so you can measure the *board* temperature and keep it from getting too hot.

How hot?
Well, I like to put the control sensor (B) under the board and touching one of the lower heat sinks. The other sensor (C) I put on top of the board under a heat sink near the chip I want to reflow. That way I control the temperature to not get the bottom too hot (which can cause heat sinks to drop off, embarrassing) while watching the top temp to come close to my set point (the heater will cycle on and off but the heat will soak through the board).

So how hot?
I like to pre-heat the board to 100c. It's well below the melting point of the solder, but at the same time it allows the air tool to only have to bring everything up 150 or so C (peak temperature of solder) quickly and evenly. Plus since the board is already warm the solder will easily flow onto both the board pads and the chip pads to make a nice proper connection that Antminer just can't seem to do right all the time...

Then you set your air tool to a low airflow (you don't need to blow things off, more make a little bubble of warm above the chip), about 320c for about a minute and a half tops. Watch the top temp sensor, it's under a nearby heat sink so it's not getting the blast but it should climb up to around 200c or so.

S17's and such have a nice little extra feature: They have that copper top that has solder on it, remember? When it starts turning shiny then the solder is flowing and since you have preheated the board the solder on the landings will turn shiny as well. Hold for a few seconds, then let off the heat.

Let the board cool down with it still on the preheater. Then turn off the preheater after a minute or two and leave the board alone. Don't touch it, screw with it, etc. It will take a good 15-20 minutes for the board to cool down, let it be.

Then set your miner to 50mhz speeds, put the board in without the heat sink, and try it. You will probably see 44 chips; if you do, pull the plug immediately. Then take out the board and put the heat sink on. Remember, no pre-heat, put a bit of flux on the center of the heat sink, line up the sink perfectly (the flux will hold it in place) then low flow heat from the air tool to secure the heat sink again.

That's it. Yes it takes time and you really want a pre-heater. I found the value of those when I was fixing KNC titan and neptune boards: The amount of copper in there would wick away the heat from a blowtorch. Pre-heat is your special friend.

legendary
Activity: 4088
Merit: 7701
'The right to privacy matters'
Any voltage drop would probably be a good indicator of a loose power connection. That would result in hilarity pretty quickly; I see why they went with bolts instead of those tabs on the S15's. And why you find burned boards around the top on ole Ebay...

Anyway, just an update: Cutting the voltage really makes a nice difference in temps and power usage. Currently at 16 volts per board and seems a lot happier than 17. Now at 95% real to normal hash rate ratio (remember back at 18.25v I was at 75%) and running 75c with fans at 50% instead of the earlier 100%.

Wonder if anyone else figured this out. Ok, back to writing about how to fix the boards....

I have six units with pc access on Braiins.

all s17 pros.

I will try doing this asap.

So freq 400 set to 17 volts was doing 37.5 th at 1434 watts

now have freq 400 set to 16.1 volts and it is doing 38.1 th at 1310 watts with large temp drop from 86 c to 81 c


next unit

freq at 480
volts at 17.1
temps are 86.6c
watts are 1671

going to only change the volts to 16.1

legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Any voltage drop would probably be a good indicator of a loose power connection. That would result in hilarity pretty quickly; I see why they went with bolts instead of those tabs on the S15's. And why you find burned boards around the top on ole Ebay...

Anyway, just an update: Cutting the voltage really makes a nice difference in temps and power usage. Currently at 16 volts per board and seems a lot happier than 17. Now at 95% real to normal hash rate ratio (remember back at 18.25v I was at 75%) and running 75c with fans at 50% instead of the earlier 100%.

Wonder if anyone else figured this out. Ok, back to writing about how to fix the boards....
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
this is good news in the sense that it will be reported as a bug.

also I am not sure if it was the december the april or the june firmware.

they may have fixed it already.

Well, I'm not up to reporting it as a bug, but even with *every* "Do not tune" option off, it still wants to shut down boards in the name of "tuning". When I run the boards at say 400mhz they will run at much slower hashrates even than what is "expected" (the nominal hashrate). This is enough to trip the tuner that runs every 30 minutes:

Code:
Jun 28 20:52:29.793 INFO Tune/all: ----- TUNER ITERATION -----
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: Jun 28 20:52:29.794 INFO Tune/1: Evaluated configuration result[iter=2]: voltage:[18.29 V] hashrate:[8645.45 GH/s / 11827.20 GH/s=0.731], chips[underperf/max_expected]:[26/0], power[now/required/limit]:[537 W/598.78/611 W], result:[REJECTED], reason:[Underperforming chip count is above threshold]
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: Jun 28 20:52:29.794 INFO Tune/all: --- 1 ==> RESULTS stage=6 iter=2 voltage=18.2875 ---
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: mcr=73.1% measured_hr=8645.45 calculated_hr=11827.20 avg_freq=400000000
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: power_limit=611 calculated_power=537 measured_power=None
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: chip_below_ratio=59% chip_count=44 chip_count_below_threshold=26
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: satisfactory=false config_from_iter=0
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  0   71        71        83        78     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  1   80        83        72        77     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  2   53        72        80        66     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  3   81        81        73        52     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  4   69        81        78        68     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  5   82        62        71        70     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  6   72        70        72        81     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  7   70        83        74        71     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  8   64        52        73        74     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  9   80        76        63        78     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400 10   83        72        73        81     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: Jun 28 20:52:29.799 INFO Tune/3: Evaluated configuration result[iter=2]: voltage:[18.29 V] hashrate:[8913.19 GH/s / 11827.20 GH/s=0.754], chips[underperf/max_expected]:[19/0], power[now/required/limit]:[537 W/598.78/611 W], result:[REJECTED], reason:[Underperforming chip count is above threshold]
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: Jun 28 20:52:29.799 INFO Tune/all: --- 3 ==> RESULTS stage=6 iter=2 voltage=18.2875 ---
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: mcr=75.4% measured_hr=8913.19 calculated_hr=11827.20 avg_freq=400000000
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: power_limit=611 calculated_power=537 measured_power=None
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: chip_below_ratio=43% chip_count=44 chip_count_below_threshold=19
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: satisfactory=false config_from_iter=0
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  0   71        78        72        82     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  1   83        74        81        70     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  2   73        78        82        82     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  3   83        71        84        73     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  4   73        78        76        82     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  5   82        80        77        64     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  6   72        76        76        82     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  7   70        73        79        63     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  8   67        74        77        79     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400  9   75        81        74        78     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]:   400 10   77        77        43        73     
Mon Jun 28 20:52:29 2021 daemon.err bosminer[8761]: Jun 28 20:52:29.799 INFO Tune/all: Status: measuring performace for 1800 seconds

T17's tend to have "binned" chips that weren't good enough for the expensive S17's and the like. That's part of the reason they are cheaper and kind of a hard knock life sort of thing. But if I run the chips at 500mhz I get 13.51 and 13.41 TH out of 14.78 "nominal" (about 9% off nominal) as opposed to 25% off "nominal" at 400mhz. Also I cut the voltage down to a fixed 17 volts instead of the default of 18.25 and it's running cooler. I might try 16v just to see how it works.

It is interesting to note that the boards have very limited ability to regulate voltage *on the board* so mismatched boards in a miner box can probably cause chaos. Voltage regulation is done once at the power supply which is nice (you can put big juicy FETs in the power supply and cool the hell out of them) but makes this thing much more of a unit than say an S9 (where the +12 is regulated to 9 or so on a per board basis).
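
If you want to sanity check what the tuner is complaining about, the summary lines in the log above are easy to pick apart. A small sketch, assuming only the key=value format shown in that log; the helper name is made up.

```python
import re


def parse_kv(line):
    """Pull key=value pairs out of a bosminer tuner summary line."""
    return dict(re.findall(r"(\w+)=([\w.%]+)", line))


hr = parse_kv("mcr=73.1% measured_hr=8645.45 calculated_hr=11827.20 avg_freq=400000000")
chips = parse_kv("chip_below_ratio=59% chip_count=44 chip_count_below_threshold=26")

# Recompute the two ratios the tuner reports for board 1.
mcr = float(hr["measured_hr"]) / float(hr["calculated_hr"])
below = int(chips["chip_count_below_threshold"]) / int(chips["chip_count"])

print(f"measured/calculated = {mcr:.1%}, underperforming chips = {below:.0%}")
```

The recomputed numbers match what the log prints for board 1: 73.1% measured versus calculated hashrate, and 26 of 44 chips (59%) under the threshold, which is what gets the configuration rejected.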

legendary
Activity: 4088
Merit: 7701
'The right to privacy matters'
Another note before I write up how to reflow these chips: Braiins really, really wants to auto-tune your chips.

This is nice, but the auto-tuner doesn't always work right and winds up shutting down boards for long periods of time while it tries every combination of voltage and frequency. It also likes to spit out messages like "glitches is above threshold" without any sort of clue of definition of what a glitch is, or why there is a threshold.

Anyway when testing boards I like to turn auto-tuning off and run at nice comfortable fixed frequencies. 500mhz is a nice reasonable speed, 400mhz if it's really hot outside and 600mhz if it's cool. But as I mentioned Braiins loves to auto-tune and it will even try to tune if you turn the tuning to *off*

It will even be tuning when the web console explicitly says tuning is disabled....

From what I can see: If you have *any* configuration option in the autotuning or dynamic power scaling sections with any value it will assume tuning and will drive your boards nuts. Even if you have the checkboxes for tuning cleared....

This post brought to you by 2 hours of troubleshooting and wondering why boards started dropping offline for no reason :-)

this is good news in the sense that it will be reported as a bug.

also I am not sure if it was the december the april or the june firmware.

they may have fixed it already.

legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Another note before I write up how to reflow these chips: Braiins really, really wants to auto-tune your chips.

This is nice, but the auto-tuner doesn't always work right and winds up shutting down boards for long periods of time while it tries every combination of voltage and frequency. It also likes to spit out messages like "glitches is above threshold" without any sort of clue of definition of what a glitch is, or why there is a threshold.

Anyway when testing boards I like to turn auto-tuning off and run at nice comfortable fixed frequencies. 500mhz is a nice reasonable speed, 400mhz if it's really hot outside and 600mhz if it's cool. But as I mentioned Braiins loves to auto-tune and it will even try to tune if you turn the tuning to *off*

It will even be tuning when the web console explicitly says tuning is disabled....

From what I can see: If you have *any* configuration option in the autotuning or dynamic power scaling sections with any value it will assume tuning and will drive your boards nuts. Even if you have the checkboxes for tuning cleared....

This post brought to you by 2 hours of troubleshooting and wondering why boards started dropping offline for no reason :-)
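
A rough sketch of how you could check for those leftover values before they bite you. This works on a plain dict of the miner config (however you load it); the section names "autotuning" and "power_scaling" and the "enabled" and "psu_power_limit" keys are assumptions for illustration, so check them against your own config file.

```python
# Section names are assumptions for illustration, not confirmed Braiins OS keys.
TUNING_SECTIONS = ("autotuning", "power_scaling")


def leftover_tuning_values(config):
    """Return {section: [keys]} for tuning sections holding anything besides enabled=False."""
    leftovers = {}
    for section in TUNING_SECTIONS:
        values = config.get(section, {})
        keys = [k for k, v in values.items() if not (k == "enabled" and v is False)]
        if keys:
            leftovers[section] = keys
    return leftovers


# Hypothetical config: tuning is switched off, but a power limit is still set.
config = {
    "autotuning": {"enabled": False, "psu_power_limit": 1200},
    "power_scaling": {"enabled": False},
}
print(leftover_tuning_values(config))
```

Anything this reports is a value sitting in a tuning section even though tuning is "off", which from what I saw is enough to start the tuner. Clearing those values out is what finally stopped the boards dropping for me.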
legendary
Activity: 4088
Merit: 7701
'The right to privacy matters'
Nice work  Grin

I want to talk about cleaning s17 .

The proper way involves removing 25-27 screws and a full disassembly; make sure you are anti-static.

Depending on your room they will be dusty.

They will have a lot of dust under the controller case, enough to micro short the psu.

The psu can take strong air blower to clean it since the small fans in the psu can spin really fast.


I have found my s17s get large dust right at the intake heat sinks.

I found a fast cleaning method. four screws on the intake fans expose the heat sinks on the boards.

the very heat sinks that get the air flow blocking dust bunnies.

I have a high quality shop vac from Fein. It has a 1 ⅜ inch hose; I use an adapter that allows a 1 ¼ inch horse hair brush. The brush and the adapter are 7 bucks on eBay.

Now these are always the coolest heat sinks since the intake fans are blowing cool air on them.
now these are almost always the dustiest heat sinks.
so four screws and access to clean them
and if you want, four more screws to blow clean the controller.

the time saved is huge.


you only need clean the coolest heat sinks.

you dont have to do 25-27 screws and a full disassembly .

Only the intake plate, and of course the attached fans, are removed to vacuum the lead heat sinks.

they are cool and the rest is all protected.

it is pretty easy to inspect the unit with that plate removed. simply use a flashlight and shine inside the unit.

you will be able to see any deep dust or bugs.

I find 90 to 95% of the dust and 97% of the blocked air flow is right at the intake heat sinks.

I turned a two job at the farm into Four or five hours work.

I was inspired by lightfoot and this thread.
I will link the horse hair brush and adapter.

https://www.ebay.com/itm/384228713329?hash=item5975d0e371:g:rjUAAOSw4RRgzAs7
legendary
Activity: 3080
Merit: 2228
I fix broken miners. And make holes in teeth :-)
Well, that wasn't too complex, this T17+ is back up and running at about 45 TH. So I thought I'd post a summary of what went wrong and how I fixed it.

Scenario: Board 1 and 2 would come up, then drop to zero, then come up, then down over and over. Board 3 was hard dead with 33 dies reporting.

Diagnosis: As always, try to isolate the problem. So I took out all three boards, labeled them (with the client order number and A/B/C), checked them over (no burns or loose heatsinks) and cleared any dust off them.

Then I put them in one at a time:
  • Board 1: This one hashed fine at about 14th using autotuner. Didn't drop out, worked well for 4 hours. Ok, it's probably fine.
  • Board 2: This one was odd: It was starting, running for a minute, then stopping. Logs showed the auto tuner resetting the voltage every 2 minutes. Disabled autotune, works fine at 500mhz, but not at 600mhz. Fair enough, I can disable autotune.
  • Board 3: Dead as doornail. See below.

The next step was to run them two at a time: Put boards 1 and 2 into the unit, set the auto tune to off and 600mhz for board 1, 550mhz for board 2. Ran fine with about 32th for 12 hours. Ok, we have two boards running rather happily.

Now it's time for board 3. This one was reporting 33 dies, and since the T17 models are reporting via the chip serial bus we probably had a failed die 34. Fair enough, that happens; normally people replace the chip. However sometimes it's not that and I'll post the results in the next write up on how to fix these chips.

(Hint: It's a lot better than the crummy S9 boards which is a serious improvement....)