
Topic: Hacking Bitmain Antminers (S7 & S9) because man a lot of these break...... - page 2. (Read 2282 times)

newbie
Activity: 4
Merit: 0
You can order those chips on Alibaba. Perhaps even AliExpress.

Could you replace said chip?

If so, just replace it, add thermal paste and a heatsink, and it should be good to go.

Pardon me if my suggestion isn't the best, just trying to help. I know a good deal about computer hardware, but this isn't my expertise.

What is the name of the chip we need to purchase?
Can you post a link to an example?


I too am having issues with a miner that kills my PSUs after only half an hour.
member
Activity: 166
Merit: 83
EET/NASA intern 2013 Bitmain/MicroBT/IPC cert
Speaking of nifty tools, here is a video of an S9 hash board booting up and starting to hash normally. This shows a nifty little 2400W ammeter that I built with parts off of eBay. If I had a bigger power supply I could've tested all three at the same time (of course I've never actually owned three working boards, lol). Running a real board like this is pretty much the only way to prove beyond a doubt that your power supply is fully functional (I actually built a dummy load out of car headlights for this exact purpose).
https://youtu.be/4GV7fyXKBdA
newbie
Activity: 6
Merit: 0
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Gotcha Dave.

In the meantime here is the predator-vision view of an S7 board that powers up the chips but doesn't hash.....



Note the chips glowing normally, and the one chip glowing red. One of these things is not like the others, and in this case it's probably a shorted chip. I've pulled it for review, will swap in another chip this week and see if that fixes it.

Note: The orientation of the chips is weird, they alternate 180 degrees as you go from chip to chip on the board. Probably to better line up signal pins, but a bit confusing regardless.
newbie
Activity: 8
Merit: 0
Hey lightfoot, my account won't let me send more than 1 PM an hour, please text me on the number I PM'ed you a few weeks ago. My zip is 74008.
Thanks,
Dave
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
So anyway, time to break out the real fun debugging tools. As we used to say in thermodynamics, heat is the ultimate bullshit generator. Thus if you have unusual amounts of heat or lack thereof somewhere on a board, something is up.....

So let's plug our 62 chip board into a power supply with no connection to a controller board (steady voltage, nothing hashing) and then take a look at our board here under the eye of a Predator.....



Look at that. Yes, the chips on the inside will be warmer than the outside ones, and yes, that big blob of heat is the FETs for the power supply. Normal, but what the hell is that heat blob over on the left there? It looks like one chip is not like the others.....

Time to remove that heat sink and see what's going on there. It's one of the chips that doesn't have a second sink on the bottom (they have a delta V of chips without sinks, maybe for an airflow improvement, but very stupid in a series/parallel arrangement).

Mr. Thermal is your friend.
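If you can read a per-chip temperature off the thermal image (by hand or from the camera's export), the "one chip is not like the others" check can be sketched like this. The grid layout, the 5 degree threshold and the `hot_spots` name are all hypothetical. Each chip is compared to the median of its immediate neighbours rather than to the board average, so the normal inner-warmer-than-outer gradient shouldn't trip it.

```python
from statistics import median


def hot_spots(grid: list[list[float]], threshold: float = 5.0) -> list[tuple[int, int, float]]:
    """Return (row, col, excess) for chips hotter than their neighbours' median.

    grid holds one temperature per chip position (hypothetical layout);
    threshold is in the same units as the readings.
    """
    rows, cols = len(grid), len(grid[0])
    flagged = []
    for r in range(rows):
        for c in range(cols):
            # Collect the up-to-8 surrounding readings, skipping the chip itself.
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            excess = grid[r][c] - median(neighbours)
            if excess > threshold:
                flagged.append((r, c, round(excess, 1)))
    return flagged


if __name__ == "__main__":
    # Made-up readings: a warmer middle row plus one chip running well above its neighbours.
    board = [
        [40.0, 41.0, 40.0, 39.0],
        [44.0, 45.0, 58.0, 44.0],
        [40.0, 41.0, 40.0, 39.0],
    ]
    print(hot_spots(board))
```

A neighbour median is used instead of a mean so that the hot chip itself doesn't drag up the baseline of the chips next to it.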
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Nice work!

I had a board like that with intermittent 0 and not 0 asic. 

So if the chips have core vcc (which they do, your number is right) but they don't talk, what's next?

We're assuming they do (well, at least 62 did). My guess is the backplane is series/parallel, with the three chips in each group in parallel on the power and ground planes, as opposed to three true serial strings tied together at both ends. Nice because the power plane is more stable and uniform, a bitch because an open chip would be masked by its neighbors (although you might see this in heat maps, as the open chip would not be running at idle and its partners would be a bit warmer because they are carrying more current through them to the next series of three. Hm, where is my peek....)
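The "partners would be a bit warmer" point can be put in rough numbers. This worked version assumes the current through each series group stays about the same when one of its three parallel chips fails open (a simplification, since the supply actually regulates voltage, not current):

```latex
I_{\text{chip}} = \frac{I_{\text{group}}}{3}
\qquad\Longrightarrow\qquad
I'_{\text{chip}} = \frac{I_{\text{group}}}{2} = \frac{3}{2}\, I_{\text{chip}}
```

So under that assumption each surviving chip in the group carries about 50% more current, which is the kind of difference that might show up on a thermal image even when the voltage map looks normal.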

Quote
They need IO vcc, they need clock, and they need an unbroken connection to rx and tx on the header.
Hm. Is each chip wired to rx/tx on the header, or do they daisy chain between the chips? There are advantages either way, but if they were all in parallel and one chip grounded the line, it would sink the whole line (and rx/tx would read zero). If they're in series, any one chip could break the string if it went open. Hm.

Quote
And they need to be alive, but since they came up once it's likely they are, and are just suffering an intermittent issue with one of the other items.
Maybe. If one of the 63 put the tx/rx signal to ground or if it broke the chain that would show up as a dead board. The question is which one is doing it?

On Titans, as a comparison, the 4 main dies on each chip are connected to a common signal bus that can be isolated per chip by removing a 0 ohm jumper. However, the hotel power and ground cannot be isolated, so if a die shorts hard the board is junk. If it shorts soft you can isolate the signal, and if it fails open you just have three dies running.

Back to the S9: there's also a second supply on this board, which looks to be a 14.5 volt supply. I was wondering if that was series-shared hotel power for the hashing chips.

Quote
How good's your scope?
Pretty good, it's an older Tektronix T922. The main problem is that it's only a 15 MHz scope; I should upgrade it one of these days.
jr. member
Activity: 112
Merit: 4
Nice work!

I had a board like that with intermittent 0 and not 0 asic. 

So if the chips have core vcc (which they do, your number is right) but they don't talk, what's next?

They need IO vcc, they need clock, and they need an unbroken connection to rx and tx on the header.

And they need to be alive, but since they came up once it's likely they are, and are just suffering an intermittent issue with one of the other items.

How good's your scope?
^.^
member
Activity: 81
Merit: 10
Here's one for you: I think the temperature sensor on my S7 board has gone faulty. Will this stop it mining? And more importantly, where the hell is the damn thing? I can find it on the older boards but not on this one. Is there also a way to bypass it? I have plenty of cooling, so it's not going to overheat.
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
This is interesting. On one of the boards from last night's test I see that it did come up once briefly....

Code:
Check chain[7] PIC fw version=0x03
Fix freq=550 Chain[7] voltage_pic=6 value=940
set_reset_allhashboard = 0x0000ffff
set_reset_allhashboard = 0x00000000
Chain[J8] has 62 asic
set_reset_hashboard = 0x00000080
set_reset_hashboard = 0x00000000
retry Chain[J8] has 62 asic
set_reset_hashboard = 0x00000080
set_reset_hashboard = 0x00000000
retry Chain[J8] has 62 asic
set_reset_hashboard = 0x00000080
set_reset_hashboard = 0x00000000
retry Chain[J8] has 62 asic
set_reset_hashboard = 0x00000080
set_reset_hashboard = 0x00000000
retry Chain[J8] has 62 asic
set_reset_hashboard = 0x00000080
set_reset_hashboard = 0x00000000
retry Chain[J8] has 62 asic
set_reset_hashboard = 0x00000080
set_reset_hashboard = 0x00000000
retry Chain[J8] has 62 asic
Chain[J8] has no freq in PIC, set default freq=550M
Chain[J8] has no core num in PIC

Miner fix freq ...
read PIC voltage=940 on chain[7]
Chain:7 chipnum=62
Asic[ 0]:550
Asic[ 1]:550 Asic[ 2]:550 Asic[ 3]:550 Asic[ 4]:550 Asic[ 5]:550 Asic[ 6]:550 Asic[ 7]:550 Asic[ 8]:550
Asic[ 9]:550 Asic[10]:550 Asic[11]:550 Asic[12]:550 Asic[13]:550 Asic[14]:550 Asic[15]:550 Asic[16]:550
Asic[17]:550 Asic[18]:550 Asic[19]:550 Asic[20]:550 Asic[21]:550 Asic[22]:550 Asic[23]:550 Asic[24]:550
Asic[25]:550 Asic[26]:550 Asic[27]:550 Asic[28]:550 Asic[29]:550 Asic[30]:550 Asic[31]:550 Asic[32]:550
Asic[33]:550 Asic[34]:550 Asic[35]:550 Asic[36]:550 Asic[37]:550 Asic[38]:550 Asic[39]:550 Asic[40]:550
Asic[41]:550 Asic[42]:550 Asic[43]:550 Asic[44]:550 Asic[45]:550 Asic[46]:550 Asic[47]:550 Asic[48]:550
Asic[49]:550 Asic[50]:550 Asic[51]:550 Asic[52]:550 Asic[53]:550 Asic[54]:550 Asic[55]:550 Asic[56]:550
Asic[57]:550 Asic[58]:550 Asic[59]:550 Asic[60]:550 Asic[61]:550
Chain:7 max freq=550
Chain:7 min freq=550

max freq = 550
set baud=2
Chain[J8] set working voltage=940 [6]
setStartTimePoint total_tv_start_sys=167 total_tv_end_sys=168
restartNum = 2 , auto-reinit enabled...
do read_temp_func once...
do check_asic_reg 0x08

get RT hashrate from Chain[7]: (asic index start from 1-63)
Asic[01]=71.4200 Asic[02]=58.5860 Asic[03]=63.6860 Asic[04]=63.3500 Asic[05]=64.3570 Asic[06]=61.4880 Asic[07]=60.3810 Asic[08]=61.5380
Asic[09]=63.5520 Asic[10]=56.8910 Asic[11]=59.5420 Asic[12]=63.0320 Asic[13]=60.8500 Asic[14]=57.7470 Asic[15]=62.7970 Asic[16]=64.5750
Asic[17]=64.4070 Asic[18]=66.4200 Asic[19]=56.5890 Asic[20]=64.8940 Asic[21]=62.4950 Asic[22]=63.6860 Asic[23]=60.8840 Asic[24]=59.5080
Asic[25]=61.9070 Asic[26]=64.7930 Asic[27]=60.9180 Asic[28]=63.6350 Asic[29]=58.8710 Asic[30]=60.3810 Asic[31]=63.7700 Asic[32]=65.3300
Asic[33]=59.5750 Asic[34]=60.9340 Asic[35]=58.5020 Asic[36]=65.6150 Asic[37]=67.2430 Asic[38]=63.7700 Asic[39]=69.1550 Asic[40]=67.3430
Asic[41]=63.2830 Asic[42]=66.5380 Asic[43]=64.1890 Asic[44]=61.3540 Asic[45]=59.8100 Asic[46]=65.2960 Asic[47]=67.3770 Asic[48]=61.2700
Asic[49]=61.7560 Asic[50]=61.7560 Asic[51]=61.7230 Asic[52]=65.2800 Asic[53]=64.1720 Asic[54]=65.3470 Asic[55]=64.3400 Asic[56]=60.3300
Asic[57]=59.3910 Asic[58]=63.2660 Asic[59]=67.0080 Asic[60]=66.5710 Asic[61]=60.9340 Asic[62]=62.4610 Check Chain[J8] ASIC RT error: (asic index start from 1-63)
Done check_asic_reg
do read temp on Chain[7]
Done read temp on Chain[7]
set FAN speed according to: temp_highest=0 temp_top1[PWM_T]=0 temp_top1[TEMP_POS_LOCAL]=0 temp_change=0 fix_fan_steps=0
set full FAN speed...
FAN PWM: 100
read_temp_func Done!
CRC error counter=6567
In other words, it came up with 62 ASICs briefly, with a high CRC error count. So maybe this is a chip problem. If so, which one.......

Hm.
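One quick way to chip away at "which one" from the log alone is to pull out the per-ASIC RT hashrate entries and see which index never reported. This is a minimal sketch assuming the log format shown above; the regex and the function names are my own, not anything from the miner firmware. It only tells you which index is missing from the chain's report (fed the full RT block above, that should be 63, since the log says indices run 1-63 and 62 answered). Mapping that index to a physical chip on the board is a separate question.

```python
import re

# "get RT hashrate" entries look like "Asic[07]=60.3810". The frequency table
# ("Asic[ 7]:550") uses a colon, so this pattern deliberately skips it.
RT_ENTRY = re.compile(r"Asic\[\s*(\d+)\]=(\d+(?:\.\d+)?)")


def parse_rt_hashrate(log_text: str) -> dict[int, float]:
    """Map each reported ASIC index to its RT hashrate value, as printed."""
    return {int(idx): float(val) for idx, val in RT_ENTRY.findall(log_text)}


def missing_asics(rates: dict[int, float], expected: int = 63) -> list[int]:
    """1-based indices (per the log's 'start from 1-63' note) that never reported."""
    return [i for i in range(1, expected + 1) if i not in rates]


def slowest(rates: dict[int, float], n: int = 3) -> list[tuple[int, float]]:
    """The n lowest-reporting ASICs, as (index, value) pairs."""
    return sorted(rates.items(), key=lambda item: item[1])[:n]


if __name__ == "__main__":
    excerpt = """Asic[ 0]:550
Asic[01]=71.4200 Asic[02]=58.5860 Asic[03]=63.6860
Asic[61]=60.9340 Asic[62]=62.4610 Check Chain[J8] ASIC RT error: (asic index start from 1-63)"""
    rates = parse_rt_hashrate(excerpt)
    print(len(rates), slowest(rates, 2))
```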
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Ok, it's 3 strings of 21. If you follow the chips from the 3 on the left all the way around you see voltages like (all referenced to ground)

8.58, 8.15, 7.75, 7.27, 6.44, 6.02, 5.605, 5.164, 4.74, 4.2, 3.7, 3.2, 2.8, 2.4, 1.6, 1.2, .8, .4, 0.
(9.1 volts at source)

Or each chip pulling about .4 volts. Makes sense.
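Working that map a bit further: take the difference between adjacent readings. Each step should be roughly the same per-group drop. A step near zero points at a short, and a step about twice the usual size most likely means a probe point was skipped. A minimal sketch follows. It assumes the readings are listed in chain order, and the thresholds (25% and 150% of the median step) are arbitrary choices, not anything measured.

```python
from statistics import median

# Idle readings posted above (volts to ground), in the order they were taken.
S9_IDLE_MAP = [8.58, 8.15, 7.75, 7.27, 6.44, 6.02, 5.605, 5.164, 4.74,
               4.2, 3.7, 3.2, 2.8, 2.4, 1.6, 1.2, 0.8, 0.4, 0.0]


def step_drops(readings: list[float]) -> list[float]:
    """Voltage dropped between each pair of adjacent readings."""
    return [a - b for a, b in zip(readings, readings[1:])]


def flag_steps(readings: list[float], low: float = 0.25,
               high: float = 1.5) -> list[tuple[int, float, str]]:
    """Flag steps far from the median drop; returns (step_index, drop, note)."""
    drops = step_drops(readings)
    nominal = median(drops)
    flags = []
    for i, d in enumerate(drops):
        if d < low * nominal:
            flags.append((i, round(d, 3), "near zero: possible short"))
        elif d > high * nominal:
            flags.append((i, round(d, 3), "large: skipped reading?"))
    return flags


if __name__ == "__main__":
    for idx, drop, note in flag_steps(S9_IDLE_MAP):
        print(f"{S9_IDLE_MAP[idx]} -> {S9_IDLE_MAP[idx + 1]} V: {drop} V ({note})")
```

On the readings above it flags the 7.27 to 6.44 and 2.4 to 1.6 steps (about 0.8 V each, roughly double the ~0.42 V median), which looks more like missed probe points than a fault.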

Likewise, it looks like the three chips are run in parallel, so you get .5 ohms from one chip to another in the three-chip set. I have to think before I do a chip-to-chip test; I don't want my multimeter to back-feed voltage and damage anything...

However when we fire up the board we get:
Miner Type = S9
set_reset_allhashboard = 0x0000ffff
set_reset_allhashboard = 0x00000000
set_reset_allhashboard = 0x0000ffff
set_reset_allhashboard = 0x0000ffff
Check chain[5] PIC fw version=0x03
Fix freq=550 Chain[5] voltage_pic=6 value=940
set_reset_allhashboard = 0x0000ffff
set_reset_allhashboard = 0x00000000
Chain[J6] has 0 asic
set_reset_hashboard = 0x00000020
set_reset_hashboard = 0x00000000
retry Chain[J6] has 0 asic

So if the chips are not shorted then they probably are not the source of the problem. A dead shorted chip would also raise the voltages around the string and would probably blow up pretty quickly. An open chip would not be spotted by this test, as the ground planes are probably wired together and would mask an open chip.

I did leave one board powered up for a bit, and the heat sinks eventually warmed up. I didn't see a temp differential on the top or bottom heat sinks from sink to sink, so all engines are probably up and idle.

Hm..... Next question: it's either the support chips, or the signal line is cut. But if it's cut, where.....
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Yep, you're right, I forgot that the bottom (power plane) heat sinks are hot (pulls heat from the chip through the board). I'll make a map, and will post that and the ground fuzzing as a start.

Ultimate question is of course what is shutting down the board? We have three items:

1) hashing chips themselves
2) Power circuitry
3) Signal and support circuitry.

If it's the chip itself then it would fail either open or short. Short can be found using the map technique: Look at the voltages and find the one that reads 0 between adjacent chips. Open is easier, one string will show voltage on the first chip in the chain but no others. Finding the exact chip would then be done by measuring chip to chip resistance, one of them is going to read zero.

A side question in my brain is what the clock and signaling circuitry look like. If they all share a common clock signal, then a shorted chip would ground out the clock, which could be measured at the 25 MHz crystal. Likewise, if they daisy chain the data signal, then a shorted chip will not pass the signal or will ground it.

Back to the drawing board after I get some other stuff done. I'll see if these other two boards have a dead chip.

If it is a chip, then it's possible to remove with air tools and a fair bit of preheat. These look a bit easier than normal QFN chips, as they are thinner (warm up more quickly) and they have those nice big power strips on the side and center which should auto-center them. Hm.

Pulling the heat sinks is not too hard: just warm up the board, then use hot air at a lower temp to soften the glue, pull the sink, then clean the top of the chip before you remove it. Do you have a pin map of the chip itself? I could hot-wire a diode and try it out.

63 chips would be 9*7 or 3*3*7. So we either have 9 strings of 7 (no), 7 strings of 9 (maybe) or 3 strings of 21 (don't know about that). One way to find out....
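One way to check the candidate layouts is against the measured per-step drop. A small sketch, assuming an idealized string where every series position drops the same share of the idle voltage; `layouts`, `volts_per_position` and `best_fit` are hypothetical helper names. Each pair is (chips in series, parallel strings), so "3 strings of 21" is (21, 3) and "7 strings of 9" is (9, 7).

```python
def layouts(n_chips: int) -> list[tuple[int, int]]:
    """All (chips_in_series, parallel_strings) pairs whose product is n_chips."""
    return [(s, n_chips // s) for s in range(1, n_chips + 1) if n_chips % s == 0]


def volts_per_position(string_voltage: float, chips_in_series: int) -> float:
    """Ideal drop across one series position, assuming an even split."""
    return string_voltage / chips_in_series


def best_fit(n_chips: int, string_voltage: float, measured_step: float) -> tuple[int, int]:
    """Layout whose ideal per-position drop is closest to the measured step."""
    return min(
        layouts(n_chips),
        key=lambda lay: abs(volts_per_position(string_voltage, lay[0]) - measured_step),
    )


if __name__ == "__main__":
    for series, parallel in layouts(63):
        v = volts_per_position(9.1, series)
        print(f"{parallel:2d} strings of {series:2d}: {v:.3f} V per position")
```

With the 9.1 V source and the roughly 0.4 V per step measured earlier in the thread, `best_fit(63, 9.1, 0.4)` picks (21, 3); a step of about 1 V would instead point at (9, 7).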
jr. member
Activity: 112
Merit: 4
The short heatsinks are soldered to the chips' ground planes, meaning that after the first tier they are "hot".

Don't set a board on a metal workbench, lol.

But the main point is I wanted to tell you, lightfoot, that you might measure all the backside heatsinks to get yourself a voltage map of the board.  I don't agree that it's 7x9, nor that 28nm chips have a 1v core voltage - that's a little high for 28nm.

Flip a board so the short sinks are up, apply 12v with the ribbon disconnected so you get the 9.6v idle voltage from DC-DC, and measure all the heatsinks.  You should get a nice voltage map of the tiers which could help you debug if you get a board with blown tiers someday.

PM me if your numbers aren't consistent and I can send mine, but since you are digging into these devices I think it will be a useful exercise and good technique to practice.

Best!
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Interesting. I took the plug off a wrecked S7 spare, then put it on this S9 I'm working on. Now the unit flashes the LED 7 times, then goes solid for a minute, then goes out and never hashes. This is an improvement, but not perfect. I did however clear the pins 100%, so unless we have broken vias in there, this is not fully the problem.

However reflowing the pins did get one other board working, so I now have a reference point. Next step is to check all resistances to hot and ground, to see if there are differences between a good board, a sorta not working board, and a dead board.

Drat.
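The good-board vs. dead-board comparison is easy to keep honest once the readings are written down in the same order for each board. A minimal sketch, assuming one reading per position (voltage or resistance, same units on both boards); the 10% / 0.05 tolerances and the function name are made up, and the comparison itself is just the standard library's `math.isclose`.

```python
import math


def compare_to_reference(reference: list[float], suspect: list[float],
                         rel_tol: float = 0.10,
                         abs_tol: float = 0.05) -> list[tuple[int, float, float]]:
    """Return (position, reference, suspect) wherever the two maps disagree."""
    if len(reference) != len(suspect):
        raise ValueError("both maps need one reading per position, in the same order")
    return [
        (pos, ref, sus)
        for pos, (ref, sus) in enumerate(zip(reference, suspect))
        if not math.isclose(ref, sus, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    good = [8.58, 8.15, 7.75, 7.27]   # first few idle readings posted in this thread
    dead = [8.60, 8.13, 0.02, 0.01]   # made-up readings for a suspect board
    print(compare_to_reference(good, dead))
```

`abs_tol` keeps near-zero readings (the bottom of the string, or shorted points) from being judged purely on a relative basis, where tiny absolute differences would look huge.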
legendary
Activity: 2520
Merit: 1719
Electrical engineer. Mining since 2014.
Anyone know the digi key part number for that plug? I'm thinking of pulling one and seeing what happens. Wouldn't be the first time something like this has happened......

I remembered seeing this old post by MarkAz.

It is most likely some connector model manufactured by JST.

Quote from: MarkAz
I can't tell you the exact model, because there are literally a ton, but I can tell you who is probably the manufacturer, and that is JST - here's some of the family of products that I thought looked like close matches:

http://www.jst-mfg.com/product/detail_e.php?series=583
http://www.jst-mfg.com/product/detail_e.php?series=645
http://www.jst-mfg.com/product/detail_e.php?series=105
http://www.jst-mfg.com/product/detail_e.php?series=191
http://www.jst-mfg.com/product/detail_e.php?series=275

If you try contacting them and sending pictures, they might be able to tell you for sure...
https://bitcointalksearch.org/topic/m.11516859
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Other interesting things:

I'm wondering if the white data/comm plugs are being damaged and not making contact: The soldering job for those headers is really poor to be honest, and I've noticed that it seems to be easy to deform if you plug and unplug a few times. Plus when I try to reflow the solder I often find one or two pins that wiggle around when the solder is molten.

Likewise, on one board finger pressure on the cable is enough to make the LED come on 7 times (when it is testing each string) and stay on for a while.

Anyone know the digi key part number for that plug? I'm thinking of pulling one and seeing what happens. Wouldn't be the first time something like this has happened......
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
Anyway, been awhile, been working on things. Some thoughts about S9's:

There are 7 strings of chips in the S9; each string has 9 chips, for a total of 63.

When you power up a S9 board without any controller, the voltage between neutral and the choke should be about 9 volts or so. This is the resting voltage, and comes out to be about 1v per hashing chip. About right.

Now, when the board is connected to a controller and you power up, you should see only 200 mV or so while the controller boots. Thus the controller keeps the FETs in a mostly off condition. Not totally off, as we shall see, but mostly off.

As the unit starts to boot you will see the voltage on the choke go to about 9.5 volts, then stay there. The LED on the edge should flash (this is the reset command), then flash quickly for a few seconds (loading the hashes), then go solid on once running.

If you see the voltage go to 9.5, then drop to zero 7 times this is the controller trying to reset the chips. Problem is the chips are not responding, which is the most common failure.

Once the strings have checked out (each time a string is checked the LED flickers briefly) the voltage goes to about 9.6v, the lights stay on, and the unit starts hashing.
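If you log the choke voltage during boot (a meter with min/max capture, or a scope on a slow timebase), the sequence above can be turned into a rough check. This sketch assumes a list of voltage samples covering the whole boot up to steady state; the 8 V / 1 V / 9.3 V thresholds and the function names are my own assumptions, picked from the approximate voltages described above. It only looks at the last sample to decide "running", so the trace has to run until the board has settled.

```python
def count_reset_cycles(trace: list[float], high: float = 8.0, low: float = 1.0) -> int:
    """Count how many times the choke voltage falls from 'high' back to near zero."""
    cycles = 0
    was_high = False
    for v in trace:
        if v >= high:
            was_high = True
        elif v <= low and was_high:
            cycles += 1        # one ~9.5 V -> 0 V reset attempt
            was_high = False
    return cycles


def classify_boot(trace: list[float], settled: float = 9.3) -> str:
    """Rough verdict on a boot trace, based on the sequence described above."""
    if trace and trace[-1] >= settled:
        return "running"                # sits around 9.6 V once hashing
    if count_reset_cycles(trace) >= 7:
        return "chips not responding"   # seven reset cycles without settling
    return "undetermined"


if __name__ == "__main__":
    failing = [0.2] + [9.5, 0.0] * 7    # made-up trace: 7 resets, never settles
    healthy = [0.2, 9.5, 9.5, 9.6]      # made-up trace: boots and holds ~9.6 V
    print(classify_boot(failing), classify_boot(healthy))
```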

Note: You need solid power supplies to keep an S9 going and to let it start; if it doesn't see a solid 12v voltage it will refuse to start.

Note: You need two fans running. If you have only one, it will start to hash then shut down within a few seconds.

Note: Slushpool works fine with S9 miners.
newbie
Activity: 2
Merit: 0
I have several screenshots, and the reported temps were between 69-77 degrees at the time of the failure... I will try moving the hashing board to the middle slot now.
legendary
Activity: 3220
Merit: 2334
I fix broken miners. And make holes in teeth :-)
I have a new Antminer. The hashing boards will start hashing, and then around 8-48 hrs later one of them just stops hashing, with all ASICs reporting healthy. I swapped the controller cable with another hashing board and the problem followed the hashing board. I pulled the hashing board and everything looked new and aligned with no solder breaks (that I could see). Any ideas, lightfoot?



The only way to recover it now is either a reboot, which works sometimes, or a factory reset
Hm. Can you watch the temps with the board running, or try moving the board to the middle and see if the problem is in the position as opposed to the board.
newbie
Activity: 2
Merit: 0
I have a new Antminer. The hashing boards will start hashing, and then around 8-48 hrs later one of them just stops hashing, with all ASICs reporting healthy. I swapped the controller cable with another hashing board and the problem followed the hashing board. I pulled the hashing board and everything looked new and aligned with no solder breaks (that I could see). Any ideas, lightfoot?

https://raw.githubusercontent.com/blockopsmining/minerimages/master/hashingissues.png

The only way to recover it now is either a reboot, which works sometimes, or a factory reset