
Topic: Hacking KNC Titan / Jupiter / Neptune miners back to life. Why not? - page 34. (Read 76793 times)

sr. member
Activity: 453
Merit: 250
Well, I just bought 10 Cubes with 3 controllers; hopefully they work as he stated. I have been following this thread, but now I may end up playing along. The first thing I will try is powering up the controllers alone, then trying each cube one by one.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
legendary
Activity: 2450
Merit: 1002
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Edit: Damn, I am wrong. They are not ADP20's as I thought. But they are something. Let's describe what they are here.

First, they have power input from pins 4 and 6 of the SPI board. Interesting. Power comes in on pin 5, I know this because they put a stupid cap to ground and they always seem to do that. Likewise Pin 2 seems to be ground.

Pin 1 and 3 are unusual, they tend to go to pin 1 of the 10pin on u18, pin 9 on U17, etc.

Pin 4 and 6 go somewhere except on U19, where pin 4 goes to pin 3 on the 10 pin and 6 goes to gnd maybe.

The chips are 6 pin TSOT, the labeling is Z17R on Titans, Z17c on Neptunes except for one which is labelled THR.

So what are they? Jury is out, but something has to be powering the hotel on the chips.

More damn, trying to search on those parts is giving me crap like this:



Ok, fine, cute, but not what I need right now.

I really wish someone at KNC would just drop me a damn hint....
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Question: From the command line is it possible to screen over to see the cgminer process running? I'd like to find out the error rates.

Yes, you need to login with username 'pi' instead of 'admin', then screen -r will bring it up.
Bingo! Thanks for that, with screen up and running I don't have to lobotomize bfgminer (though it's nice to do if you want logs).

Talking to SPI directly and playing with WAAS tells me all sorts of things. Like the fact that on this Titan with two working engines and two non-working (voltage shows up, small current draws):

Die 0 is insane. Powers up, can query, but can't set the nonces. Mumbles sometimes on the SPI, corrupting results from engine 2.

Die 1 is dead as a doornail. 4 amps of nothing.

Die 2 and 3 are great. Leaving die 0 off gives a solid 40mh.

With die 0 on, 1,2,3 off:

 [2016-02-12 22:36:00] KnC 1-0: Found TITAN die with 571 cores
 [2016-02-12 22:36:00] KnC 1-1: No KnC chip found
 [2016-02-12 22:36:00] KnC 1-2: No KnC chip found
 [2016-02-12 22:36:00] KnC 1-3: No KnC chip found

...

 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x01, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x01, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x02, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x02, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x03, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x03, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x04, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x04, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x04, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x05, get 0x00)

...

[2016-02-12 22:36:13] KnC spi:
 [2016-02-12 22:36:13] 00000000   82 59 81 00 ff ff f2 04  00 00 00 5e f2 33 e3 a6 |.Y.........^.3..|
 [2016-02-12 22:36:13] 00000010   3e 2d 18 36 1c b2 4d 90  06 9d 90 51 d8 85 92 0e |>-.6..M....Q....|
 [2016-02-12 22:36:13] 00000020   e5 43 cd ec dc c6 53 86  6f e2 7c e3 b1 a9 7a 6b |.C....S.o.|...zk|
 [2016-02-12 22:36:13] 00000030   a6 a1 44 5c 17 f2 bb 9c  50 e6 22 45 6f 1d 48 b1 |..D\....P."Eo.H.|
 [2016-02-12 22:36:13] 00000040   16 ca e7 25 ea 4b 14 cf  82 4e 99 48 5e be 56 3e |...%.K...N.H^.V>|
 [2016-02-12 22:36:13] 00000050   38 01 1b 6f b2 86 cd 00  00 00 00                |8..o.......     |
 [2016-02-12 22:36:13] 00000000   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000010   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000020   00 1a 7a 52 b3 00 00 00  00 00 00 00 00 00 00 00 |..zR............|
 [2016-02-12 22:36:13] 00000030   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000040   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000050   00 00 00 00 00 00 00 81  00 ff ff                |...........     |
 [2016-02-12 22:36:13] KNC 0[1:0:65535]: Core busy (0)

Basically flipflops between that and:

 [2016-02-12 22:36:13] KnC spi:
 [2016-02-12 22:36:13] 00000000   82 a7 80 00 00 00 cc 1d  69 27 00 00 00 00 00 00 |........i'......|
 [2016-02-12 22:36:13] 00000010   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000020   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000030   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000040   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000050   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000060   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000070   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000080   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 00000090   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
 [2016-02-12 22:36:13] 000000a0   00 00 00 00 00 00 00 00  00                      |.........       |

Yep, it's stupid. Interesting.
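
For anyone who would rather count these errors than eyeball them in screen, here is a minimal Python sketch that tallies die detection and "Failed to set nonce range" lines per die. The regular expressions are modelled only on the log lines quoted in this post, so other bfgminer/cgminer builds may word things differently; the function name `summarize` and the `sample` text are made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post (an assumption:
# other firmware builds may format these messages differently).
FOUND = re.compile(r"KnC (\d+)-(\d+): Found (\w+) die with (\d+) cores")
MISSING = re.compile(r"KnC (\d+)-(\d+): No KnC chip found")
# Stops before the closing parenthesis, so a line wrapped there still matches.
NONCE_FAIL = re.compile(
    r"KnC (\d+)-(\d+): Failed to set nonce range "
    r"\(wanted (0x[0-9a-fA-F]+), get (0x[0-9a-fA-F]+)"
)


def summarize(log_text):
    """Return (dies, failures) keyed by (asic, die) as seen in the log."""
    dies = {}
    failures = Counter()
    for line in log_text.splitlines():
        m = FOUND.search(line)
        if m:
            dies[(int(m.group(1)), int(m.group(2)))] = (
                f"{m.group(3)}, {m.group(4)} cores"
            )
            continue
        m = MISSING.search(line)
        if m:
            dies[(int(m.group(1)), int(m.group(2)))] = "not found"
            continue
        m = NONCE_FAIL.search(line)
        if m:
            failures[(int(m.group(1)), int(m.group(2)))] += 1
    return dies, failures


# Hypothetical excerpt in the same format as the log above.
sample = """\
 [2016-02-12 22:36:00] KnC 1-0: Found TITAN die with 571 cores
 [2016-02-12 22:36:00] KnC 1-1: No KnC chip found
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x01, get 0x00)
 [2016-02-12 22:36:06] KnC 1-0: Failed to set nonce range (wanted 0x02, get 0x00)
"""

dies, failures = summarize(sample)
for key in sorted(dies):
    print(key, dies[key], "nonce-range failures:", failures[key])
```

Feeding it a saved log (for example, text captured from the screen session) gives a per-die failure count, which makes it easier to compare runs with different dies switched off.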



legendary
Activity: 1098
Merit: 1000
Question: From the command line is it possible to screen over to see the cgminer process running? I'd like to find out the error rates.

Yes, you need to login with username 'pi' instead of 'admin', then screen -r will bring it up.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Having Die 1 showing no voltage might actually be an easier repair than the ones that show voltage but no hashing. Your issue is most likely shorted low-side FETs holding the power supply for that core to ground. Let me know if you want to send one in for a review.

C
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
There is one odd thing: The heat sink compound on a 2 die down titan that came in was very odd. It looked almost as if it had been slathered on one side of the chip more than the other. This would cause pressure on the chip top and I have seen other types of miners where asymmetrical pressure cracks a chip and causes all sorts of weird stuff to happen.

I still need to check the tiny chips on the board, I think they interface between the SPI bus and the hashing chip but I will say that fixing a Titan that shorts out the power supply is do-able, fixing a dead die on a Titan is not very possible, and fixing a Titan board that blows out the SPI interface is a nightmare bitch on wheels.

Question: From the command line is it possible to screen over to see the cgminer process running? I'd like to find out the error rates.
copper member
Activity: 2898
Merit: 1465
Clueless!
Man this poor little guy *really* has a headache.

So I have it sort of "alive"

DC/DC   Voltage (V)   Current (A)   Power (W)   Temperature (°C)
0       0.8136        2.0703        1.684       35.200
1       0.8121        2.1641        1.757       34.600

I have 3 cubes in a similar situation: die #1 shows low amps, low power and low temp, but the other three dies hash at -.0586 V and 325 MHz, so I keep die 1 off in all three cubes.

DC/DC   Voltage (V)   Current (A)   Power (W)   Temperature (°C)
0       0.0143        0             0.000       41.600
1       0.0060        0             0.000       37.500
2       0.7566        41.2500       31.210      71.900
3       0.7625        40.3750       30.786      74.400
4       0.7729        41.6250       32.172      71.000
5       0.7694        42.3125       32.555      66.600
6       0.7583        40.5625       30.759      65.400
7       0.7567        40.9375       30.977      65.900

If you can figure out what causes this, I am sure you would have plenty of customers.


ditto...I think my 2 dead dies show some of the above issues....looks familiar at least for one

I don't know electronics (I like reading your stuff...no idea what it means exactly, it is like listening in on NASA for ASIC Nerds) :)

But anyway a LOT of 1st BATCH (first couple months) of 2014 Titans have DEAD DIES could the above be a symptom of such
ie....that is why there were so many that sucked when they went out the gate...bad quality control perhaps that you (or others)
can now MAYBE tweak?

just saying...finding a common error in all the above that is NOT a bad ASIC issue would be a lot better than the KNC argument that the chips were iffy

anyway have exhausted my attempt at contributing to this will go back and just lurk on the 'tech speak' etc



legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Ok. Hooked up pin 8 though, that doesn't shut down the controller anymore. Good.

However unit still doesn't hash. Voltages are up, temps are stable, but nothing when I fire up for hashing. Either the short on pin 4,6 is still a problem, or something else is up. Will try hooking up the second die to see if I can get the one that was 800k resistance to ground working, maybe that one will have something.

legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
It may actually be damage to the chip die itself: One of the things I found in the hulk is that one corner has a short to ground when both of the signal pins are lifted, the other three don't. That's not something that can be fixed, but thank the universe that KNC built these things to run with a die or two down.

Letting it cool down from the latest rework, I disconnected all 8 lines and pin 8 is now proper (infinite) resistance to ground. Now to see which die was running with an 8k short-out.

Never dull.

hero member
Activity: 895
Merit: 504
Man this poor little guy *really* has a headache.

So I have it sort of "alive"

DC/DC   Voltage (V)   Current (A)   Power (W)   Temperature (°C)
0       0.8136        2.0703        1.684       35.200
1       0.8121        2.1641        1.757       34.600

I have 3 cubes in a similar situation: die #1 shows low amps, low power and low temp, but the other three dies hash at -.0586 V and 325 MHz, so I keep die 1 off in all three cubes.

DC/DC   Voltage (V)   Current (A)   Power (W)   Temperature (°C)
0       0.0143        0             0.000       41.600
1       0.0060        0             0.000       37.500
2       0.7566        41.2500       31.210      71.900
3       0.7625        40.3750       30.786      74.400
4       0.7729        41.6250       32.172      71.000
5       0.7694        42.3125       32.555      66.600
6       0.7583        40.5625       30.759      65.400
7       0.7567        40.9375       30.977      65.900

If you can figure out what causes this, I am sure you would have plenty of customers.
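
A table like the one above can be sanity-checked with a few lines of Python. This is only a sketch: `parse_dcdc`, `classify`, and the `OFF_VOLTS` threshold are names and values invented here for illustration, not anything from the KnC firmware. It reads the pasted rows, marks rails with essentially no output voltage as off, and confirms that the reported power matches voltage times current on the rest.

```python
# Hypothetical threshold: a rail reporting less than this is treated as off.
OFF_VOLTS = 0.1


def parse_dcdc(table_text):
    """Parse rows of: index, volts, amps, watts, temp_c (whitespace separated)."""
    rows = []
    for line in table_text.splitlines():
        parts = line.split()
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skips the header line and any blank lines
        volts, amps, watts, temp_c = map(float, parts[1:])
        rows.append((int(parts[0]), volts, amps, watts, temp_c))
    return rows


def classify(rows, tolerance_w=0.05):
    """Label each rail 'off', 'on', or 'power mismatch' (P differs from V*I)."""
    status = {}
    for idx, volts, amps, watts, _temp_c in rows:
        if volts < OFF_VOLTS:
            status[idx] = "off"
        elif abs(volts * amps - watts) > tolerance_w:
            status[idx] = "power mismatch"
        else:
            status[idx] = "on"
    return status


# The eight rows from the table in this post.
sample = """\
DC/DC   Voltage (V)   Current (A)   Power (W)   Temperature (°C)
0   0.0143   0         0.000    41.600
1   0.0060   0         0.000    37.500
2   0.7566   41.2500   31.210   71.900
3   0.7625   40.3750   30.786   74.400
4   0.7729   41.6250   32.172   71.000
5   0.7694   42.3125   32.555   66.600
6   0.7583   40.5625   30.759   65.400
7   0.7567   40.9375   30.977   65.900
"""

for idx, label in classify(parse_dcdc(sample)).items():
    print(idx, label)
```

On these numbers rails 0 and 1 come out as off and rails 2 through 7 as on, with the reported power agreeing with V times I to within rounding.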
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Man this poor little guy *really* has a headache.

So I have it sort of "alive"

DC/DC   Voltage (V)   Current (A)   Power (W)   Temperature (°C)
0       0.8136        2.0703        1.684       35.200
1       0.8121        2.1641        1.757       34.600
2       0.8168        1.9062        1.557       32.400
3       0.8173        2.0547        1.679       34.000
4       0.8149        0.5908        0.481       35.400
5       0.8163        1.9297        1.575       36.100

But although it is registering temps, it is not powering up the full hashing. Also I noted that pin 8 while not shorted will shut down the connection to the controller if connected, so it's disconnected right now. So we have:

2 power supplies off.
Pins 4 and 6 hot wired to a 2.5 volt supply
Pin 8 disconnected

And it thinks it is at 127 °C every once in a while, then down to 30 °C. Oh well, no one is perfect. Still, no hashing.

So what's up?

Not sure, but I think I have a thought popping up. What if they had two i2c connections on each unit. One would handle the house crud, the EEPROM, LM75, and the 8 power supplies. The other goes to the chip via a common set of lines or something.

Well, I found pin 8: it's common to the four dies, on one of those pairs of lines that go to each corner, which can be disconnected by the, um... 0 ohm resistors. So in theory I could power up the heat tool, disconnect each line, and find the one that is high. With that I can get pin 8 to communicate again, leaving the mystery of where the other line to each chip goes and what it is. Is it a supply? Clock? or SCL?

Enquiring minds want to know....
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Answer, mixed.

Working on the second board is kind of odd. On this one pins 4 and 6 are once again connected and read low resistance. However putting 2.5 volts from a regulated supply on the lines brings the current draw overall to 2.5 amps. Odd. That is enough to play "find the heat". And the heat is found, it's in the Titan die.
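
The arithmetic behind "find the heat" is just Ohm's law, sketched below in Python with the numbers from this post. Treat it as a rough estimate: the 2.5 A is the overall draw, so the resistance is the effective load of everything hanging off those lines, and the function name `load_estimate` is made up for illustration. The 9k comparison uses the reading from the cleared pin 4 line mentioned elsewhere in this thread.

```python
def load_estimate(volts, amps):
    """Return (effective resistance in ohms, dissipated power in watts)."""
    return volts / amps, volts * amps


# Figures from the post: 2.5 V from a regulated supply, 2.5 A overall draw.
resistance, power = load_estimate(2.5, 2.5)
print(f"{resistance:.2f} ohm, {power:.2f} W")

# For comparison, a line reading 9k to ground would pass well under a
# milliamp at the same 2.5 V.
leak_ma = 2.5 / 9000 * 1000
print(f"{leak_ma:.2f} mA")
```

About 6 W going into a fault that reads roughly 1 ohm is plenty to feel or see on a thermal camera, which fits the heat turning up in the die.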

Crap.

Then again maybe not so much crap: Something is powering up. I think I'm going to sleep, and will fiddle with this more tomorrow. Maybe what I can do is bypass these two power supply lines by using my cut ribbon cable and explicitly supplying them with 2.5 volts from the big supply. Maybe something will come up.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Very. Very. Interesting.

So I'm working on the hulk of a Titan board, a wreck that I'm taking apart. Pins 4 and 6 were shorted to ground, which blows controllers; I have two more just like it from a major power supply existence failure.

After removing just about every component in existence on the pin 4 rail (which is vcc for the accessory devices), I gave it a shock on that line with 5 volts at several amps. Line cleared, 9k resistance to ground. Pin 6 is still dead; it is a totally hard short, and I'm not sure where it is or goes. Crap.

Anyway, Plugged it into the scratch Neptune controller, didn't blow anything, didn't do anything. Plugged it into the scratch titan, nothing.

Except. Except that slot reported as type TI. Which means the EEPROM came online and checked in when queried. It has no more power supplies, has no more hope, but it did squeak which indicates:

1) The SDA (data) and SCL (clock) lines are undamaged
2) Pin 4 is the power line for the accessories.

And oddly enough the temp chip didn't say anything. Hm. Hm de hm de hm...

I don't want to power shock the second board, I'd rather find the bad component. I wonder what would happen if I put a new LM75 chip on there....
 
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Hm. Pin 4 short on the 10 pin line to a dead Titan board cleared with a few amps of power at 3.3 volts. Drat, probably silicon.

Pin 6 however is hard shorted. No amount of current will clear, and I see no indication of connection to the chip or anything on the board. Weird as hell. Need to source pin 6 on the controller board and see exactly what line that is.

Another thought: I found four lines going to the chip, one per corner. Each one has a very clear set of 0 ohm jumpers between board and chip, clearing the jumpers shows solid resistance on the lines. It's possible those are the SCL connections to each chip corner, will continue to review. But clearing them did *not* clear the faults. Damn I wish I had a fracking connection map for this thing. Hate vendors who don't release anything. Ug!

Ahem. Ok, back to other things.
legendary
Activity: 2450
Merit: 1002
I did purchase GenTarken software and it is working great! It has a feature that warns you a cube is overheating. It tells you which cube and then shuts off that die. I use this to determine if the cube needs to be refurbished and do a PM on it to get it to cool down. It's great software and I highly recommend it.

Thanks for your kind words & feedback! =)
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Hm. Some progress on the Titans and really wrecked controller boards:

I'm mapping out the schematic for the power supply on the control board, it looks like some of the traces were damaged. Specifically one out of the three supplies comes up, one is a wild voltage, and one is down. I think the issue is the choke line on the second and a short on the third. Will continue working on it.

On the Titan end, I'm tracing down those faults on the SDA lines. They are in the chip, that much is certain. However the clock still seems to be ok, so what I am wondering is whether I can find a way to cut the trace to the chip itself. I know how KNC types would think: if they knew an SDA line could go bad, they would have a way to bypass it. Hm....
hero member
Activity: 808
Merit: 502
I did purchase GenTarken software and it is working great! It has a feature that warns you a cube is overheating. It tells you which cube and then shuts off that die. I use this to determine if the cube needs to be refurbished and do a PM on it to get it to cool down. It's great software and I highly recommend it.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Another happy customer. Glad everything's back to normal!