Topic: Hacking KNC Titan / Jupiter / Neptune miners back to life. Why not? - page 19. (Read 76793 times)

sr. member
Activity: 342
Merit: 250
Good point, he's right, they should be hot. You said you repasted, so double-check that BOTH alignment pins are in the holes.
legendary
Activity: 2450
Merit: 1002
If you touch the actual heatpipes on the heatsink near the base and they feel COOL, then heat is DEFINITELY NOT being transferred to them from the ASIC, meaning the ASIC could EASILY be OVERHEATING.
You want the pipes to feel warm / hot. That's evidence of good heat transfer away from the ASIC.
newbie
Activity: 6
Merit: 0
I think this must be my problem. It makes no sense in my brain that all 4 dies can run at 250mh but if I put them at 300mh they shut off. In my brain either something is broken or it isn't. So if they are not 100% broken, then why can they not run at 300mh? The only logical explanation is that something is getting too hot that isn't being monitored by the software, and it shuts off dies when they get too hot, even though the temps in the software are really low. Either that or something is short circuiting when it gets too hot.

I felt the copper heatsink tubes and they are all pretty cool. Right now my temps are at 47C and 65C DCDC average temp, and 42C and 54C DCDC average temp on the cube running at 250mh. The last thing I may try is to see if the cover plate for the heatsink that goes over the copper tubes is too tall.

Ideally you have an aluminum plate that goes over the copper section that makes contact with the CPU. And this should be the same height as the copper tubes. Something like this # = Aluminum plate | O = Copper heatsink tubes
######
#OOOO#

But if this plate is too high or the copper is too short, then maybe the copper isn't touching the CPU properly.
######
#OOOO#
#         #
legendary
Activity: 1612
Merit: 1608
精神分析的爸
Likewise my one recommendation would be to put a 30 amp automotive fuse in line to each Titan from that big supply. That way if a Titan shorts a capacitor it will blow the fuse before catching fire.

Actually a super idea would be to put a 10 amp fuse on each one of the three lines that feed each titan's PCI plug. That way if a plug line went bad the other two fuses would blow, saving the titan from the ground fault fail.

that's a pretty good idea, one of the y connector resellers should jump on that with 3 inline fuses on each connector -- I just ordered some more Y-connectors, maybe I'll give it a try

May I ask you for a recommendation on where to buy good Y-connectors? The ones I found are too thin (AWG18); I think the original ones were AWG16.
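To put a rough number on the AWG18 vs AWG16 difference, here is a small Python sketch. It uses the standard AWG diameter formula and an approximate room-temperature resistivity for copper. The 10 A per line figure is only borrowed from the fuse suggestion quoted above, and the helper names are made up for illustration, so treat the output as a ballpark rather than a rating for any particular cable.

```python
import math

# Approximate resistivity of copper at about 20 C (assumed value).
COPPER_RESISTIVITY_OHM_M = 1.68e-8


def awg_diameter_mm(awg):
    """Conductor diameter from the standard AWG formula."""
    return 0.127 * 92 ** ((36 - awg) / 39)


def resistance_ohm_per_m(awg):
    """DC resistance of one metre of solid copper wire of this gauge."""
    radius_m = awg_diameter_mm(awg) / 2 / 1000
    return COPPER_RESISTIVITY_OHM_M / (math.pi * radius_m ** 2)


def heat_w_per_m(awg, amps):
    """I^2 * R heating per metre of wire at the given current."""
    return amps ** 2 * resistance_ohm_per_m(awg)


if __name__ == "__main__":
    # 10 A per line is an assumption taken from the fuse idea, not a measurement.
    for awg in (18, 16):
        print(
            f"AWG{awg}: {awg_diameter_mm(awg):.2f} mm, "
            f"{resistance_ohm_per_m(awg) * 1000:.1f} mOhm/m, "
            f"{heat_w_per_m(awg, 10):.2f} W/m at 10 A"
        )
```

At the same current the thinner wire dissipates more heat per metre, which is why the gauge matters on long or heavily loaded Y-connectors.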

TIA
sr. member
Activity: 342
Merit: 250

Oh, sorry to hear that, that's a really tough business to make it in. When I was younger I was laid off a few times, but after a period of adjustment I always found something better. Glad it worked out, enjoy your new place.
legendary
Activity: 2450
Merit: 1002

The fan failure trip is all circumstantial. It uses the temperature difference between the DCDCs to decide whether a fan may be dead. There's nothing that I've seen that detects actual fan failure. Keep in mind this is completely separate from the temp of the dies (since we have no way of reading the die temp).
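To make that heuristic concrete, here is a rough Python sketch of a fan-failure guess based only on the spread between DCDC temperatures. This is not the KNC or GenTarkin firmware code. The function name, the 15C threshold and the sample readings are all made up for illustration.

```python
def suspect_fan_failure(dcdc_temps_c, max_spread_c=15.0):
    """Guess at a dead fan from DCDC temperatures alone (hypothetical logic).

    Flags a problem when the hottest and coolest DCDC readings differ by
    more than max_spread_c. Die temperature never enters into it, because
    there is no die temperature reading to use.
    """
    readings = [t for t in dcdc_temps_c if t is not None]
    if len(readings) < 2:
        return False  # nothing to compare against
    return max(readings) - min(readings) > max_spread_c


if __name__ == "__main__":
    print(suspect_fan_failure([65.0, 66.5, 64.0, 67.0]))  # even temps
    print(suspect_fan_failure([65.0, 66.5, 64.0, 92.0]))  # one DCDC running away
```

A check like this can only infer a fan problem indirectly, so a badly seated heatsink with healthy DCDC temperatures would pass straight through it.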

As for where I've been lately, lotsa life stuff going on. Was laid off a few months back due to a business closure, then a couple of weeks later was notified I had to move out in 90 days because the owner wanted to sell where I was living, so the past couple of months have been all about finding a new job and a place to live. Luckily it all worked out. Just finished moving into my new place.... now time to unpack, which will take a while as well. LOL!
sr. member
Activity: 342
Merit: 250

hey tarkin where you been hiding?

The stability issues of an improperly mounted heatsink are pretty flagrant. The usual culprit is the alignment pins not being in the holes, but if you catch it quick you'll be ok. I imagine the fan failure code will shut the dies off, but I haven't tried it, and I double-check before closing a cube up. It's an easy mistake if you're in a hurry.
legendary
Activity: 2450
Merit: 1002

I don't think the dies themselves even have the hardware (thermistor) to read die temps. I've never received factual evidence of this, and obviously KNC hasn't coded for it, so I doubt such a mechanism exists. In reality we don't know the actual die temps under that heatsink.
Meaning a heatsink could be mounted incorrectly and we would never know it, outside of stability issues / burn spots on the PCB for certain dies.
It really sucks. If someone wants to poke n prod and knows more low level code than myself, go for it =) Maybe something can be discovered and there may be a thermistor for each die, but at this point I highly doubt it.
newbie
Activity: 6
Merit: 0
Thanks TXSteve, the fans are still running and running fast, they are about 2 months old. What's weird is my temps don't rise up and get hot. The main temperature will show 55C and the DC/DC temps show a max of 75C, and GenTarkin starts shutting off dies. (Previously, before Tarkin, dies would just reboot and run for a bit and reboot and run for a bit.) Then GenTarkin will say dies 3.1 3.2 3.3 3.4 have been removed from health check.

Thanks Lightfoot, I will check if the heat pipes are getting warm. I think the heatsink is making proper contact. I resurfaced it 4X using various thermal products. My temps don't get that hot, like 55C on the main page and 75C max DC/DC average.

It seems like something is shorting out when it gets slightly hot, like maybe metal is expanding in the heat and shorting. Haha, I used to run this thing at like 90C-95C on the DC/DC temps and never had issues. I will keep playing around and see if I can figure anything out.

I noticed in PUTTY that die temps don't show up in there. It has a section for it called "hottest temperature", but it's blank. I heard that KNC didn't program WAAS drivers or something to retrieve the die temps.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Are you sure the heat sink on the chip is making proper contact? Those studs bend in, which could cause a small gap. How are the die temps and can you feel the warmth in the heat pipes?
newbie
Activity: 6
Merit: 0
Does anyone have any ideas on what else inside a Titan could be getting hot and causing instability, making dies shut off or reboot? Something getting hot that isn't monitored or displayed in the ROM software?

I have 1 cube that is causing issues and often reboots dies and/or shuts off dies. I have 8 copper heatsinks on 8 little chips (not sure what these are called, maybe the DC/DC?). I have the high-powered Noctua fans. I resurfaced all CPUs (and on this problem cube I have tried Arctic Silver, GELID and ARCTIC MX4). I sawed away most of the heatsink plate (to make room for the copper heatsinks) and removed all that rubber greasy thermal stuff. I have the GenTarkin ROM as well.

Even after all these mods this 1 cube reboots or shuts off, even though the temperatures show 53C and 72C DC/DC temp. This is in a hot garage in Florida, and the temps are way lower than the old days when the DC/DC would run at like 95C. During the night this 1 cube can run fine at like 50C and 65C DC/DC temp. But once it gets a little hot outside, issues happen. So there must be something "else" getting hot that isn't displayed in the ROM software. Any tips or ideas will be appreciated. If I log in through PUTTY will that help? I did it years ago, but don't remember if there is anything useful in PUTTY / PI that may show the problems. Thanks all.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Sorry, focused on these weird dead boards that come up, no hash, PSU's appear. They seem to have something in common, in each case the readings from U19 (the Z17C I think) to ground on the pins are different from Neptunes (which I know work). These seem to be the level controllers to match the signals from SPI to the Neptune/Titan chips which run on a different voltage. Still have no idea what this chip is, anyone know?

If these things blow up with enough force to take out the FPGA maybe they are also taking these chips out too.

Investing way too much time in this :-) But curious.
hero member
Activity: 808
Merit: 502
Thanks for the help and suggestions. I re-pasted all of my cubes in February / March this year. I also replaced the fans and redid the DC/DC with new heat sinks, just like suggested in the YouTube video. I have some good news though. I removed the cube from the controller I had it hooked up to and let it sit for a few days. I then connected it to another controller and it fired up and started running at full speed. I dropped the frequency to 250 MHz and it has been running great for over a day now.
copper member
Activity: 2898
Merit: 1465
Clueless!
If it is used or old or has not been repasted, this may be an issue to consider (after others with more skills than myself have chimed in).

Again, others on here may have better ideas. But if your stuff was like mine and not cleaned up in 24 months or so, check it: the rubber thermal pads degrade into oil all over, for example, and the thermal paste looked like talcum powder when I took it off. Don't be me if this seems to be the case.

Link to what I did for mods:

https://bitcointalksearch.org/topic/m.15286157

Anyway, I've had 10C to 15C drops in temps on my DC/DCs. With the basement at 84F now, they are running 10C cooler than this last winter, when the basement was 65F.

I have cubes I've repasted with no luck (though they likely are more stable and can run longer and be overclocked without issue) and stuff that has come back, so it's something to consider whether or not it solves your problem above.

My setup (partial: 5 Titans hosted and 2 others in another room in the basement):

lostgonzo.imgur.com

Again, others can chime in here with less drastic suggestions first, but this sounds like my symptoms (though with a Titan, if a cube decides to brick itself it usually just does so).

Good luck. For any other ideas/help, PM me.



hero member
Activity: 808
Merit: 502
I have a Titan cube that was working pretty well for quite some time and now it has started to malfunction. I have done various experiments to troubleshoot. It will sometimes work for a short period if I drop the hashing frequency down to 100 MHz. It runs for a few minutes, then it stops hashing. I looked into BFGMiner using telnet, and sometimes the entire cube is offline or the frequency drops to zero.

--------------------------------------------------------------------------------
 KNC 0:       | 82.77/80.15/80.61Mh/s | A: 30 R:0+0(none) HW:4/.72%
 KNC 1:       | 81.93/68.07/68.46Mh/s | A: 38 R:0+0(none) HW:1/.19%
 KNC 2:       | 64.31/54.09/54.43Mh/s | A: 32 R:0+0(none) HW:4/.94%
 KNC 3:       | 36.63/32.17/32.30Mh/s | A: 18 R:0+0(none) HW:1/.40%
 KNC 4:       |   0.0/  0.0/  0.0 h/s | A:  0 R:0+0(none) HW:0/none

Does anyone know what can cause this particular symptom?

I swapped the data cable and cleaned the cube motherboard; nothing changed. I think the ASIC is still good. I am hoping it is a cap or power supply. Where is the primary oscillator circuit: in the ASIC or somewhere on the motherboard?
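For anyone who wants to watch for this symptom automatically, here is a minimal Python sketch that parses status lines like the ones pasted above and lists the entries reporting zero hash rate. The regular expression is written only against this sample, so other BFGMiner versions or devices may format the line differently. I believe the three rate figures are the recent, average and effective hash rates, but that reading and the helper names are assumptions.

```python
import re

# Pattern written against the sample lines in this post only (assumption).
LINE_RE = re.compile(
    r"^\s*(?P<name>KNC\s+\d+):\s*\|\s*"
    r"(?P<cur>[\d.]+)/\s*(?P<avg>[\d.]+)/\s*(?P<eff>[\d.]+)\s*(?P<unit>[kMG]?h/s)"
    r"\s*\|\s*A:\s*(?P<accepted>\d+)"
)

SAMPLE = """\
 KNC 0:       | 82.77/80.15/80.61Mh/s | A: 30 R:0+0(none) HW:4/.72%
 KNC 1:       | 81.93/68.07/68.46Mh/s | A: 38 R:0+0(none) HW:1/.19%
 KNC 2:       | 64.31/54.09/54.43Mh/s | A: 32 R:0+0(none) HW:4/.94%
 KNC 3:       | 36.63/32.17/32.30Mh/s | A: 18 R:0+0(none) HW:1/.40%
 KNC 4:       |   0.0/  0.0/  0.0 h/s | A:  0 R:0+0(none) HW:0/none
"""


def parse_status(text):
    """Return one dict per status line that matches the pattern."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append({
                "name": " ".join(m["name"].split()),
                "rates": (float(m["cur"]), float(m["avg"]), float(m["eff"])),
                "unit": m["unit"],
                "accepted": int(m["accepted"]),
            })
    return entries


def dead_entries(entries):
    """Names of entries whose three hash-rate figures are all zero."""
    return [e["name"] for e in entries if all(r == 0.0 for r in e["rates"])]


if __name__ == "__main__":
    print(dead_entries(parse_status(SAMPLE)))  # ['KNC 4']
```

Run against a saved copy of the telnet output every few minutes, a check like this would show whether the same entry always drops out first or whether the whole cube goes at once.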
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Ok. No sense crying over spilt milk so I fixed my rig. Turned out my Pi's memory was corrupted as well, replaced that, the FPGA, power chip, etc etc etc and now I seem to be back up and running.

So I have a few boards here that identify but don't hash. I think the blowout I saw on my controller was a clue to what's going on; when the power supplies blow or the pins on the PCIe go bad when you cook the leads *and* you're running with a bigger power supply all hell can blow loose on the SPI bus. The signal bus does go through those level converters on the Titan, my guess is they are either blown open or shorted. I still don't know the part, but they seem to be the same as the ones on the Neptunes. I'll try swapping them with neppie parts and see if the board comes up.
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
Ok, I have a really fun one here. Titan blew up, #4 supply shorted hard enough to break the FETs and twist the PCB board. Never seen anything quite like that, will post pics.

Removed the supply, put it on my test controller. Blew up test controller's FPGA. Great. Replaced FPGA chip, removed #5 supply as well, hooked up with no 12 volt power (oh, the 12 v PCIe is broken down to just pins due to an earlier failure I guess).

Very odd. No power supplies, but the chip temp from the LM75 *does* show up. Doesn't report as a TI. Hm.

Replaced the EEPROM chip with one from a dead board. Now the EEPROM comes up, I can see the supplies, but the chip will not come up to hash. Just does the voltage going nowhere thing.

Edit: It also appears to have taken out my RPi as well, as it is not stable. #5 supply had the +12 line shorted to the SPI line, which put a nice fat 12v on all of the SPI components. Great, back to the drawing board.....

What a mess :-)
copper member
Activity: 2898
Merit: 1465
Clueless!


Off topic kinda..but then again you have to have equip (Unicorns) in order to hack them...so we bagged a herd FYI.

Check out the thread below in market place. Again I have 6 ...2 I took home and 4 are hosting with Maxumark.


Unicorn Herd bagged!

KNC 400mh 5 cube NOS (legit I got 6 ..two at home and 4 hosting with Maxumark) in hand.

here is the thread in market place.

So if my rep on here is worth anything check it out. (Then again it is unicorns if you think I've lost it I understand)

link (with my post #2 on thread for more info)

https://bitcointalksearch.org/topic/m.15757580

Again contact Maxumark not me. I'm just spreading the word. I have met the man spent last week at his house. It is legit for whatever my word is worth on here.


good luck

Searing
legendary
Activity: 3164
Merit: 2258
I fix broken miners. And make holes in teeth :-)
They're difficult to replace due to the high solder temps required for reflow and the limited air you can get under the VRM without dropping off components. Pain in the tail. However, replacing the top side FETs (which are usually what short) is possible if the underlying pads aren't destroyed.

The bigger problem is if the short caused a voltage spike on the 12v rail. This can blow out other components or boards on the same supply, very annoying. But yes, I can take a look at this and see what's what.