Hacking GPU cards back into operation because I need something to do.... - page 2.

alucard20724

sr. member

Activity: 703

Merit: 272

i just pulled out my OOC box of gpus.

There's:
one R9 290,
three R9 280X
six 7970 with EK waterblocks, two of which have burnt gnd/pwr traces (water leak i think)
one R9 290X with ek waterblock... still in system.. too much of a pain to remove it.

you say you drill holes in teeth.. have you ever done a pcb repair?... that's the one thing i've never done.

AmDD

legendary

Activity: 1027

Merit: 1005

Cool, I'll be watching this. I also can add my name to the list of people willing to send a few broken cards to you. I should have some 7950's around somewhere.

64dimensions

hero member

Activity: 578

Merit: 508

Either of you fix PC power supplies?

I have an Antec 1300W that is bad.

helipotte

hero member

Activity: 650

Merit: 500

Pick and place? I need more coffee.

It does not. This card has voltage test points on the back edge. When I check the voltages with the card powered up, this is what I get:

Rail        Status  Notes
12V         good    This is only the power from the pci-e slot
3.3V        good    This is only the power from the pci-e slot
1.8V        good    Don't know what this is, suspect power for the PLX pci-e bridge chip
0.9V        good    Pci-e I/O
GDDR5       dead    Supply for the 32 GDDR5 modules
gpu1/core   dead    This is the low resistance gpu (1.0 ohm)
gpu2/core   good    This is the normal resistance gpu (2.5 ohm)
gpu1/mc     dead    This is the memory controller for the "bad" gpu (3.5 ohm)
gpu2/mc     good    This is the memory controller for the "good" gpu (35 ohm)

I suspect the start-up sequence this card uses is to power up the core and memory controller for each gpu first and then the memory due to it being shared.
Likely its firmware goes:

gpu1-->good?-->no-->crowbar.
gpu2-->good?-->yes-->turn on power.
memory-->do not turn on due to gpu 1 crowbar.

Sound plausible?
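To make that guess concrete, here is a minimal Python sketch of the sequence as I imagine it. This is only a model of the hypothesis, not the card's real firmware: the `Rail` class, the `healthy` flag and the `power_up` function are all made-up names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rail:
    name: str
    healthy: bool  # hypothetical "power good" result for this rail


def power_up(gpu_rails, memory_rail):
    """Return the names of the rails left powered under the guessed sequence."""
    powered = set()
    crowbarred = False
    for rail in gpu_rails:
        if rail.healthy:
            powered.add(rail.name)  # good rail: turn on power
        else:
            crowbarred = True       # bad rail: crowbar it, leave it off
    # Assumption: the shared memory supply only starts if no GPU rail faulted.
    if not crowbarred and memory_rail.healthy:
        powered.add(memory_rail.name)
    return powered


rails = [
    Rail("gpu1/core", False),
    Rail("gpu1/mc", False),
    Rail("gpu2/core", True),
    Rail("gpu2/mc", True),
]
print(sorted(power_up(rails, Rail("GDDR5", True))))
# prints ['gpu2/core', 'gpu2/mc']
```

With gpu1 marked bad, the model leaves gpu2 powered and the GDDR5 supply off, which matches the readings on the test points.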

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

Quote from: helipotte on January 22, 2017, 10:49:32 PM

I have two XFX cards. A 280 and 280x that I got with dead shorts on the VDDC (gpu core) supply. Turned out to be a smd ceramic cap on the back! Found it by putting a D cell battery across the VDDC and looking at the card with a thermal camera. The battery had just enough current to make the bad cap glow, but not enough to damage anything.

Currently have a Devil 13 dual 290x that has some strange measurements. One of the gpus shows much lower resistances than the other. Looking like it might have a fried GPU. Do wish I could find pinouts or datasheets on these things. I am going to try lifting the choke(s) to confirm where the shorts are. This card is absurd as far as the power goes. It has 15 power phases at 40A each. 5 per gpu core, 1 for each gpu memory controller and three for the GDDR5. That's 120A just for the memory!

Nice job! I've used a 30 Ah SAFT NiCd cell as a tester like that: enough current to heat up the short while keeping the voltage below what will blow up a hash engine or GPU. That second one could point to a bad high side FET. Does it crowbar the power supply when plugged in, by chance?

helipotte

hero member

Activity: 650

Merit: 500

Pick and place? I need more coffee.

I have two XFX cards. A 280 and 280x that I got with dead shorts on the VDDC (gpu core) supply. Turned out to be a smd ceramic cap on the back! Found it by putting a D cell battery across the VDDC and looking at the card with a thermal camera. The battery had just enough current to make the bad cap glow, but not enough to damage anything.

Currently have a Devil 13 dual 290x that has some strange measurements. One of the gpus shows much lower resistances than the other. Looking like it might have a fried GPU. Do wish I could find pinouts or datasheets on these things. I am going to try lifting the choke(s) to confirm where the shorts are. This card is absurd as far as the power goes. It has 15 power phases at 40A each. 5 per gpu core, 1 for each gpu memory controller and three for the GDDR5. That's 120A just for the memory!
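As a quick sanity check on the phase arithmetic, a small Python sketch. The phase counts and the 40A rating are the ones quoted above; the dictionary keys are just labels picked for this example.

```python
PHASE_CURRENT_A = 40  # per-phase rating quoted for this card

# Phase counts per rail as described above (labels are illustrative)
phases = {
    "gpu1 core": 5,
    "gpu2 core": 5,
    "gpu1 mc": 1,
    "gpu2 mc": 1,
    "GDDR5": 3,
}

# Rated current per rail is simply phases times the per-phase rating
budget = {rail: count * PHASE_CURRENT_A for rail, count in phases.items()}
total_phases = sum(phases.values())
total_current = sum(budget.values())

for rail, amps in budget.items():
    print(f"{rail}: {amps}A")
print(f"{total_phases} phases, {total_current}A rated in total")
```

These are rated maximums per phase, not what the card actually draws.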

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

So back from my trip, finished the backlog of work that came in, and can fiddle with these boards a bit more.

First up is an AMD R9 270 that I picked up broken on eBay. Sure enough it didn't come up as a display, but it was registered by the computer. So something was working, just not the whole thing.

The board

It's really pretty simple: On the left are the high power circuits for the GPU, on the right is a lower power circuit pair for the memory and the hotel circuits.

If you look more closely at the left side you can see how this is powered: Five separate chokes indicate 5 power supplies. The little QFN chips next to the chokes are the FETs plus drivers plus dead-time logic. The lettering on them is hard to read but I can see they are Fairchild 6705B half bridge buck drivers. They contain the switching logic, a high and low side FET, and appropriate logic to detect cut-through (shoot-through) and sense current (via the little RC circuits at the bottom of the board). Pretty simple actually. According to the docs each one can handle 40 A of current, so we're looking at a max of 200 A into the GPU. About right.

The question is: What is happening? Normally the high side FET shorts, in which case the cut-through circuitry crowbars the low side and shuts down the controller. The problem with that is if +12 was connected to the GPU, the low resistance of the GPU would essentially short out your power supply or blow the GPU sky high. Given that neither is happening I'm not sure the failure is in the FETs. It's possible the low side FET blew, but in a buck converter the low side FET is on for most of the switching cycle, so they don't usually short out. Plus the voltage drop on the high side is much higher (going from 12 V to 1 V instead of 1 V to 0 V), so the high side FET normally blows.
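To put a rough number on how long each FET is on, here is a small Python sketch using the ideal buck converter relation D = Vout / Vin. It ignores dead time and losses, and the 12 V in / 1 V out figures are just the round numbers from above, not measurements from this board.

```python
def buck_duty_cycle(v_in, v_out):
    """Ideal buck converter duty cycle (fraction of the cycle the high side FET is on)."""
    if not 0 < v_out <= v_in:
        raise ValueError("a buck converter needs 0 < v_out <= v_in")
    return v_out / v_in


d = buck_duty_cycle(12.0, 1.0)
high_side_on = d        # high side conducts for D of each cycle
low_side_on = 1.0 - d   # low side conducts for the rest

print(f"high side on: {high_side_on:.1%}")
print(f"low side on:  {low_side_on:.1%}")
```

So in the ideal case the high side FET is on for only about a twelfth of each cycle and the low side FET carries the current the rest of the time.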

Hm.... One way to find a shorted supply is to pull the chokes and check the resistance of the circuit at the output/FET side of the choke. The bad supply will read high (or low if the low side FET is shorted) and you're in business. Or if the FETs are exposed you can look for a short between gate and source or drain. Normally when a FET blows the gate is shorted as well.
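One way to turn the choke-lifting check into a routine is to compare each phase's reading against the others and flag the odd one out. A minimal Python sketch, with loud caveats: the readings below are invented, and the 10x ratio is an arbitrary threshold I picked for illustration, not a published spec.

```python
from statistics import median


def suspect_phases(readings_ohm, ratio=10.0):
    """Flag phases whose resistance differs from the median by more than `ratio` times."""
    mid = median(readings_ohm.values())
    suspects = {}
    for phase, ohms in readings_ohm.items():
        if ohms * ratio < mid:
            suspects[phase] = "low"
        elif ohms > mid * ratio:
            suspects[phase] = "high"
    return suspects


# Hypothetical readings taken at the FET side of each lifted choke
readings = {
    "phase1": 12000.0,
    "phase2": 11500.0,
    "phase3": 0.4,
    "phase4": 12500.0,
    "phase5": 11800.0,
}
print(suspect_phases(readings))
# prints {'phase3': 'low'}
```

Comparing against the median works here because most phases on a board are expected to be healthy, so the bad one stands out.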

Need to think more about this.

m1n1ngP4d4w4n

full member

Activity: 224

Merit: 100

CryptoLearner

Woah, true electronic repair guys are so rare nowadays, keep up the good work man

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

Yup, BFL's had 6 phase power supplies which really reduced ripple but went batshit if the FET drivers (2708's in the old days) would short. Likewise you had an RC circuit across each inductor. This would tell you the current through the inductor and was compared with the phase position from the LM driver to adjust for FETs running hot or cold. Problem is if that RC circuit goes out of whack then FETs start exploding.

Oddly enough Titan/Neptunes did it right: They bought off the shelf supplies, synced them in pairs, then placed them on the board around the chip in an implosion design so that no die in the chip was ever further away from one supply than the other. This matters on a 6-12 phase system since the distance from inductor to die can vary by an inch. An inch seems like a small distance, but when you're pulling 400 watts per die at 0.5 V that's 800 A of current, and P = I^2 * R, so the loss gets very big even if R is very small.
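To see how fast that adds up, here is a small Python sketch using I = P / V and the resistive loss formula P = I^2 * R. The 400 W and 0.5 V figures are the ones above; the path resistances are hypothetical round values chosen only to show the scaling.

```python
def die_current(power_w, voltage_v):
    """Current drawn by a load of power_w watts at voltage_v volts (I = P / V)."""
    return power_w / voltage_v


def resistive_loss(current_a, resistance_ohm):
    """Power burned in a resistance carrying current_a amps (P = I^2 * R)."""
    return current_a ** 2 * resistance_ohm


current = die_current(400, 0.5)
print(f"die current: {current:.0f} A")

# Hypothetical path resistances between inductor and die, in milliohms
for r_mohm in (0.1, 0.5, 1.0):
    r_ohm = r_mohm / 1000
    loss = resistive_loss(current, r_ohm)
    drop = current * r_ohm
    print(f"{r_mohm} mOhm: {loss:.0f} W lost, {drop:.2f} V dropped")
```

Even a tenth of a milliohm costs tens of watts at 800 A, and the voltage drop becomes a noticeable slice of a 0.5 V rail.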

I'm guessing a similar situation exists in the GPUs. Gotta run to Boston this weekend, if anyone's up there and wants to grab lunch let me know. Next week I'll start posting some pics of a blown 270 board I have here.

C

helipotte

hero member

Activity: 650

Merit: 500

Pick and place? I need more coffee.

All of the AMD cards I have worked on have three dc/dc supplies on them.

1) 1.5V for the GDDR5. This one is usually fixed and consists of two phases.
2) 0.9V - 1.0V for the GPU memory controller I/O. This is controllable via firmware and is often one or two phases.
3) 0.8V - 1.2V for GPU core. This is always firmware controlled and is usually at least 4 phases but some cards can have 10 or more.
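If you are logging test point readings, the three supplies above can be written down as a small lookup table and checked automatically. A minimal Python sketch: the voltage ranges come from the list above, while the rail labels, the `check_rail` helper and the 5% tolerance are assumptions made for this example.

```python
# Nominal output range (volts) for each supply, taken from the list above
RAILS = {
    "GDDR5": (1.5, 1.5),
    "memory controller I/O": (0.9, 1.0),
    "GPU core": (0.8, 1.2),
}


def check_rail(name, measured_v, tolerance=0.05):
    """Return True if measured_v is inside the rail's range, widened by tolerance."""
    low, high = RAILS[name]
    return low * (1 - tolerance) <= measured_v <= high * (1 + tolerance)


# Hypothetical readings from a card with a dead core supply
readings = {"GDDR5": 1.52, "memory controller I/O": 0.95, "GPU core": 0.0}
for rail, volts in readings.items():
    status = "good" if check_rail(rail, volts) else "dead/out of range"
    print(f"{rail}: {volts} V -> {status}")
```

A reading of 0 V only says the rail is down. It does not say whether the cause is a short on the output or a controller that refused to start it.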

I have seen shorts on all of these. I have an Asus 280x that has all THREE shorted. I have checked the gate of each mosfet and they are all good. Thinking about pulling the chokes but when I apply current limited power to it and watch with a thermal camera the GPU die heats up. Strange.

One of the cards that keeps popping mosfets is an older Nvidia 760ti. This card always blows just ONE high side fet. It will turn on, post, then a variable amount of time later (5-10 minutes) the PSU shuts down (shorted fet). I feel the gate controller is to blame, could be wrong. This card has 4 core phases and it is not always the same one that pops its high side fet. This card also has the same core resistance as a working card.

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

Quote from: m0niker on January 11, 2017, 11:39:39 PM

Will you be able to post pictures along with what you find when fixing the GPUs? It would be awesome to learn. If all goes well I'll try to find some dead cards for you to fix.

Absolutely. I did that in my Titan and Neptune threads, it's fun to do.

m0niker

newbie

Activity: 39

Merit: 0

Will you be able to post pictures along with what you find when fixing the GPUs? It would be awesome to learn. If all goes well I'll try to find some dead cards for you to fix.

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

So in terms of GPUs and such I've noticed that there are not a whole lot of chips on a GPU aside from the main chip. So I am guessing the GPU chip pretty much has it all.

If I think about this like an SGI Indigo Elan or RealityEngine, we have these parts in the video system:

Sequence command decoder
Raster memory and Display Generator
Z buffer memory and Texture memory
GE7 Geometry engine GPUs.

They require different power supplies: The decoders and memory are normal 1.8/3.3 volt systems that pull a small amount of current to serve as the hotel load. The real power is needed for the Geometry Engines, and even they had a normal supply for the sequencers and buffers with the big power reserved for the transformation engines, lighting, shading, and polygon calculations.

So if we have a board that appears as a device in Windows it's probable that the hotel circuits are running, but the vector processors are out which might be those power supplies. I'll start sorting boards based on if they come up at all, come up with bad screens (probably Z buffer or raster memory errors) or something else.
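The sorting itself is simple enough to write down. A minimal Python sketch of the three piles, where the `Bucket` names and the three yes/no observations are my own labels for illustration, not anything read from the cards:

```python
from enum import Enum


class Bucket(Enum):
    DEAD = "does not come up at all"
    BAD_SCREEN = "comes up with a bad screen (suspect Z buffer or raster memory)"
    OTHER = "something else"


def triage(detected_by_os, shows_picture, picture_corrupt):
    """Sort a board into one of three piles from simple observations."""
    if not detected_by_os:
        return Bucket.DEAD
    if shows_picture and picture_corrupt:
        return Bucket.BAD_SCREEN
    return Bucket.OTHER


# A board that registers as a device but gives no display output
print(triage(detected_by_os=True, shows_picture=False, picture_corrupt=False).value)
# prints something else
```

A board that registers but shows nothing lands in the "something else" pile, which is where the hotel circuits are alive and the big supplies are the first suspects.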

Hm.

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

You would think one could run diagnostics on the things to find the bad memory cards; those are easy to swap out but yes, a pain.

Have you tried checking the resistance with the inductors off? Back in the BFL days that was the #1 best way to identify which FET was shorted and also a way to identify a shorted die (0 ohms means infinite current no matter how you slice it).

C

helipotte

hero member

Activity: 650

Merit: 500

Pick and place? I need more coffee.

Nice to see you working on GPUs. I have a few stencils coming for tahiti/pitcairn/hawaii and I will take a crack at re-balling some of the units I have. They look like they use 0.5mm balls. Can send you some of my "trouble" units if you want to try to fix them. I have a few units that keep popping mosfets. Have been trying to find out a way to narrow down bad memory chips on cards. Don't even know if this is possible without changing them one at a time.

Cheers!

hhdllhflower

newbie

Activity: 18

Merit: 0

Reserved
nice job

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

Quote from: ?? on ??

I had a GPU die in quite a silly way: I had the 1x end of the PCI extender I was using (16x to 1x) plugged in the wrong way, and apparently this killed the card through the extender. If this is something you think you can fix, I will gladly send it to you for shipping cost.

Sure. I'll PM you my address.

C

lightfoot

legendary

Activity: 3164

Merit: 2258

I fix broken miners. And make holes in teeth :-)

Yes, I am in the US, feel free to PM me as needed.

FFI2013

hero member

Activity: 906

Merit: 507

I have a Gigabyte R9 270 you can check out. I lost the receipt to RMA it, but if you're in the US I can ship it to you. I also have a gridseed blade that needs to be looked at.

adaseb

legendary

Activity: 3808

Merit: 1723

The most common failure with GPUs is the fans. Depending on which type of fan the card uses, it can be repaired in different ways.

The Sapphire Dual-X R9 280X and Gigabyte Windforce fans all have fan blades that you can easily pop off using some string, and then you can relube the bearing with grease. Works pretty much every time.

With the more durable fans, like on the ASUS 7970 / ASUS 280x / MSI 280x, you need to drill a hole in the back slightly off-centre and pour in the thinnest oil you can get inside. This sometimes works great ... sometimes works but rattles.... The reason is that grease would be best, but it's impossible to get it onto the bearing.

For the newer RX 470 / 480 the fans will probably start failing sooner or later, but most of those cards have a 2-3 year warranty and you can just RMA them.

Topic: Hacking GPU cards back into operation because I need something to do.... - page 2. (Read 3956 times)