Topic: Hacking GPU cards back into operation because I need something to do.... - page 2. (Read 3882 times)

sr. member
Activity: 703
Merit: 272
i just pulled out my OOC box of gpus.

There's:
one R9 290,
three R9 280X
six 7970 with EK waterblocks, two of which have burnt gnd/pwr traces (water leak i think)
one R9 290X with EK waterblock... still in system.. too much of a pain to remove it.

you say you drill holes in teeth.. have you ever done a pcb repair?... that's the one thing i've never done.
legendary
Activity: 1027
Merit: 1005
Cool, I'll be watching this. I also can add my name to the list of people willing to send a few broken cards to you. I should have some 7950's around somewhere.
hero member
Activity: 578
Merit: 508
Either of you fix PC power supplies?

I have an Antec 1300W that is bad.
hero member
Activity: 650
Merit: 500
Pick and place? I need more coffee.
It does not. This card has voltage test points on the back edge.  When I check the voltages while it is powered up, this is what I get:

12V            good  This is only the power from the pci-e slot
3.3V           good  This is only the power from the pci-e slot
1.8V           good  Don't know what this is, suspect power for the PLX pci-e bridge chip
0.9V           good  Pci-e I/O
GDDR5        dead  supply for the 32 GDDR5 modules
gpu1/core   dead  This is the low resistance gpu  (1.0 ohm)
gpu2/core   good  This is the normal resistance gpu (2.5 ohm)
gpu1/mc     dead  This is the memory controller for the "bad" gpu (3.5 ohm)
gpu2/mc     good  This is the memory controller for the "good" gpu (35 ohm)

I suspect the start-up sequence this card uses is to power up the core and memory controller for each gpu first and then the memory due to it being shared.
Likely its firmware goes:

gpu1-->good?-->no-->crowbar.
gpu2-->good?-->yes-->turn on power.
memory-->do not turn on due to gpu 1 crowbar.

Sound plausible?
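
A minimal sketch of that guessed sequence, just to check it against the test-point readings. This is only the hypothesis above written out, not the card's real firmware; the function name and the rail labels are made up for illustration.

```python
def power_up(gpu1_ok: bool, gpu2_ok: bool) -> dict:
    """Model the guessed start-up order: each GPU first, shared memory last."""
    rails = {}
    # Step 1 and 2: bring up core + memory controller for each GPU,
    # or crowbar that GPU's supplies if it looks bad.
    for name, ok in (("gpu1", gpu1_ok), ("gpu2", gpu2_ok)):
        rails[name] = "on" if ok else "crowbar"
    # Step 3: the shared GDDR5 supply only comes up if no GPU was crowbarred.
    all_gpus_on = all(state == "on" for state in rails.values())
    rails["memory"] = "on" if all_gpus_on else "off"
    return rails


# The situation measured on this card: gpu1 bad, gpu2 good.
result = power_up(gpu1_ok=False, gpu2_ok=True)
print(result)
```

With gpu1 bad and gpu2 good, the model gives gpu1 crowbarred, gpu2 on and memory off, which is the same pattern as the dead/good readings in the list above.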
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
I have two XFX cards.  A 280 and 280x that I got with dead shorts on the VDDC (gpu core) supply.  Turned out to be a smd ceramic cap on the back!
Found it by putting a D cell battery across the VDDC and looking at the card with a thermal camera.  The battery had just enough current to make the
bad cap glow, but not enough to damage anything.

Currently have a Devil 13 dual 290x that has some strange measurements.  One of the gpus shows much lower resistances than the other.  Looking like it might
have a fried GPU.  Do wish I could find pinouts or datasheets on these things.  I am going to try lifting the choke(s) to confirm where the shorts are.

This card is absurd as far as the power goes.  It has 15 power phases at 40A each: 5 per gpu core, 1 for each gpu memory controller and three for the GDDR5.  That's 120A just for the memory!
Nice job! I've used a 30ah SAFT NiCD cell as a tester like that, enough current to heat up the short while keeping the voltage below what will blow up a hash engine or GPU. That second one could point to a bad high side FET, does it crowbar the power supply when plugged in by chance?
hero member
Activity: 650
Merit: 500
Pick and place? I need more coffee.
I have two XFX cards.  A 280 and 280x that I got with dead shorts on the VDDC (gpu core) supply.  Turned out to be a smd ceramic cap on the back!
Found it by putting a D cell battery across the VDDC and looking at the card with a thermal camera.  The battery had just enough current to make the
bad cap glow, but not enough to damage anything.

Currently have a Devil 13 dual 290x that has some strange measurements.  One of the gpus shows much lower resistances than the other.  Looking like it might
have a fried GPU.  Do wish I could find pinouts or datasheets on these things.  I am going to try lifting the choke(s) to confirm where the shorts are.

This card is absurd as far as the power goes.  It has 15 power phases at 40A each: 5 per gpu core, 1 for each gpu memory controller and three for the GDDR5.  That's 120A just for the memory!
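
For anyone who wants to check the phase arithmetic, here is a quick sketch. The phase counts and the 40A per-phase figure are the ones quoted above; the dictionary layout and names are just for illustration, and the totals are rated maximums, not what the card actually draws.

```python
PHASE_CURRENT_A = 40  # per-phase rating quoted above

# Phase count per rail on the Devil 13, as described in the post.
phases = {
    "gpu1 core": 5,
    "gpu2 core": 5,
    "gpu1 mc": 1,
    "gpu2 mc": 1,
    "GDDR5": 3,
}


def rail_budget(phase_counts: dict, per_phase_a: float) -> dict:
    """Return the maximum rated current per rail in amps."""
    return {rail: count * per_phase_a for rail, count in phase_counts.items()}


budget = rail_budget(phases, PHASE_CURRENT_A)
total_phases = sum(phases.values())

for rail, amps in budget.items():
    print(f"{rail:10s} {amps:4d} A")
print("phases:", total_phases)
```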
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
So back from my trip, finished the backlog of work that came in, and can fiddle with these boards a bit more.

First up is an AMD R9/270 that I picked up on Ebay broken. Sure enough it didn't come up as a display, but was registered by the computer. So something was working, just not the whole thing.

The board


It's really pretty simple: On the left are the high power circuits for the GPU, on the right is a lower power circuit pair for the memory and the hotel circuits.

If you look more closely at the left side you can see how this is powered: Five separate chokes indicate 5 power supplies (phases). The little QFN chips next to the chokes are the FETs plus drivers plus dead-time logic. The lettering on them is hard to read but I can see they are Fairchild 6705B half bridge buck drivers. They contain the switching logic, a high and low side FET, and appropriate logic to determine cut-through and current sensing (via the little RC circuits at the bottom of the board). Pretty simple actually, according to the docs each one can handle 40A of current, so we're looking at a max of 200A into the GPU. About right.

The question is: What is happening? Normally the high side FET shorts, in which case the cut-through circuitry crowbars the low side and shuts down the controller. The problem with that is if +12 was connected to the GPU the low resistance of the GPU would essentially short out your power supply or blow the GPU sky high. Given that neither is happening I'm not sure if the failure is in the FETs. It's possible the low side FET blew, but since the low side FET is on for most of the switching cycle in a buck converter they don't usually short out. Plus the voltage drop on the high side is much higher (going from 12v to 1v instead of 1v to 0) so the high side FET normally blows.
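
To put a rough number on how much of the cycle each FET is on: for an ideal buck converter in continuous conduction the high side duty cycle is Vout/Vin. The sketch below ignores losses and dead time, and the 12v in / 1v out figures are just the round numbers from the paragraph above; the function name is made up.

```python
def buck_duty(v_in: float, v_out: float) -> tuple:
    """Ideal buck converter: return (high_side_duty, low_side_duty)."""
    if not 0 < v_out <= v_in:
        raise ValueError("expected 0 < v_out <= v_in for a buck converter")
    d_high = v_out / v_in   # fraction of the cycle the high side FET conducts
    d_low = 1.0 - d_high    # remainder of the cycle is carried by the low side FET
    return d_high, d_low


d_high, d_low = buck_duty(12.0, 1.0)
print(f"high side on {d_high:.1%} of the cycle, low side on {d_low:.1%}")
```

So at these voltages the low side FET carries the current for roughly eleven twelfths of every cycle, while the high side FET switches the full 12v for the short remainder.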

Hm.... One way to find a shorted supply is to pull the chokes and check the resistance of the circuit at the output/FET side of the choke. The bad supply will read high (or low if the low side FET is shorted) and you're in business. Or if the FETs are exposed you can look for a short between gate and source or drain. Normally when a FET blows the gate is shorted as well.


Need to think more about this.
full member
Activity: 224
Merit: 100
CryptoLearner
Woah, true electronic repair guys are so rare nowadays, keep up the good work man.
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
Yup, BFL's had 6 phase power supplies which really reduced ripple but went batshit if the FET drivers (2708's in the old days) would short. Likewise you had an RC circuit across each inductor, this would tell you the current across the inductor and was compared with the phase position from the LM driver to adjust for FETs running hot or cold. Problem is if that RC circuit goes out of whack then FETs start exploding.

Oddly enough Titan/Neptunes did it right: They bought off the shelf supplies, synced them in pairs, then placed them on the board around the chip in an implosion design so that no die in the chip was ever further away from one supply than the other. This matters on a 6-12 phase system since the distance from inductor to die can vary by an inch, and while an inch seems like a small distance, when you're pulling 400 watts per die at .5v that's 800A of current and P=I^2*R, so the loss gets very big even if R is very small.
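
A quick worked version of that arithmetic, using I = P/V and the resistive loss P = I^2 * R. The 400 watts and .5v come from the post; the 0.1 milliohm path resistance is a made-up illustrative value, not a measurement of any real board.

```python
def die_current(power_w: float, voltage_v: float) -> float:
    """Current drawn by one die, I = P / V."""
    return power_w / voltage_v


def conduction_loss(current_a: float, resistance_ohm: float) -> float:
    """Power lost in the supply path, P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


current = die_current(400.0, 0.5)   # 800 A per die, as in the post

ASSUMED_PATH_OHM = 0.0001           # hypothetical 0.1 milliohm of copper
loss_near = conduction_loss(current, ASSUMED_PATH_OHM)
loss_far = conduction_loss(current, 2 * ASSUMED_PATH_OHM)  # a path twice as long

print(f"{current:.0f} A, loss {loss_near:.0f} W near vs {loss_far:.0f} W far")
```

Under that assumed resistance, doubling the path length doubles the heat lost in the copper, which is why keeping every die the same distance from its supply matters at these currents.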

I'm guessing a similar situation exists in the GPUs. Gotta run to Boston this weekend, if anyone's up there and wants to grab lunch let me know. Next week I'll start posting some pics of a blown 270 board I have here.

C
hero member
Activity: 650
Merit: 500
Pick and place? I need more coffee.
All of the AMD cards I have worked on have three dc/dc supplies on them.

1) 1.5V for the GDDR5. This one is usually fixed and consists of two phases.
2) 0.9V - 1.0V for the GPU memory controller I/O.  This is controllable via firmware and is often one or two phases.
3) 0.8V - 1.2V  for GPU core.  This is always firmware controlled and is usually at least 4 phases but some cards can have 10 or more.
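
The three rails above can be written down as a small lookup for checking test-point readings. The voltage ranges are the ones listed; the 5% tolerance and all the names are assumptions for illustration only, and real cards will differ.

```python
# Expected (min, max) voltage per rail, taken from the list above.
RAILS = {
    "GDDR5": (1.5, 1.5),
    "memory controller I/O": (0.9, 1.0),
    "GPU core": (0.8, 1.2),
}


def check_rail(name: str, measured_v: float, tolerance: float = 0.05) -> bool:
    """Return True if a reading falls inside the expected range.

    tolerance is an assumed fractional margin, not a spec value.
    """
    low, high = RAILS[name]
    return low * (1 - tolerance) <= measured_v <= high * (1 + tolerance)


# Example readings (invented values, not from any real card).
readings = {"GDDR5": 0.0, "memory controller I/O": 0.95, "GPU core": 1.0}
status = {name: check_rail(name, volts) for name, volts in readings.items()}
print(status)
```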

I have seen shorts on all of these.  I have an Asus 280x that has all THREE shorted.  I have checked the gate of each mosfet and they are all good.
Thinking about pulling the chokes but when I apply current limited power to it and watch with a thermal camera the GPU die heats up.  Strange.

One of the cards that keeps popping mosfets is an older Nvidia 760ti.  This card always blows just ONE high side fet.  It will turn on, post, then a variable
amount of time later (5-10 minutes) the PSU shuts down (shorted fet).  I feel the gate controller is to blame, could be wrong.   This card has 4 core phases and it is
not always the same one that pops its high side fet.  This card also has the same core resistance as a working card.
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
Will you be able to post pictures along with what you find when fixing the GPUs? It would be awesome to learn. If all goes well I'll try to find some dead cards for you to fix.
Absolutely. I did that in my Titan and Neptune threads, it's fun to do.
newbie
Activity: 39
Merit: 0
Will you be able to post pictures along with what you find when fixing the GPUs? It would be awesome to learn. If all goes well I'll try to find some dead cards for you to fix.
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
So in terms of GPUs and such I've noticed that there are not a whole lot of chips on a GPU aside from the main chip. So I am guessing the GPU chip pretty much has it all.

If I think about this like an SGI Indigo Elan or Reality Engine, we have these parts in the video system:

Sequence command decoder
Raster memory and Display Generator
Z buffer memory and Texture memory
GE7 Geometry engine GPUs.

They require different power supplies: The decoders and memory are normal 1.8/3.3 volt systems that pull a small amount of current to serve as the hotel load. The real power is needed for the Geometry Engines, and even they had a normal supply for the sequencers and buffers with the big power reserved for the transformation engines, lighting, shading, and polygon calculations.

So if we have a board that appears as a device in Windows it's probable that the hotel circuits are running, but the vector processors are out which might be those power supplies. I'll start sorting boards based on if they come up at all, come up with bad screens (probably Z buffer or raster memory errors) or something else.

Hm.
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
You would think one could run diagnostics on the things to find the bad memory cards; those are easy to swap out but yes, a pain.

Have you tried checking the resistance with the inductors off? Back in the BFL days that was the #1 best way to identify which FET was shorted and also a way to identify a shorted die (0 ohms means infinite current no matter how you slice it).

C
hero member
Activity: 650
Merit: 500
Pick and place? I need more coffee.
Nice to see you working on GPUs.  I have a few stencils coming for tahiti/pitcairn/hawaii and I will take a crack at re-balling some of the units I have.  They look
like they use 0.5mm balls.  Can send you some of my "trouble" units if you want to try to fix them.  I have a few units that keep popping mosfets.  Have been trying
to find out a way to narrow down bad memory chips on cards.  Don't even know if this is possible without changing them one at a time.

Cheers!
newbie
Activity: 18
Merit: 0
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
I had a GPU die in quite a silly way: I had the 1x end of the PCI extender I was using (16x to 1x) plugged in the wrong way, and apparently this killed the card through the extender. If this is something you think you can fix, I will gladly send it to you for shipping cost.
Sure. I'll PM you my address.

C
legendary
Activity: 3094
Merit: 2239
I fix broken miners. And make holes in teeth :-)
Yes, I am in the US, feel free to PM me as needed.
hero member
Activity: 906
Merit: 507
I have a Gigabyte R9 270 you can check out. I lost the receipt so I can't RMA it, but if you're in the US I can ship it to you. I also have a Gridseed blade that needs to be looked at.
legendary
Activity: 3738
Merit: 1708
The most common failure with GPUs is the fans. Depending on which type of fan the card uses, it can be repaired in different ways.

The Sapphire Dual-X R9 280X and Gigabyte Windforce fans all have fan blades that you can easily pop off using some string, and then you can relube the bearing with grease. Works every time, pretty much.

On the more durable fans, like on the ASUS 7970 / ASUS 280x / MSI 280x, you need to drill a hole in the back slightly off-centre and pour in the thinnest oil you can find. This sometimes works great ... sometimes works but rattles.... The reason is that grease would be best, but it's impossible to get it into the bearing.

For the newer RX 470 / 480 the fans will probably start failing sooner or later; however, most of those have a 2-3 year warranty and you can just RMA them.

 