Topic: Request for Discussion: proposal for standard modular rack miner - page 2. (Read 9681 times)

member
Activity: 116
Merit: 101
Initially I was thinking through hole style jumpers or zero ohm resistors with sockets or something physical that a non board level type person could manipulate, but why have a crapload of parts that you hope you never need to use. 

I mean realistically, you could just make a small "re work" section in the PCB art next to each asic that had comms lines readily available for an x-acto knife, and then just a couple of pads set up for shorting Vcore to ground and rerouting the comms with some nice crude solder blobs.

Do the chips have any on board voltage regulation at all for the core voltage? Or do you need to drop your total input voltage, say from 12v to 11.2v?


legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
Could work. If you're already doing board-level soldering, all I'd really need is jumper pads to reroute comms. It'd be best to pull the chips so you don't get comms interference, and once that's done you jumper from VCore to GND pads where the chip used to be. You could also lift the big node-level cap and jump across it. Pads are pretty much free, after all.
Actually, my two-chip "L board" for testing BM1384 is set up a lot like that. I have two power-input jacks, one for Vcore and one for 2*Vcore, and a set of five jumper pads to take comms from either a second chip at the same node (on Vcore) or a second chip at another node (at 2*Vcore). Heck, it's hooked up and running right now (http://eligius.st/~wizkid057/newstats/userstats.php/1BURGERAXHH6Yi6LRybRJK7ybEm5m5HwTr should be seeing 22GH, two chips in a string at 200MHz) so I know the concept is sound.

You'd have to remember to take your core voltage down by one node's worth or your remaining chips will run pretty hot.

On the S5, there are no VRMs at all. 2 chips per node gets them the desired power consumption and hashrate; it's a 30-chip board, just like the S1 had a 32-chip board. Using more chips per node increases your total current (which is one reason the S7 pulls around 400W per board versus the S5's 250W per board, because it has 3 chips wide instead of 2) but you also get better balance. If one chip is running a bit high but another on the same node runs a bit low, they kinda cancel each other out. It's easier to buffer out brief transients with a wider string because any one chip's ripples will be absorbed by the other two chips.
Wider nodes also reduce the number of level shifters required compared to the number of chips. Each node needs a level shifter to bring comm data up to its local ground reference (and other node-level parts for IO voltage and such). The S5 has 15 nodes and 30 chips, so 1 shifter per 2 chips. The S7 has 18 nodes and 54 chips, so 1 shifter per 3 chips. When considering the cost of parts that aren't directly increasing your hashrate, you want to maximize the ratio of ASICs to non-ASICs. This means more ASICs per node.
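The shifter-per-chip ratios above are simple enough to sketch in a few lines. This is only an illustration of the arithmetic, assuming exactly one level shifter per node as described; the function name and the dictionary are made up for the example.

```python
def chips_per_shifter(nodes: int, chips: int) -> int:
    """With one level shifter per node, chips per shifter equals chips per node."""
    if nodes <= 0 or chips % nodes != 0:
        raise ValueError("chips must divide evenly across a positive number of nodes")
    return chips // nodes


# (nodes, chips) per board, using the figures quoted in the post above.
boards = {"S5": (15, 30), "S7": (18, 54)}

for name, (nodes, chips) in boards.items():
    ratio = chips_per_shifter(nodes, chips)
    print(f"{name}: {nodes} shifters for {chips} chips -> {ratio} chips per shifter")
```

A higher number here means fewer non-hashing parts per ASIC, which is the ratio the post says you want to maximize.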

The optimum by that criterion would be to have all chips in one node, but then you have a VRM design and you start factoring in the relatively high cost of VRMs. The more chips you have per VRM the better, so things like the S1 were okay. The standard chip for VRMs has been the TPS53355 which has a maximum current output of 30A, which is great for higher-voltage lower-current chips like the BM1380 on the S1, but not so great for the low-voltage high-current BM1384. At top clock, a '55 could power two chips. At midrange setting (say, 275MHz - 15GH/5.4W per chip) you could just barely run four but it'd probably catch on fire if your ambient was warm. The S5 would have needed 15 VRMs (at probably $5 each minimum) per board to run the same hashrate and those VRMs would have decreased the system efficiency by between 10 and 15 percent. By going string on the S5, Bitmain ended up saving $50 in VRMs by adding about $15 in additional node-level parts, and increased the board-level efficiency by at least 10%.

If you can keep the chips in recommended temperature and voltage range, I'd think the odds of failure would be pretty low. If an auto-rerouting system cost $10 for an otherwise $150 board, you'd want a probability of board failure greater than 10/150 or 6.7% for it to really be feasible in the long run. I highly doubt the odds of board failure are that high or we'd be seeing a lot more threads yelling at Bitmain. I'd be surprised if the odds of Prisma board failure were even that high, and those were famous for spontaneous (and often dramatic) death. I ran 44 boards for six months without any failures.
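The break-even reasoning above can be written out as a rough sketch. It assumes the simplest possible model, which is not stated in the post: a failure without the rerouting system costs the whole board, and the rerouting system prevents that loss entirely. Function names are illustrative.

```python
def break_even_failure_probability(protection_cost: float, board_cost: float) -> float:
    """Failure probability above which the protection pays for itself on average."""
    if board_cost <= 0:
        raise ValueError("board_cost must be positive")
    return protection_cost / board_cost


def protection_pays_off(p_fail: float, protection_cost: float, board_cost: float) -> bool:
    """True when the expected loss avoided exceeds what the protection costs."""
    return p_fail * board_cost > protection_cost


threshold = break_even_failure_probability(10.0, 150.0)
print(f"break-even failure probability: {threshold:.1%}")
print("worth it at 5% failure rate: ", protection_pays_off(0.05, 10.0, 150.0))
print("worth it at 10% failure rate:", protection_pays_off(0.10, 10.0, 150.0))
```

A real comparison would also have to account for partial failures and repair costs, which this sketch ignores.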
member
Activity: 116
Merit: 101
So I assume the 2 chip per node layout is used to reduce part count on the VRM side?

I wonder if doing 1 chip per node and some fet protection circuitry on every chip would increase costs substantially?

Furthermore, it sounds like this is pretty much only relevant to a new chip design with parallel comms.  Unless of course you ran multiple strings on one board and called each string a separate device, again raising complexity and parts count. 

Any idea what the probability of failure actually looks like for these chips?

Lastly, not sure on the cost difference and board layout implications, but you could also offer this as a manual repair solution.  Simply place bypass jumper headers near each chip.  In your example, with 2 chips per node, failing closed would result in the auto voltage reduction at the source, but failing open would take down a string/board.  You could simply pull the board, adding a jumper to force the node closed.  Low tech but it gets you hashing again quickly.



legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
I guess the bypass FET only works if the chip fails open or stops hashing - conditions where the current passing through the chip is substantially reduced. If the chip burns up and fails short, the voltage across that node is effectively zero.

Let's use S5 voltage levels as an example. It's easy. There are 15 pairs of chips and your total voltage is 12V, so each pair (the chips in each pair are in parallel with each other) sees 0.8V (12V/15). Each chip draws 12A at 0.8V to get 22GH.

Say one chip fries short; now instead of passing 12A at 0.8V alongside its partner (also passing 12A at 0.8V), the fried chip passes 24A at approximately zero volts. You now have effectively 14 pairs, so your per-node voltage increases from 12/15 to 12/14 or 0.86V. If you detect a condition like this, and you can control your total voltage, you can lower it to compensate. If your chips don't relay comms from one to the next, the rest of the chips might keep working without a hitch.
However, with Bitmain chips, if the chips drop out enough to no longer communicate properly, they can't relay work to other chips and they'll turn off. This upsets the current balance of the whole string and it stops working.
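The fail-short case can be sketched numerically with the S5-style figures used above (12V across 15 nodes). This treats every working node as an identical load that splits the string voltage evenly, which is a simplification, and the function names are made up for the example.

```python
def node_voltage(total_voltage: float, working_nodes: int) -> float:
    """Voltage each working node sees when the string splits the total evenly."""
    if working_nodes <= 0:
        raise ValueError("need at least one working node")
    return total_voltage / working_nodes


def compensated_total_voltage(total_voltage: float, nodes: int, shorted_nodes: int) -> float:
    """Total voltage that restores the original per-node voltage after shorts."""
    per_node = node_voltage(total_voltage, nodes)
    return per_node * (nodes - shorted_nodes)


healthy = node_voltage(12.0, 15)             # all 15 nodes working
after_short = node_voltage(12.0, 14)         # one node fried short
new_total = compensated_total_voltage(12.0, 15, 1)

print(f"healthy node voltage:      {healthy:.3f} V")
print(f"after one shorted node:    {after_short:.3f} V")
print(f"total voltage to drop to:  {new_total:.2f} V")
```

Dropping the supply by one node's worth only helps if the remaining chips can still talk to each other, as the post points out for relayed comms.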

Say one chip fries open; now instead of passing 12A at 0.8V alongside its partner, the fried chip passes 0A at 0.8V. The circuit as a whole still needs to pass 24A for the other chip pairs to work, so the voltage at the damaged node will increase until the one functioning chip is passing 24A - either because now its core voltage is so high it's actually drawing that much current to operate, or (more likely) because it fries. If you have a bypass FET, when it sees the node voltage increase (because the apparent resistance suddenly doubled by taking out one chip, remember V=IR) the FET will kick on and start sinking 12A. This will bring the total current back up to 24A without breaking any more chips, and the node voltage will return to the 0.8V it's supposed to be at. Instead of two chips each hashing at 12A, you have one chip hashing at 12A and one as a dummy load at 12A. Again, if your chips don't relay comms the rest of the chips will keep working, but if your down chip can no longer relay, all the downstream chips it should be feeding work to will turn off and the string stops working.

The on resistance of the MOSFET is actually the part being actively controlled by the op-amp circuit. It'll be operating in the linear region (instead of switching between off and saturated) as a high-power variable resistor. The power loss through the FET will be the same as the power loss from a working ASIC. It has to be so, because for the rest of the system to work properly, there needs to be a chip there sinking 12A at 0.8V - either an ASIC, or a dummy load.
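As a rough sketch of what the bypass FET has to do in the fail-open example, the numbers from the post (24A string current, one surviving chip at 12A, 0.8V node voltage) give the FET's operating point directly. The names below are illustrative, and the "effective resistance" is just V/I for the linear-region FET, not a datasheet Rds(on).

```python
from typing import NamedTuple


class FetOperatingPoint(NamedTuple):
    current_a: float
    power_w: float
    effective_resistance_ohm: float


def bypass_fet_operating_point(
    node_voltage: float, string_current: float, working_chip_current: float
) -> FetOperatingPoint:
    """Current, dissipation and V/I resistance of a FET acting as a dummy load."""
    fet_current = string_current - working_chip_current
    if fet_current <= 0:
        raise ValueError("the working chips already carry the full string current")
    return FetOperatingPoint(
        current_a=fet_current,
        power_w=node_voltage * fet_current,
        effective_resistance_ohm=node_voltage / fet_current,
    )


point = bypass_fet_operating_point(0.8, 24.0, 12.0)
print(f"FET sinks {point.current_a:.1f} A and dissipates {point.power_w:.1f} W")
print(f"effective resistance: {point.effective_resistance_ohm * 1000:.1f} mOhm")
```

The dissipation comes out the same as a working chip at 12A and 0.8V, which is the point made above: the FET burns the chip's power without doing the chip's work.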

If you can actually reroute chained comms around bad nodes entirely, you could use a switched FET to draw all the current through a node and drop its node voltage to near zero (Inode*Rdson) which would basically turn off both chips whether one still worked or not. You'd then need to drop your total voltage by one node's worth.
For example, say you have six chips in three nodes: A+B, C+D, E+F. Comms go A->B->C->D->E->F. If chip D suddenly drops out, you can bypass the entire node by shorting it out (which turns off C+D) and redirecting comms so you now have A->B->E->F. It's definitely possible, but not really easy. You'd have to have a way of very rapidly determining which node wasn't functioning right (which I guess could be done with node voltage threshold measures - if it gets too high or too low, disable the node and latch it off), and then rig up multi-channel switches on every node. If bad-node voltage was almost zero, the level shifters from the last working node would still work for the next node up, but you'd need a set of dual-throw switches on every node. Possible, but cumbersome. Course now I'm interested in actually designing it and seeing how bad that would be. If it's only a matter of a couple bucks per board it might be worth putting in, since you'd be chopping a lot of the more probable board-failure conditions to reduced-capacity condition instead.
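The six-chip example can be modelled as a small sketch. It only shows the bookkeeping (which node gets shorted out and what the comms chain looks like afterwards), not the switching hardware; the function name and data layout are assumptions for illustration.

```python
def reroute_comms(nodes: list[list[str]], failed_chip: str) -> tuple[list[str], list[str]]:
    """Bypass the whole node containing failed_chip.

    nodes is a list of nodes in comms order, each a list of chip names.
    Returns (new comms chain, chips in the bypassed node).
    """
    chain: list[str] = []
    bypassed: list[str] | None = None
    for node in nodes:
        if failed_chip in node:
            bypassed = node      # short this node out; both of its chips turn off
            continue
        chain.extend(node)       # healthy nodes stay in the chain, order preserved
    if bypassed is None:
        raise ValueError(f"chip {failed_chip!r} is not on this board")
    return chain, bypassed


board = [["A", "B"], ["C", "D"], ["E", "F"]]
chain, bypassed = reroute_comms(board, "D")
print("new comms chain:", " -> ".join(chain))
print("bypassed node:  ", "+".join(bypassed))
```

With one of three nodes bypassed, the total voltage would also need to come down by one node's worth, as noted above.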

I guess the feasibility of the system is entirely dependent on the probability of chip failure which disables the entire board without the system. Economically, if the percent increase in the cost per board is greater than the probability of the failures it prevents, it's actually a net loss for the customer. On average.
member
Activity: 116
Merit: 101
Interesting...

So help me get on the same page regarding the basic power distribution of these systems.  You have your PSU, which outputs some voltage, nominally 12v.  Some supplies may have the ability to tweak this voltage slightly?

On an S1 this 12v goes to a VRM (a buck converter?) that drops the 12v to your core voltage, which is then seen by 8 chips in parallel, so if any one goes down, the other 7 chips would still see power, unless the chip fails in such a way that it shorts out the supply, and then you lose all 8 chips.

In a string topology, I assume the chips are powered in series? And this is where your "total voltage" comes from, which is where you would use the voltage divider to get your 1.1*v_core_expected? Is the on resistance of a MOSFET significant at these levels and hence the power loss through the MOSFET? I guess I have a hard time understanding why you can't just use your voltage control to lower the total voltage across the string by 0.7v or whatever node voltage. Then you bypass the chip and the rest of the string keeps working? Please pardon my lack of knowledge here, I'm just trying to wrap my head around the whole thing.

legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
One thing that could be implemented to maintain a string is I think something ASICMiner attempted on the Prisma - basically, FET bypass. You could put basically an analog voltage-detection circuit on each node that monitors the chips' core voltage per node and drives a bypass FET in parallel with the chips to maintain that core voltage. I'd probably use a simple op-amp circuit with a threshold reference of about 1.1* an externally-generated Vcore (basically, a fixed resistor divider across your total voltage giving you expected per-node voltage plus ten percent) driving the FET. If a chip drops out, the current through that node drops and the voltage spikes briefly - it'll be partially buffered by your node-level caps. If the op-amp detects the node-level voltage goes up to 110% of expected value, it'll kick on the FET which will start to draw excess current through it and buffer back down. The issue with this is your bypass FET is now sinking the entire power typically burned by an ASIC, meaning you're still running the same amount of power but doing 1 chip's-worth less work.
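A minimal sketch of the threshold logic described above, assuming S5-like numbers from elsewhere in this thread (12V total across 15 nodes) and the 10% margin mentioned in the post. It models only the comparator decision, not the analog behavior of the op-amp or the node caps, and all names are illustrative.

```python
def bypass_threshold(total_voltage: float, nodes: int, margin: float = 0.10) -> float:
    """Per-node trip voltage: expected node voltage plus the given margin."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return (total_voltage / nodes) * (1.0 + margin)


def fet_should_conduct(node_voltage: float, threshold: float) -> bool:
    """The bypass FET starts sinking current once the node rises above the threshold."""
    return node_voltage > threshold


threshold = bypass_threshold(12.0, 15)
print(f"trip point: {threshold:.2f} V")
for measured in (0.80, 0.86, 0.95):
    state = "bypass on" if fet_should_conduct(measured, threshold) else "bypass off"
    print(f"node at {measured:.2f} V -> {state}")
```

Because the reference comes from a divider across the total voltage, the trip point would track any deliberate change to the supply voltage.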

That system worked for ASICMiner because of how the comms worked on the BE200. Each chip was on a buffered parallel bus with individual addressing. One chip turning off didn't affect the operation of any other chip, at least directly. It wouldn't really work as well for BM1385, depending on the failure condition. BM1385 (if I understand correctly, and it was the case for BM1380, 82 and 84) relays comms, so the first chip talks to the second which talks to the third and so on. If one chip drops out in a way that it stops relaying comms, everything downstream from it will also turn off. A bypass FET to pick up the slack of a downed chip is okay if the chip is still talking but not hashing; in the event of a downed chip not talking, everything downstream will stop hashing and every node will be in full bypass mode - every FET will have to sink the power of all the chips on that node, and probably burst into flames pretty quickly.
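The difference between relayed and parallel comms can be shown with a toy model. It only tracks which chips can still receive work after one chip goes silent; it does not model the current imbalance that would then take down a string. The function name and chip indexing are assumptions for illustration.

```python
def chips_receiving_work(chip_count: int, silent_chip: int, relayed: bool) -> list[int]:
    """Indices of chips that still get work after silent_chip stops talking.

    relayed=True:  each chip passes work to the next, so everything
                   downstream of the silent chip is cut off.
    relayed=False: individually addressed parallel bus, so only the
                   silent chip itself is lost.
    """
    if not 0 <= silent_chip < chip_count:
        raise ValueError("silent_chip must be a valid chip index")
    if relayed:
        return list(range(silent_chip))
    return [i for i in range(chip_count) if i != silent_chip]


print("relayed: ", chips_receiving_work(6, 2, relayed=True))
print("parallel:", chips_receiving_work(6, 2, relayed=False))
```

In the relayed case the loss grows the closer the failed chip sits to the start of the chain, which is the reliability argument for a parallel bus.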

That's one reason I'm hoping, if PlanetCrypto can get a chip dev project going, he'll go with parallel comms instead of relayed. It'll require a bit more work for the node-level comms shifting, but by fully parallelizing everything you increase overall reliability. If you were running an S1 (definitely not string) and you blew the first VRM, everything would stop hashing because the first 8 chips turned off and now you couldn't communicate with the last 24 which were still working perfectly. Compare that to an AM Tube, where you can smoke as many VRMs as you want and everything else keeps going.

You might ask PlanetCrypto what he has access to for simulation packages. Novak's last job centered around doing CFD for jet turbines so I know he knows a thing or two but I really don't know anything myself.
member
Activity: 116
Merit: 101
Work has picked up a bit as I get ready to enter port and pack up all our gear. 

I will have to investigate what simulation packages I have access to through our license pool, but I have a feeling to do this right I might need to get some third party software like COMSOL or FLUENT and export our model's geometry into one of those multiphysics packages. I really hesitate to post anything I have from the "express" simulation packages I have on board as I doubt their completeness (I get outputs like average velocity over the entire flow path, as opposed to seeing turbulent pockets etc).

I just don't want to promise the world regarding simulations and then not deliver, as there is both the issue of gaining access to a reputable CFD/multiphysics package, as well as the learning curve associated with getting useful data out of it. 

I will attempt to get another model rendered up reflecting the latest ideas before I get into port tomorrow. But it will probably be a week at least until I have a better handle on what I can actually produce simulation wise. 

As for the density issue, I agree with you sidehack. Reliability is key, and furthermore, I like 10 x 10watt chips a lot better than 1 x 100watt chips. On that note if a chip goes down in string topology, does that mean the entire string MUST go down? Or are there any possible failure modes that don't result in losing the whole string?


legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
I do a bit of advising for a guy in a somewhat equatorial region, who bought several SP35 earlier this year. He set them up in a climate-controlled room with AC venting directly into their intakes and still can't sustain more than about 90% rated speed from the lot. Meanwhile it's not too difficult to get 20% over rated speed out of an S5 without any problem. Yeah I'd say there's something wrong with their rating system.

For the machine design being proposed here, I'd like to see at least 260W per blade stock rating sustainable in 30C ambient. It'd be nice if it could handle 320W per blade without a lot of danger. I know some of the danger is going to depend on the board design itself, low-resistance chip-to-sink connection and such, but a lot of it is also the design of the heatsink itself. I'm looking forward to seeing some of Witrebel's simulations. If we can get the basic layout hammered out, the next step is building a sturdy heatsink which can perform inside the specified dimensions.
legendary
Activity: 1022
Merit: 1003
I feel that a large part of that problem is their rating as well as the design. Bitmain rates miners at a reliable level of clocking, which can be maintained more or less indefinitely. SP-Tech (especially notable with their SP20) has ratings that for all intents and purposes were unachievable, and completely high-strung.
legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
They're so dense that it's difficult, if not impossible, to keep one running at top speed - which is to say, advertised rated speed. The overall complexity of their designs is definitely pretty bad. I mean I like that they give you plenty of information on what power is going where, but 200A 4-phase bucks into BGA chips with a custom protocol in an FPGA on the control board is not a recipe for reliability even if your stuff doesn't run 110C.

That's why this proposal intends to keep things relatively simple. Increasing complexity simultaneously increases both the odds of failure and the difficulty of maintenance - not to mention the initial cost.
legendary
Activity: 1666
Merit: 1185
dogiecoin.com
Quote

I tend to think of Spondoolies first when I think of power density, and I've never had one run nearly as well as advertised because they were so finicky about heat. Add the order of magnitude complexity increase because of their 100W BGA idea to insane internal temperatures and they're actually sorta terrible. There comes a point where density murders reliability, and reliability is one of my primary design requirements.

I don't believe that any of their problems are due to heat at all; if ambient is high, the fans scale up or the frequency / voltage scales down. The real problem is the custom controller's high failure rate out of warranty; the entire setup is crazy complicated.
legendary
Activity: 3822
Merit: 2703
Evil beware: We have waffles!
My take on the s7 and then 'nuff said about it:
Ya I will get at least 1 after they start delivering and we hear more about them. As to their pricing... well my first s2 was around $1,250 as I recall and the 1st s4's the same so, not too terrible for first release.

I will probably finally retire my last s2 that's on-line and a few S3's to free up power for the s7(s). Also have 2 s4's that are starting to act up by dropping boards from time to time... All in all it is a very good balance on power/replaced hash rate but kills me to take ANY perfectly good miners offline.

On the hot turbines -- that exhaust fan will need to be watched as most are NOT happy in the hot side and their bearing life tables reflect that. They should start looking at selling the s7 in pairs and use a dual-squirrel cage blower for them. I happen to have one in the shop and will do just that.
We now return to our regular programming.
legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
You're right, it's not relevant. It's for the customer to figure out.
legendary
Activity: 1022
Merit: 1003
Quote

Common in kitchens. Not so common elsewhere. Also, I'll point out that 24A at 110 is easily handled by *2* circuits, most common US house outlets being on at least a 15A circuit. *2* circuits in a room are more common, even some of the MOBILE HOMES I've owned came with a separate circuit on each side in at least the living room, going back into the 70s when I started paying attention to that sort of thing.

How do you propose to power 3x PSU's evenly between 2 15A circuits?  2x PSU's on one circuit would make for 16A, and I'm not aware of a legal/safe way to balance 3x loads between 2x 110V circuits.  Not that this is relevant, but for discussion's sake.
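The objection above can be checked by brute force. This sketch assumes the 24A total splits equally across three PSUs (8A each) and that each PSU plugs into exactly one circuit; it then tries every assignment of PSUs to circuits and keeps those that stay within the breaker rating. Names are illustrative.

```python
from itertools import product


def feasible_assignments(
    psu_amps: list[float], circuit_limit: float, circuits: int
) -> list[tuple[int, ...]]:
    """All ways to plug each PSU into one circuit without exceeding the limit."""
    results = []
    for assignment in product(range(circuits), repeat=len(psu_amps)):
        loads = [0.0] * circuits
        for amps, circuit in zip(psu_amps, assignment):
            loads[circuit] += amps          # total draw on each circuit
        if all(load <= circuit_limit for load in loads):
            results.append(assignment)
    return results


psus = [8.0, 8.0, 8.0]   # 24A total split across three supplies (assumed equal)
print("two 15A circuits:  ", len(feasible_assignments(psus, 15.0, 2)), "valid layouts")
print("three 15A circuits:", len(feasible_assignments(psus, 15.0, 3)), "valid layouts")
```

With two circuits, some circuit always ends up carrying two supplies, which is the 16A case raised in the post.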
legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
I tend to think of Spondoolies first when I think of power density, and I've never had one run nearly as well as advertised because they were so finicky about heat. Add the order of magnitude complexity increase because of their 100W BGA idea to insane internal temperatures and they're actually sorta terrible. There comes a point where density murders reliability, and reliability is one of my primary design requirements.

If we can safely put more heat into this machine (on pure aircooling) without losing reliability or becoming a fire hazard, I'm okay with that. Waterblocking is a secondary concern, but also one of the reasons to go with S1 board size - any C1 waterblock or aftermarket watercooler would drop right in and not require a new design there. That product already exists and is pretty good. Home mining is also a secondary concern so noise is a secondary concern - but it shouldn't be deafening either. 80dB fans is kinda stupid any way you look at it and pushing mechanical components to that extent probably sorely limits lifetime.

If the portion of panel over the rear of the cards is shorter than the gap between the cards and front fans, a board could be unscrewed, slid forward and lifted out without requiring anything else to be moved or disassembled. That's great. Keeping that rear portion in place (the angle and under the heatsinks) makes sense too as it keeps airflow pulling toward the fans without a lot of turbulence around the supplies.
member
Activity: 116
Merit: 101
Looking at the flowpath, I would advocate for at least the rear portion of the ducting, perhaps including the angled bit, and ending just at the back of the hashing cards, or maybe extending an inch or so over them. I feel like this would help the flowpath significantly, but this is totally based on intuition and opinion, and I would strongly recommend we attempt some actual CFD simulations before settling on the exact flowpath layout. High level though, I would say you could lose the panel over the cards. I don't know how much those PSU's need to breathe, so it's hard to say if they could pull all their intake through the side space, but it doesn't seem impossible.

I also think it depends what sort of negative pressure the 3 120mm fans end up creating inside.  You wouldn't want that negative pressure fighting your PSU fans too much, but with the entire front face meshed and open, I doubt that would be a serious issue. 



As for the S7, the overpriced part is true, and while the jet turbine part is also true, I think it depends who you are building this miner for. And let's not forget the fact that you aren't just designing a miner, you are designing a standard/form factor that is to be backward/forward compatible.

If you are designing for the home mining guys who want a few of these in the room or a room over from where they sleep, with wives and such, then yeah, jet turbine solutions don't cut it. 

But if you are designing this towards the economies of profitable bitcoin mining, then I think some allowances need to be made in regards to noise. I am beginning to see the light about density, in terms of hash power per unit of volume. Chip efficiency held constant, the only way to increase this is to increase the thermal capacity of the unit. Which means either waterblocking, immersion cooling, or more airflow/better heat sinks. Not saying that 2.2kw in a 4U case isn't dense enough. But it certainly isn't as dense as the competition is trending towards.
legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
Okay. Bitmain built another overpriced jet turbine and now there are even more perfectly good S[odd] chassis about to be retired.

My opinion is, 2x the power in S5 volume is asking for trouble - especially if it has the same firmware bug of excess heat generation on network dropout. Instead of melting a bit it'll probably go full Prisma.

What do you suggest we do differently in a rack machine because Bitmain built a 1.2KW "home use" miner with no innovation, dual 4200RPM fans and three hundred tiny glued-on heatsinks? Overall, I'm fairly disappointed.
legendary
Activity: 4354
Merit: 9201
'The right to privacy matters'
Okay the ant miner s-7 dropped today.

Specs are unreal 4800 gh at 1200 watts

Box looks like 1/3 of an s-5+

I will suggest everyone look at it closely then come back to this thread.

This s-7 is like the size of an s-5 and does more than 4x the hash, for 2x the power.
legendary
Activity: 3416
Merit: 1865
Curmudgeonly hardware guy
Is the panel above the hashboards necessary? I can see how it'd be undesirable to suck in hot air from the hashboard exhaust into your PSU for "cooling". I ask because having a panel there will make accessing cards more difficult. The panel would have to be cut short or punched out for power and signal connector access, but now you have a lot more disassembly required in order to get to your hashboards for installation or maintenance. If it's necessary it's necessary (and given the kind of heat those cards will be generating, it's probably necessary) but that's going to make other things more difficult.

Could we make a kind of duct enclosing the backplane - maybe even open at the top for ready access but more-or-less sealed by the case lid - through which the PSUs can draw fresh air in through that two-inch side space instead of from the hashboard airflow? You could isolate the PSU intake from the hashboard heat without requiring a separator panel above the hashcards, which will keep access to them pretty easy. You wouldn't have to worry about getting your cabling out of the way before removing the separator to get to your hashboards, which might be handy if you need to do something to one board without shutting down the other seven. You also aren't limited to keeping your PCIe jacks in a predefined area of the board to be accessed through fixed punchouts.

That two-inch controller space can probably be shrunk without causing a lot of problems. If a small SBC were mounted vertically and the USB/interface board put above it (basically, both boards flat against the side wall) that space could probably be more like one inch. Some room for ventilation would need to be left open if this was providing cool air intake for the PSUs.


Load-balancing redundancy is indeed a pain for supplies not designed for it. If it's possible to allow load balancing for supplies with internal support for this, without making the option required (which means also allowing split-rail operation of non-balancing supplies) and also without requiring a lot more expense or complexity, I'd like to keep it possible. If it's too cumbersome to rig up a setup allowing for both configurations, I think split-rail should be kept instead of mandating a matched set of load-balancing PSUs. However, probably some refinement on the two-point cabling idea presented earlier should allow the end user to decide between split and common rails without requiring complex circuitry.
legendary
Activity: 1498
Merit: 1030
It's only common at the breaker box and in the main supplying the box, NOT in the home itself.
 NOT the same thing.

 Having done more than a little rewiring over the years, I was FULLY aware of how power commonly arrives at the breaker box - but you can't plug a miner into a breaker box much less a main supply.

And how many typical homes have rooms wired for a >24A load between 3 circuits using standard 110/120V outlets?


 Common in kitchens. Not so common elsewhere. Also, I'll point out that 24A at 110 is easily handled by *2* circuits, most common US house outlets being on at least a 15A circuit. *2* circuits in a room are more common, even some of the MOBILE HOMES I've owned came with a separate circuit on each side in at least the living room, going back into the 70s when I started paying attention to that sort of thing.

Quote

My point is that 240V can be had for those serious enough


 Never disagreed with that point, just pointing out it wasn't common in homes without rewiring work.
 Kinda redundant debate at this point, sidehack having mentioned he has no interest in this being intended as a home miner.



 
Quote

I focused on trying to maximize the PSU intake tract


 1u x 2u = 1.75" x 3.5" (actually a hair less due to case thickness). Figure if your "track" is 1" before you widen it, need 5.25" or so worth of width to feed each PS.




 Load balanced redundancy is a PITA to try to set up with PS that are not specifically designed for it. Takes a fair bit of additional circuitry to make it work reliably. I'd skip the whole idea.