Dude, bitfury is the lead designer at their shop. If I know anything about people of bitfury's calibre, I'd say he will have the core already debugged in the simulator before tapeout. This really isn't looking like another seat-of-the-pants outfit.
Yes, this is why I have not posted. Initially, before the successful tapeouts of ASICMINER and then Avalon, I did not believe what experienced designers told me: that for logic you basically get the same result in hardware as in simulation. I was simply too conservative to trust such claims. So at first I chose a cheaper technology node for testing, and the core was implemented and optimized for the 150nm node. But AFTER those other successful tapeouts, and knowing that simulation matched hardware results well at 130nm and 110nm, I changed my mind and started to trust it.
So, these are the simulations performed:
1) Functional simulation: the core as a whole works and computes correctly, with correct timing, in all corners (i.e. typical, slow and fast wafer);
2) Flip-flop setup/hold time simulation. This is trickier to explain, but under no circumstances may you have fast paths or clock skew that violate hold time: a hold violation CANNOT BE FIXED BY LOWERING FREQUENCY, while a setup violation can. This is simulated in all corners, plus Monte Carlo simulations to understand how variations affect sampling, plus the voltage variations caused by the power consumption of the CMOS logic. To clarify: a fast path is a wire between flip-flops where the signal gets from one flip-flop to the next too quickly and violates the HOLD requirement. The HOLD requirement is the time during which the signal at a flip-flop input must not change, at (and just after) the clock edge. This is typically a very short period of time, but unfortunately, if it is violated, the design won't work even at 1.0 Hz. Clock skew is the other danger: the clock distribution delays (and especially their tolerances, as components inside the chip are not precise) must not violate the sampling time of the flip-flops;
3) Power grid simulation. This one is actually simple: it confirms that there are enough bypass capacitors and that no parasitic resonances appear with a magnitude that could affect hash core performance. That's it. Unfortunately, 26%-28% of the DIE AREA is just capacitors, not transistors... not logic... That's a big sacrifice, but without those capacitors placed near the flip-flops the design won't be stable, especially at low voltage;
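The point in item 2 that lowering the frequency fixes setup but not hold can be made concrete with the usual static-timing slack formulas. This is a minimal sketch with made-up delay numbers (none of them come from this chip): the clock period appears in the setup check, but not in the hold check at all.

```python
from dataclasses import dataclass

@dataclass
class TimingPath:
    """One flip-flop to flip-flop path; all times in ns (illustrative values)."""
    clk_to_q: float    # launching flop clock-to-Q delay
    logic_min: float   # fastest combinational delay (matters for hold)
    logic_max: float   # slowest combinational delay (matters for setup)
    setup: float       # capturing flop setup requirement
    hold: float        # capturing flop hold requirement
    skew: float        # capture clock arrival minus launch clock arrival

def setup_slack(p: TimingPath, period_ns: float) -> float:
    # Data must settle `setup` before the NEXT capture edge, so the period appears here.
    return (period_ns + p.skew) - (p.clk_to_q + p.logic_max + p.setup)

def hold_slack(p: TimingPath) -> float:
    # Data must stay stable `hold` after the SAME edge: there is no period term.
    return (p.clk_to_q + p.logic_min) - (p.hold + p.skew)

# Made-up example paths, not taken from the real design.
long_path = TimingPath(clk_to_q=0.3, logic_min=2.0, logic_max=4.5, setup=0.1, hold=0.05, skew=0.0)
fast_path = TimingPath(clk_to_q=0.1, logic_min=0.02, logic_max=0.05, setup=0.1, hold=0.15, skew=0.1)

for period in (4.0, 5.0, 1e9):  # 1e9 ns is a 1 Hz clock
    print(f"period {period:g} ns: long-path setup slack {setup_slack(long_path, period):+.2f}, "
          f"fast-path hold slack {hold_slack(fast_path):+.2f}")
```

The long path fails setup at a 4 ns period and passes at 5 ns, while the fast path's hold slack stays negative (-0.13 ns here) even with a 1 Hz clock, because neither its delay nor the positive skew changes with frequency.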
For 65nm, unlike 150nm, the whole design is static CMOS, i.e. dynamic nodes had to be removed because leakage is too high.
What is beyond my experience for now is simulating, say, yield. That prediction is quite speculative, but we expect no worse than 90% of chips to work completely, and maybe about 0.1% of chips to be totally malfunctioning. Performance variation should be around ±20% chip to chip.
Final die dimensions chosen: 3.8 x 3.8 mm
Package: QFN48
Performance: 3.3 GH/s _rated performance_, about 7 GH/s maximum
Power consumption: 1 W at _rated_ performance @ 0.6 V, 6 W at _maximum_ performance @ 1.0 V.
Thermal characteristics of package: 2 K/W junction-to-PCB and 34 K/W junction-to-ambient.
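The two operating points above can be compared directly in W per GH/s, and they also hang together under the textbook first-order rule that dynamic CMOS power scales with V² times frequency. A small sketch: the V²·f rule, and the assumption that hashrate is proportional to clock frequency with leakage ignored, are assumptions here, not figures from the post.

```python
# Operating points quoted in the spec list above.
RATED = {"ghs": 3.3, "watts": 1.0, "volts": 0.6}
MAXIMUM = {"ghs": 7.0, "watts": 6.0, "volts": 1.0}

def watts_per_ghs(point: dict) -> float:
    """Efficiency metric: lower is better."""
    return point["watts"] / point["ghs"]

def scaled_power(base: dict, volts: float, ghs: float) -> float:
    """First-order dynamic CMOS power, P ~ C * V^2 * f, assuming hashrate ~ f
    and ignoring leakage (an assumption, not a measured model of this chip)."""
    return base["watts"] * (volts / base["volts"]) ** 2 * (ghs / base["ghs"])

print(f"rated:   {watts_per_ghs(RATED):.2f} W per GH/s")
print(f"maximum: {watts_per_ghs(MAXIMUM):.2f} W per GH/s")
print(f"V^2*f prediction for 1.0 V / 7 GH/s: {scaled_power(RATED, 1.0, 7.0):.1f} W")
```

This prints about 0.30 and 0.86 W per GH/s, and a scaled estimate of about 5.9 W, close to the quoted 6 W. The same silicon can sit anywhere between the two points; which W per GH/s to run at is an operating choice.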
So at _RATED_ performance the chip would work without any heatsink: with 40 degrees in the room, the chip would be at about 75 degrees. Very good, without fans and without heatsinks. (THIS IS THE PERFORMANCE AT WHICH ALL DEALS FOR THESE CHIPS ARE ACTUALLY MADE.)
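The 75-degree figure follows from the usual single-resistance thermal model, T_junction = T_reference + P × θ. A sketch using the package numbers quoted above; the same arithmetic at 6 W shows why the maximum setting needs the heat pulled into the PCB rather than into the air.

```python
def junction_temp_c(reference_c: float, watts: float, theta_k_per_w: float) -> float:
    """Steady-state junction temperature: reference temperature plus P * theta."""
    return reference_c + watts * theta_k_per_w

THETA_JA = 34.0  # K/W junction-to-ambient, QFN48 figure quoted above
THETA_JP = 2.0   # K/W junction-to-PCB, figure quoted above

print(junction_temp_c(40.0, 1.0, THETA_JA))  # rated point, bare package in a 40 C room
print(junction_temp_c(40.0, 6.0, THETA_JA))  # maximum point with no extra cooling
print(6.0 * THETA_JP)                        # junction rise above the PCB at 6 W
```

This prints 74.0, 244.0 and 12.0. The rated point lands at 74 degrees, matching the roughly 75 quoted. At 6 W the junction-to-ambient path alone would imply an impossible 244 degrees, whereas through the 2 K/W junction-to-PCB path the junction sits only about 12 K above the board, so the board itself has to be kept cool.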
However, MAXIMUM performance is not overvoltage; it is still within the gate oxide reliability envelope (oxide thickness is 20 angstrom, so you can apply 1.0 V for long-term operation, and 2.0 V for likely half a year to one year of operation).
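As a rough first-order check (ignoring work-function and depletion effects), the field across the gate oxide is just the applied voltage divided by the oxide thickness, so going from 1.0 V to 2.0 V doubles the field the oxide has to withstand:

```latex
E_{ox} = \frac{V}{t_{ox}}, \qquad t_{ox} = 20\,\text{\AA} = 2\,\text{nm} = 2\times10^{-7}\,\text{cm}

E_{ox}(1.0\,\text{V}) = \frac{1.0\,\text{V}}{2\times10^{-7}\,\text{cm}} = 5\,\text{MV/cm}, \qquad
E_{ox}(2.0\,\text{V}) = \frac{2.0\,\text{V}}{2\times10^{-7}\,\text{cm}} = 10\,\text{MV/cm}
```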
So with QFN48 it would be quite challenging to dissipate 6 W: you would really have to work on cooling, and would likely use an aluminium (Al) PCB. How to deal with that is up to Dave. In my experience this would be very labor-intensive... So maybe they would convert/upgrade the equipment later, as they have cheap electricity. However, do not take this as an endorsement: I haven't worked with Dave yet and cannot guarantee anything except that the chips will work. As for the chips, I can guarantee that they will work: about 96% probability that the first launch goes without delays and 4% that it has delays, and for this mine I also provide backup chips from another vendor in case of unacceptable delays. So the overall risk that this project ends up without chips should be less than 0.1%. The remaining risk you should assess is that building such MINES is not simple; for example, building a single BitFury 110 GH/s rack took 2 weeks of labor. This is what has to be prepared on Dave's side. Labor is expensive in the US. Doing it on his own, he would likely need 4 weeks of hard work to assemble it.
I thought about posting an image of the core, but seeing that BFL posted a core image with those black boxes over it, I finally decided not to post and to wait until their tape-out is confirmed :-) I want to be absolutely sure that they have a maskset.
I didn't know that he's Polish or had chosen to work through a Polish scientific establishment. Through Europractice he should have no problem accessing the latest 40nm and 28nm processes at a deep discount.
Choosing a scientific establishment is very good for such small orders. Really, this thing is more research than production. I would like everyone to understand that what we do, i.e. a $500k order or even a $5m order, is peanuts for foundries, especially on smaller tech nodes. To be treated seriously as a direct foundry customer, you would either have to be somehow important to them, or your order volume would have to be at least 200 wafers monthly (to be a small but direct customer). That's more than even BFL with all of their preorders could sell :-) 3.3 _PETAHASH_ per month :-) Plus, wafers are cheap but assembly is not, and neither is runtime. To explain why this is small: a SINGLE SEMICONDUCTOR FABRICATION PLANT (i.e. a single tech node _LINE_) typically produces 40'000 - 80'000 _WAFERS_ per month. With orders of 6 or 12 wafers and no regular demand, you're too small. I really hope that some day demand for proof-of-work chips will be high and that mining devices will be available globally. But that is not today. Today we're small, and we should go forward in small steps.
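The "you're too small" argument is plain arithmetic on the figures in the post: order size as a share of one line's monthly output, plus the per-wafer hashrate implied by "200 wafers monthly = 3.3 PH/s". A small sketch using only those quoted numbers:

```python
FAB_WAFERS_PER_MONTH = (40_000, 80_000)  # single-line output range quoted in the post

def share_of_fab_pct(order_wafers: int, fab_wafers: int) -> float:
    """Order size as a percentage of one line's monthly wafer output."""
    return 100.0 * order_wafers / fab_wafers

for order in (6, 12, 200):
    small_fab, large_fab = (share_of_fab_pct(order, cap) for cap in FAB_WAFERS_PER_MONTH)
    print(f"{order:>3} wafers: {small_fab:.4f}% of a 40k line, {large_fab:.4f}% of an 80k line")

# Hashrate per wafer implied by the post's "200 wafers monthly ~ 3.3 PH/s".
print(f"{3.3 * 1000 / 200:.1f} TH/s per wafer")
```

Even the 200-wafer "small direct customer" threshold is only 0.25% to 0.5% of one line's output, and a 6 or 12 wafer order is a few hundredths of a percent.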
Waiting for a dev board so I can write the cgminer driver for you
Edit: of course, contact me if you want any suggestions about the MCU design (not doing the design, just optimal details of its design)
Well. The chips will work in chains over an SPI protocol driven by a state machine. This was tested and worked nicely in the second generation of my FPGA boards. I.e. instead of device addresses, I just have a prefix code that triggers the devices' state machines and gives access to the chain. From the software point of view, sending new jobs and getting results is just feeding a big buffer into SPI while simultaneously reading back the values where the answers will be. This can be done very efficiently in a single thread, even on slow ARM CPUs.
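The "one big buffer" idea can be sketched as follows. Everything about the framing here is hypothetical (the prefix bytes, the 16-byte job size, and the assumption that each chip's answer comes back at the same offset as its job; a real chain may shift answers by a fixed latency). The transfer itself is faked; on real hardware it would wrap a full-duplex driver call such as the Linux spidev interface.

```python
from typing import Callable, List

# All framing constants below are hypothetical, chosen only for illustration.
PREFIX = bytes([0x04, 0x00])  # hypothetical prefix code that arms the chain's state machines
JOB_BYTES = 16                # hypothetical job (and answer) size per chip

def build_tx(jobs: List[bytes]) -> bytes:
    """One big request buffer for the whole chain: prefix, then every chip's job."""
    if any(len(job) != JOB_BYTES for job in jobs):
        raise ValueError("every job must be JOB_BYTES long")
    return PREFIX + b"".join(jobs)

def split_rx(rx: bytes, n_chips: int) -> List[bytes]:
    """Answers are assumed to come back at the same offsets the jobs went out at."""
    body = rx[len(PREFIX):]
    return [body[i * JOB_BYTES:(i + 1) * JOB_BYTES] for i in range(n_chips)]

def fake_full_duplex(tx: bytes) -> bytes:
    """Stand-in for a real SPI transfer: one byte in for every byte out.
    Each simulated chip 'answers' with its job bytes inverted."""
    return bytes(len(PREFIX)) + bytes(b ^ 0xFF for b in tx[len(PREFIX):])

def exchange(jobs: List[bytes], transfer: Callable[[bytes], bytes]) -> List[bytes]:
    tx = build_tx(jobs)
    rx = transfer(tx)  # send and receive simultaneously
    if len(rx) != len(tx):  # SPI is full duplex: lengths must match
        raise IOError("short SPI transfer")
    return split_rx(rx, len(jobs))

answers = exchange([bytes([i]) * JOB_BYTES for i in range(3)], fake_full_duplex)
print([a[0] for a in answers])
```

With three simulated chips this prints [255, 254, 253]. No per-chip addressing appears anywhere: the position of a job in the buffer is its address, which is what keeps the host side down to one loop.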
My goal is a single ARM CPU per 1200-1500 chips, which is 3.6 - 4.5 TH/s. So the code has to be quite efficient to handle that. Requests are also double-buffered (while one job is being processed in the chip, the next one is already queued). With an ASIC, unlike an FPGA, processing a job would take only about 0.3 - 0.4 milliseconds. This means there should likely be at least one communication every 0.15 ms.
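A quick budget from those numbers shows why efficiency matters. The 3.0 GH/s per chip is inferred from 3.6 TH/s over 1200 chips (not a stated figure), and the per-chip CPU budget is an upper bound, since in practice part of each 0.15 ms window also goes to the SPI transfer itself.

```python
CHIPS_PER_CPU = (1200, 1500)
GHS_PER_CHIP = 3.0        # inferred from 3.6 TH/s / 1200 chips, not a stated figure
JOB_TIME_MS = (0.3, 0.4)  # ASIC job duration range from the post
POLL_MS = 0.15            # at least one chain exchange this often

def chain_ths(chips: int) -> float:
    """Aggregate hashrate served by one CPU, in TH/s."""
    return chips * GHS_PER_CHIP / 1000.0

def cpu_budget_ns_per_chip(chips: int) -> float:
    """Upper bound on CPU time per chip per exchange if one thread serves the chain."""
    return POLL_MS * 1e6 / chips

print(f"exchanges per second: {1000.0 / POLL_MS:.0f}")
print(f"exchanges per shortest job: {min(JOB_TIME_MS) / POLL_MS:.1f}")
for chips in CHIPS_PER_CPU:
    print(f"{chips} chips: {chain_ths(chips):.1f} TH/s, "
          f"{cpu_budget_ns_per_chip(chips):.0f} ns of CPU per chip per exchange")
```

That is about 6667 chain exchanges per second, two per shortest job, and only 100 to 125 ns of CPU time per chip per exchange: enough for copying a prepared job and checking an answer, but not for per-chip thread switches.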
The last time I tried to adapt cgminer for a similar purpose, on a much smaller task (24 Spartans), I had to create 48 threads for double-buffering. To me that seems like complete nonsense, and for 1200 chips it won't work (2400 threads).
Structurally, I plan for the code to look like an asynchronous state machine for I/O with bitcoind/pool, using a protocol like stratum or Luke's getblocktemplate. Second, job generation can be done quickly and synchronously from the template while building the request buffer for the chips. Then a separate thread for SPI I/O: prepare the request buffer, push it out over SPI while simultaneously reading back data, parse the answer buffer, then send updates to the chips and results to the network. I think the cgminer codebase is not well suited for that; a lot of redesign work would be required. However, cgminer's monitoring is nice compared to what I typically write :-)
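The structure described above can be sketched in a few dozen lines, in contrast to the two-threads-per-device layout from the cgminer attempt. This is a minimal, hypothetical Python sketch, not the planned miner: the asynchronous pool/bitcoind side is reduced to a thread that pushes work templates into a queue, the SPI transfer is faked, and the template and job layouts are invented for illustration. What it shows is the shape: one SPI thread services every chip, so the thread count does not grow with the chain.

```python
import queue
import threading
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Template:
    job_id: int
    header: bytes  # hypothetical: work data built from stratum/getblocktemplate

JOB_BYTES = 10  # hypothetical layout: 8-byte header + 2-byte chip index

def make_job(t: Template, chip: int) -> bytes:
    """Synchronous job generation: derive distinct per-chip work from one template."""
    return t.header + chip.to_bytes(2, "big")

def fake_transfer(tx: bytes) -> bytes:
    """Stand-in for the full-duplex SPI call; returns a same-length buffer."""
    return tx[::-1]

def newest(templates: "queue.Queue[Template]", current: Template) -> Template:
    """Drain the queue so the chips always get the most recent work."""
    while True:
        try:
            current = templates.get_nowait()
        except queue.Empty:
            return current

def spi_loop(templates, results, n_chips: int, rounds: int) -> None:
    """The single SPI thread: build buffer, exchange, parse, report."""
    current = templates.get()  # block until the network side delivers work
    for _ in range(rounds):
        current = newest(templates, current)
        tx = b"".join(make_job(current, c) for c in range(n_chips))
        rx = fake_transfer(tx)
        answers = [rx[i:i + JOB_BYTES] for i in range(0, len(rx), JOB_BYTES)]
        results.put((current.job_id, len(answers)))  # stand-in for reporting shares
    results.put(None)  # sentinel: loop finished

def network_side(templates, n_templates: int) -> None:
    """Stand-in for the asynchronous pool/bitcoind state machine."""
    for i in range(n_templates):
        templates.put(Template(i, bytes([i]) * 8))

def run(n_chips: int = 4, rounds: int = 5, n_templates: int = 3) -> List[Tuple[int, int]]:
    templates: "queue.Queue[Template]" = queue.Queue()
    results: "queue.Queue[Optional[Tuple[int, int]]]" = queue.Queue()
    net = threading.Thread(target=network_side, args=(templates, n_templates))
    spi = threading.Thread(target=spi_loop, args=(templates, results, n_chips, rounds))
    net.start()
    spi.start()
    net.join()
    spi.join()
    out = []
    while (item := results.get()) is not None:
        out.append(item)
    return out

print(run())
```

In a real driver, `fake_transfer` would be replaced by the platform's full-duplex SPI call and the answer parsing by the chip's actual result format, with the stratum or getblocktemplate client running as the asynchronous half.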
PS.
To the BFL trolls here: I first thought about trolling you hard, but then... eh... I don't have that much time for this fun... Let me finish and get to tape-out, then I would gladly troll you in my spare time :-) it's so fun :-) For now, better troll BFL to release chips :-) I have already put money on a bet against it :-) If you're so BFL-oriented, then put 'Yes' for BFL; it would be better than trolling :-))))
http://bitbet.us/bet/7/bfl-will-deliver-asic-devices-before-march-1st/
Hope you also understood how the W / GH/s metric works :-)))))) And how voltages work :-))) And that these claims that they need BGA packages because they can't fit the power requirements into QFN actually look like complete nonsense :-)))) It is actually an engineering choice: if you have the same chip, you choose which W / GH/s to run it at.... However, they may be too greedy to downrate the chips and make them more stable...