SHA256d IC design question | Bitcointalksearch.org

2112

legendary

Activity: 2128

Merit: 1074

Quote from: NODEhaven on March 28, 2018, 09:51:12 AM

This one is interesting. Not sure if their exact approach applies, but this 2017 paper shows power and efficiency improvements on the order of 300% for a modelled cryptographic implementation using asynchronous clocks.

https://www.sciencedirect.com/science/article/pii/S2090123217301170

This paper is about a way of implementing GALS (Globally Asynchronous Locally Synchronous) logic. IMO it won't help at all for SHA256D. It could probably work for something like scrypt() which is a sandwich of two layers of PBKDF2 with Salsa20 in the middle. Fixed length SHA-256 like in Bitcoin is too trivial to be susceptible to such advanced optimizations.

NODEhaven

jr. member

Activity: 58

Merit: 12

Quote from: 2112 on March 27, 2018, 12:17:55 PM

Can you make an educated speculation as to what would be the underlying business strategy?

What is the technical merit of multiplying the complexity of the engine several times in exchange for the gains lower than the regular manufacturing tolerances?

Lots of ASICs get designed purely for non-technical reasons: copy protection, hiding of patent or license violation in a way that is extremely hard to reverse-engineer and litigate, etc.

I think there should be some constructive speculation that you could post without violating your NDAs, don't you think? Or maybe everyone at your company already knows that HyperMega is a pseudonym of their 1st VP of Sales, and everyone there already watches your back?

I think the answer is more simple. The originators of the project, if not engineers, may have seen any "easy" answer with ASICboost. If the BPDL was always planned by Halong then that also answers the question as it requires everyone using ASICboost to release all patents and purchase rights for any IP that is licensed from third-parties for everyone in the BPDL.

https://blockchaindpl.org/licensev10

As far as optimizations of layout go, am looking at some different methodologies.

This one is interesting. Not sure if their exact approach applies, but this 2017 paper shows power and efficiency improvements on the order of 300% for a modelled cryptographic implementation using asynchronous clocks.

https://www.sciencedirect.com/science/article/pii/S2090123217301170

2112

legendary

Activity: 2128

Merit: 1074

Quote from: HyperMega on March 27, 2018, 10:43:39 AM

A professional marketing guy? No, I’m not that kind of professional.

But you are some sort of an insider, with access to the information about the most current tools and processes. Why don't you use to produce some useful information, even if you don't have funding for the fully separate mask set production?

Why don't you actually run some representative simulations, even some reduced-rounds non-compatible version of SHA-256?

Why can't you squeeze a single Bitcoin mining engine into paid-for but left unused free space on some unrelated project taping out? The same way that Heveticoin did, and in the spirit of the long history of placing various more-or-less useable Easter Eggs into established silicon products?

Quote from: HyperMega on March 27, 2018, 10:43:39 AM

It would even be possible to do a complete power shut-off of the unused backup logic in ASICboost mode, to avoid the leakage of these logic parts. But it still consumes silicon area, which increases your production costs in terms of $/GH.

Halong has chosen a very aggressive way to implement ASICboost without any backup logic for a non-ASICboost mode. In this way they have enabled the full potential of ASICboost in terms of J/GH and $/GH.

Can you make an educated speculation as to what would be the underlying business strategy?

What is the technical merit of multiplying the complexity of the engine several times in exchange for the gains lower than the regular manufacturing tolerances?

Lots of ASICs get designed purely for non-technical reasons: copy protection, hiding of patent or license violation in a way that is extremely hard to reverse-engineer and litigate, etc.

I think there should be some constructive speculation that you could post without violating your NDAs, don't you think? Or maybe everyone at your company already knows that HyperMega is a pseudonym of their 1st VP of Sales, and everyone there already watches your back?

HyperMega

full member

Activity: 129

Merit: 100

Quote from: 2112 on March 26, 2018, 03:13:09 PM

It is quite an achievement in marketing to squeeze 3-way deception into a single sentence. You must be a professional.

Finally, whatever one can say about Bitmain's chip that is ASIC-boost capable, at least it is somewhat honest in implementing switchable levels of ASIC-boost. One could actually measure the actual gains or loses from various levels of boosting and compare them with the table of theoretical values. It isn't as perfect an experiment as designing separate chips for each level of boosting, but a better scientific compromise.

All I can guess about Halong's chip is that it's design was worked out as some sort of political compromise or attack/defense strategy. I'm definitely not up to speed on the factions currently involved in the Bitcoin internecine warfare.

A professional marketing guy? No, I’m not that kind of professional.

It always takes me a while to extract the useful information from your posts, but believe me, I finally agree with you, sometimes.

Yes, having the ability to switch ASICboost on/off (as Bitmain did), give you a chance to compare the two modes. It would even be possible to do a complete power shut-off of the unused backup logic in ASICboost mode, to avoid the leakage of these logic parts. But it still consumes silicon area, which increases your production costs in terms of $/GH.

Halong has chosen a very aggressive way to implement ASICboost without any backup logic for a non-ASICboost mode. In this way they have enabled the full potential of ASICboost in terms of J/GH and $/GH. I wouldn’t dare something like that, without the support of parts of the community (e.g. Slush). The risk of falling down to only 25% of the maximum performance would be much to high, in case no pool would support rolling versions.
So yes, I agree, it was “a sort of political compromise or attack/defense strategy”.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: HyperMega on March 26, 2018, 02:15:38 PM

These numbers are not based on completely ideal assumptions. They are based on the fact that the part of the pipeline, which outputs could be reused by other cores, counts for about 25% of the overall core logic of a single core.

Ok, you are right, the FO/load cap of the reused bits is increased by feeding multiple cores. But the reused outputs are only 32 bits in contrast to a 512 bit wide pipeline without increased FO, implemented only once.

I haven't read the full patent application, but I understand how they are written with a goal of withstanding claim/counter-claim adversarial legal system in the USA and other anglophone countries. So I can confidently repeat: you are wrong, these numbers intentionally use idealized, abstract algebraic models to make a strong patent application. The whitepaper is just a marketing brief for the patent. This isn't a scientific report in the applied science field.

In the next paragraph you use the term "512-bit wide pipeline". This is just such a nice marketing speak. SHA256 is actually a 16-stage 32-bit wide shift register with some fancy feedback terms. The re-invention of it as 16*32=512 bit vector pipeline is nothing more than a workaround in for the bugs/design flaws in the front-end Verilog tools used preferably in the West Coast of the USA. If the design was done in VHDL (as preferred by East Coast USA boutiques) there would be no need for that trick of making 32-bit slices out of 512-bit vector.

No matter which front-end was used the actual physical layout is very far from the neatness associated with the word "pipeline" and how e.g. AMD/Intel use it in theirs marketing literature and die photos.

The physical layout of such designed unrolled mining engine very much resembles the snake pit like one used in my avatar. That happens because the heuristic layout optimization tools cannot find any useful gradient to optimize for, fail to converge or converge extremely slowly resulting with semi-random rats nest of long traces.

Quote from: HyperMega on March 26, 2018, 02:15:38 PM

So the gain of an ASICboost duo-core in terms of power efficiency will be a bit less than 12.5%, but not much.

I cut this paragraph into a separate quote because it is a beautiful sample of USDA prime marketing baloney.

Firstly duo-core was just a sample on the whitepaper, the Halong's implementation is quad-core. So it is 18.75% not 12.5%.

Secondly, you use values of bit much less than 2. Such a nice English creative writing trick. How do you values of "bit" compare with manufacturing tolerances which are about +/-20%?

Thirdly, it not about just (A) reduction of power use. You neglected to mention:

B) lower clock speed due to need to keep nearly four times larger area that needs to be kept in lockstep;
C) lower yield because the area of mutually dependent logic is increased nearly four-fold.

It is quite an achievement in marketing to squeeze 3-way deception into a single sentence. You must be a professional.

Finally, whatever one can say about Bitmain's chip that is ASIC-boost capable, at least it is somewhat honest in implementing switchable levels of ASIC-boost. One could actually measure the actual gains or loses from various levels of boosting and compare them with the table of theoretical values. It isn't as perfect an experiment as designing separate chips for each level of boosting, but a better scientific compromise.

All I can guess about Halong's chip is that it's design was worked out as some sort of political compromise or attack/defense strategy. I'm definitely not up to speed on the factions currently involved in the Bitcoin internecine warfare.

HyperMega

full member

Activity: 129

Merit: 100

Quote from: 2112 on March 26, 2018, 12:27:02 PM

All the numbers in that paper are theoretical values assuming infinite speed of light and counting of ideal logic gates with no parasitic impedances, infinite input impedance and zero output impedance.

That has no bearing on any actual implementation in any realistic logic circuit technology. In particular even non-ASIC-boosted but unrolled SHA256 has same values used in 16 different places. This implies https://en.wikipedia.org/wiki/Fan-out of 16 when nearly all CMOS processes are optimized for fan-out of 4 https://en.wikipedia.org/wiki/FO4 .

The FO4 argument probably explains why that chip is built with fixed 4-way ASICboost.

These numbers are not based on completely ideal assumptions. They are based on the fact that the part of the pipeline, which outputs could be reused by other cores, counts for about 25% of the overall core logic of a single core.

Ok, you are right, the FO/load cap of the reused bits is increased by feeding multiple cores. But the reused outputs are only 32 bits in contrast to a 512 bit wide pipeline without increased FO, implemented only once.

So the gain of an ASICboost duo-core in terms of power efficiency will be a bit less than 12.5%, but not much.

Moderator's note: This post was edited by frodocooper to remove a nested quote.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: HyperMega on March 26, 2018, 10:39:46 AM

Please have a look at page 8 of the original ASICboost white paper:
https://arxiv.org/ftp/arxiv/papers/1604/1604.00575.pdf

Ck said in another thread, that the Halong miner is at 25% of its performance in a non-ASICboost mode. Because of that I would assume, that they implemented a Quad-Core, which requires about 18.75% less silicon area (leakage power)/logic toggling (dynamic power) compared to 4 non-ASICboost cores.

All the numbers in that paper are theoretical values assuming infinite speed of light and counting of ideal logic gates with no parasitic impedances, infinite input impedance and zero output impedance.

That has no bearing on any actual implementation in any realistic logic circuit technology. In particular even non-ASIC-boosted but unrolled SHA256 has same values used in 16 different places. This implies https://en.wikipedia.org/wiki/Fan-out of 16 when nearly all CMOS processes are optimized for fan-out of 4 https://en.wikipedia.org/wiki/FO4 .

The FO4 argument probably explains why that chip is built with fixed 4-way ASICboost.

HyperMega

full member

Activity: 129

Merit: 100

Quote from: NODEhaven on March 20, 2018, 09:01:35 PM

How does the overt ASICboost that Halong is implementing effect the logic on the chip?

Please have a look at page 8 of the original ASICboost white paper:
https://arxiv.org/ftp/arxiv/papers/1604/1604.00575.pdf

There is a Duo-Core ASICboost implementation shown. In case you would operate such a Duo-Core in a non-ASICboost mode (at the same clock frequency), you would run at 50% of the ASICboost performance, because only one of the two cores can operate in non-ASICboost mode.

Ck said in another thread, that the Halong miner is at 25% of its performance in a non-ASICboost mode. Because of that I would assume, that they implemented a Quad-Core, which requires about 18.75% less silicon area (leakage power)/logic toggling (dynamic power) compared to 4 non-ASICboost cores.

Moderator's note: This post was edited by frodocooper to trim the quote from NODEhaven.

NODEhaven

jr. member

Activity: 58

Merit: 12

In a rather striking coincidence, I may not be the only one looking at this exact same sensor at 7nm. I contacted Moortec to learn a little moor about their sensor.

http://www.moortec.com/blog/2018/03/05/moortec-providers-of-in-chip-monitoring-pvt-subsystems-solutions-are-pleased-to-announce-that-canaan-creative-have-employed-moortecs-in-chip-monitoring-subsystem-their-hpc-ic

http://www.moortec.com/blog/tag/7nm

Moderator's note: This post was edited by frodocooper to correct erroneous URL formatting.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: NODEhaven on March 24, 2018, 02:10:53 PM

Right now I looking into on-chip temperature sensors and voltage regulation to use in a feedback loop that may require outside IP if feasible which would require a license. Those licenses may forgo the ability to use the non-profit approach.

Something I pulled up after a quick search. It has digital output. Not sure if that is an issue and how to calibrate it.
https://www.design-reuse.com/sip/temperature-sensor-series-6-with-digital-output-tsmc-7nm-ff-high-accuracy-thermal-sensing-for-reliability-and-optimisation-ip-43229/?login=1

Don't make a mistake of putting nontrivial control logic onto the same chip as the mining circuitry. In case of failure you won't be able to distinguish between the real fault or bogus fault induced by the noise and/or heat from mining logic. By definition the mining logic has to work at the edge of starvation or hyperthermia death, otherwise it is operating far from optimal.

Helveticoin did something like you are thinking (including an on-die ARM controller) and it was completely non-competitive. It had to be severely underclocked to maintain the reliability of the controlling SoC.

Spondoolies included on-die power-on-self-test and then had to create software workarounds for mining engines that fail the POST but operate correctly after a warm-up. Some desperadoes resorted to preheating their miners with a hair dryer.

You'll be much better off with just temperature-sensing diodes or averaging multiple low-accuracy temperature sensors located in far-away corners of the die.

NODEhaven

jr. member

Activity: 58

Merit: 12

Quote from: 2112 on March 16, 2018, 11:17:53 AM

"sandbagging" means that they used quite large factors of safety in their design ( https://en.wikipedia.org/wiki/Factor_of_safety describes is for mechanical/structural designs ). E.g. if the design tool came up with N um wide power rail they actually drawn the power rail as S*N where S > 1 . If their simulation computed that the maximum clock speed will be F MHz, they used D*F (where D < 1) in their published specification.

One of their executives enumerated their multiple layers of safety margins in the video they published upon initial release of their miners. Maybe somebody archived it somewhere in the KnC thread?

Europractice access is limited to educational/research/non-profit institutions. KnC from the beginning was a funded for-profit corporation. On the other hand Bitfury (person) initially developed his chip with cooperation from some Polish research institute before funding the Bitfury (corporation).

I keep mentioning Europractice/Mosis in the thread like this because it is an obvious and effective way of saving money in the initial stages of a design. Lots of folks keep mentioning multi-million dollar initial costs of developing the mining ASICs. But this is quite obviously not true if somebody knows how to use the educational discounts and how to deal with associated limitations on merchantability.

I will check the KncMiner thread and post a link if I can find it.

Also, That's pretty genius of Bitfury. I know of a few professors at University of Houston that are interested in developing some FPGAs for crypto-currency. Also, in college I took full advantage of those type of licenses.

Right now I looking into on-chip temperature sensors and voltage regulation to use in a feedback loop that may require outside IP if feasible which would require a license. Those licenses may forgo the ability to use the non-profit approach.

Something I pulled up after a quick search. It has digital output. Not sure if that is an issue and how to calibrate it.
https://www.design-reuse.com/sip/temperature-sensor-series-6-with-digital-output-tsmc-7nm-ff-high-accuracy-thermal-sensing-for-reliability-and-optimisation-ip-43229/?login=1

Moderator's note: This post was edited by frodocooper to remove a nested quote.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: NODEhaven on March 20, 2018, 09:01:35 PM

How does the overt ASICboost that Halong is implementing effect the logic on the chip?

I don't think that there's any non-bullshit information available publicly about Halong chips, so I'll refrain from making comments.

NODEhaven

jr. member

Activity: 58

Merit: 12

Quote from: 2112 on January 10, 2018, 03:11:53 PM

...

How does the overt ASICboost that Halong is implementing effect the logic on the chip?

Moderator's note: This post was edited by frodocooper to trim the quote from 2112.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: NODEhaven on March 09, 2018, 06:46:27 PM

This is by far one of the better threads I have come across on Bitcointalk.

If its not too much, could you describe a little on how KnC "sandbagged" the design and why didn't they use Europractice?

"sandbagging" means that they used quite large factors of safety in their design ( https://en.wikipedia.org/wiki/Factor_of_safety describes is for mechanical/structural designs ). E.g. if the design tool came up with N um wide power rail they actually drawn the power rail as S*N where S > 1 . If their simulation computed that the maximum clock speed will be F MHz, they used D*F (where D < 1) in their published specification.

One of their executives enumerated their multiple layers of safety margins in the video they published upon initial release of their miners. Maybe somebody archived it somewhere in the KnC thread?

Europractice access is limited to educational/research/non-profit institutions. KnC from the beginning was a funded for-profit corporation. On the other hand Bitfury (person) initially developed his chip with cooperation from some Polish research institute before funding the Bitfury (corporation).

I keep mentioning Europractice/Mosis in the thread like this because it is an obvious and effective way of saving money in the initial stages of a design. Lots of folks keep mentioning multi-million dollar initial costs of developing the mining ASICs. But this is quite obviously not true if somebody knows how to use the educational discounts and how to deal with associated limitations on merchantability.

NODEhaven

jr. member

Activity: 58

Merit: 12

Quote from: 2112 on January 10, 2018, 03:31:13 PM

Quote from: the_electronrancher on January 07, 2018, 02:58:07 PM

Your idea about starting at a larger node is a good one, you would certainly want to debug on a cheap process.

There's nothing to debug at the transistor level that is process-independent. In fact, even the transistor model changed from BSIM3 to BSIM4-family when you move from cheap to expensive processes.

The general topology of the models is already well known and open sourced:

http://bsim.berkeley.edu/models/

What is secret? The parameter values of those models. And even if you use MOSIS/Europractice or similar program you won't be able to publish those secret values. Without those you can't optimize in any sensible way beyond "sandbag the hell out of it and keep your fingers crossed". KnC did that already.

This is by far one of the better threads I have come across on Bitcointalk.

If its not too much, could you describe a little on how KnC "sandbagged" the design and why didn't they use Europractice?

QuintLeo

legendary

Activity: 1498

Merit: 1030

Quote from: Entropy-uc on January 05, 2018, 12:57:37 AM

Global Foundries operates on a standard contract fab model so it's not really surprising that they built the BFL devices.

Partly true - they do have some extensive contracts with IBM and AMD dating back to the "fab spinoff" days and amended/updated every so often that lock up a lot of their capacity if IBM or AMD wants that capacity.

The contract fab model applies to whatever is "left over".

2112

legendary

Activity: 2128

Merit: 1074

Quote from: the_electronrancher on January 07, 2018, 02:58:07 PM

Your idea about starting at a larger node is a good one, you would certainly want to debug on a cheap process.

There's nothing to debug at the transistor level that is process-independent. In fact, even the transistor model changed from BSIM3 to BSIM4-family when you move from cheap to expensive processes.

The general topology of the models is already well known and open sourced:

http://bsim.berkeley.edu/models/

What is secret? The parameter values of those models. And even if you use MOSIS/Europractice or similar program you won't be able to publish those secret values. Without those you can't optimize in any sensible way beyond "sandbag the hell out of it and keep your fingers crossed". KnC did that already.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: the_electronrancher on January 05, 2018, 01:47:44 PM

I'd like to learn a little more about this transistor level implementation, I'm having a hard time picturing what could reasonably be exploded or minimized in the hash core. Xor? It's just flops and wiring otherwise, I would be surprised if the flop was exploded, but maybe - if you have any links to check out, it would be an interesting read.

Here's the example of what can be optimized with the transistor-level knowledge.

SHA-256 has 64 rounds that when unrolled have values that once computed have to be used in 16 different places (fanout of 16). For this example lets simplify and assume that there are only 2 inputs and 6 outputs.

Code:

a<=
b<=
c<=
d<=
e<=
f<= x + y;

This can be optimized to:

Code:

a<=
b<=
c<= x + y;
d<=
e<=
f<= x + y;

The optimization is that the same value is computed twice, but in different physical locations on a die and the signal needs shorter routes from the source to the destination. Here's more in-depth explanation:

https://en.wikipedia.org/wiki/FO4

Note that the above optimization is the opposite of the ASICBOOST "optimization".

We know that recent Bitmain chips are have capability to work both in regular way and boosted with ASICBOOST (with theoretical maximum of about 25% savings). We also know that when used in the boosted configuration they need to be clocked much lower and have lower overall performance (and probably lower yield of chips that could work in boosted modes).

If Bitmain was capable of accurately simulating their chips they wouldn't waste their resources on that exercise because 25% is lower than the normal manufacturing tolerances on the process nodes they were using. Transistor-level simulation is nowadays more accurate than the manufacturing variance and one could actually simulate the performance at the various process corners.

From the above we can deduct that they don't have any sort of transistor-level design, they just use standard cells and sandbagging the design with wide safety margins. That is the same thing that KnCminer did years ago.

The other possibility is that Bitmain did implement their chips dual-capable (both boosted and un-boosted) for some non-technical, political or personal reasons. But that would mean that their chips are even less optimized than they could be without wasting space on the unused boosting logic.

2112

legendary

Activity: 2128

Merit: 1074

Quote from: Entropy-uc on January 05, 2018, 12:57:37 AM

The problem is you won't find a design house willing to work that way. They have their tool sets and their work flows and they aren't going to diverge from it. So you will need to buy your own set of design tools and find a team of borderline Asperger's cases to do the transistor design.

This is where I disagree, people are just looking for a wrong kind of design house. They need to look for designers experienced and interested in mixed-signal and power-electronics designs. This is significantly different than the predominant industry practice in digital logic design.

Technically there are 3 main points where Bitcoin mining chip differs from the typical modern digital IC:

1) SHA-256D is practically a fully self-testing circuit
2) SHA-256D has very high signal toggle rate (0.5, only 3dB below the theoretical maximum of 1.0 for a ring oscillator)
3) there are practically no external design requirements (like timing closure) the chip is 100% limited to either thermal/power (when over-clocking) or self-switching-noise (when under-volting)

The standard tool chains used in digital logic design fail to produce efficient designs with the above requirements:

1) heuristic layout optimization algorithms fail to converge on a design where each bit of the output depends on each bit of input, so the designers force round unrolling to achieve convergence
2) the methodology is mostly designed for timing-closure or test-driven-design, when this is completely not an problem here
3) the approximations made by the toolchains are very inaccurate in the interesting problem space (very high toggle rate and no timing demand whatsoever)

The end result is that standard tools produce designs that are way too conservative in terms of individual reliability of gates and flip-flops: they are way too reliable at the local-logic level and then trade this off for noise tolerance on the very long and high fan-out interconnections.

Quote from: Entropy-uc on January 05, 2018, 12:57:37 AM

Somebody should really fund a Professor to do the design work under an open hardware license. One the transistor design for SHA256 is done you just have to bring that into the fab's design tools and optimize for placement. Conductor losses are becoming dominant at these process nodes so that is where the biggest optimizations will be found.

Well, I haven't spoken with a Professor recently, but I did in the past. From that experience I can surmise that work on a Bitcoin miner could be a career-limiting move for a scientist in the current prevailing climate at the engineering schools.

If you are going to ask around here are the two good questions to ask:

1) why nobody considers plain old serial adders/subtractors for the most common operation in SHA-256D? One can add two 32 bit numbers with a few XOR gates and a D flip-flop with absolute minimum of power spent. It will just take 32 clocks, but who cares? Why such an obsession with parallel adders and complex carry-look-ahead logic when there's no real timing constraint?

2) why nobody considers the old trick of using differential logic (like ECL vs TTL) when the signal toggle rate is at 50%? The complementary logic gives great power saving provided that the toggle rate is much less, the closer to zero the better. This is of
no benefit here whatsoever.

So if you guys are looking for either commercial design houses or semiconductor design professors avoid the mainstream. In addition to the two categories above I could also suggest asking about past experience with GaAs or other exotic processes, which exercised less explored corners of the design mind-scape.

Remember, BitFury's original design may have been done on a kitchen table or in a garage, but they did not unroll and decisively beat all the experienced CAD-monkeys despite using much older fabrication process.

majlkcze

newbie

Activity: 25

Merit: 0

Interesting topic guys. I´m still thinking about things you are talking about there, and this should be possible.
I went through the whole process from FPGA to ASIC as a microelectronics student, from design to fabrication, yes I was personally in the clean room holding the wafers.

The saddest thing is that it was 5+ years back and the crypto was not so well known and I had no clue what can be done in this area.

Now I´m still on the same faculty as a PhD. student, I think I can pull some triggers and be helpful in this area.
If a group of members decide to try something, count me in.

Moderator's note: This post was edited by frodocooper to remove an unnecessary quote.

Topic: SHA256d IC design question (Read 1061 times)