Are you saying there are no more gains to be had by optimising datapaths, more efficient implementation and/or scaling up the number of hasher units?
And presumably you're saying that every time NVidia or ATI bring out a faster GPU that pushes faster math or higher polygon counts, they have resorted to analogue optimisation to make it happen? I really doubt that many people resort to analogue ASIC design... it just takes too long, and the result isn't always predictable nor accurately simulatable. Bitfury is a very notable exception (and not the rule)... but even bitfury expected a faster chip than they ended up with (the labels on the chip say 5GH, which is at least double what they ended up with). I'm not knocking the bitfury chip nor its designer, who I think is awesome! I'm just saying that expecting every ASIC designer to go into the analogue domain is unlikely and impractical. There's plenty that can be done in the digital domain.
I'm not going to fight with your strawmen about Nvidia/ATI. Just quoting it for future reference.
There's no need to bullshit here about "optimising datapaths". SHA-256 is basically just a pair of 32-bit-wide shift registers with some combinatorial logic thrown into the feedback loops. The cryptographers at NIST/NSA/etc. worked really hard to make sure that this logic is not minimisable in any meaningful way, because that would make it susceptible to cryptanalysis. Any "architectural" tricks would have already been exploited by the cryptanalysts. There isn't any way to optimize power by, e.g., not clocking parts of the circuit when they aren't in productive use, which is where most modern CPUs and GPUs save power. So please, no further low-power bullshit unless you can tell us how your low-power strategy applies to a circuit with a 50% signal toggle probability. Nobody is going to run a pocket bitmine on battery power.
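For the software folks, here is roughly what that "pair of shift registers with feedback" looks like. This is a minimal C sketch of one FIPS 180-4 compression round (the helper names are mine, not taken from any particular implementation): every register simply shifts down one slot per round, and all of the work sits in the non-linear Ch/Maj/Sigma feedback, which is exactly the part that cannot be minimised away.

```c
#include <stdint.h>

/* One SHA-256 compression round, per FIPS 180-4. The eight working
 * variables a..h behave like a shift register; the only "logic" is the
 * non-linear feedback computed from a and e. */
#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

void sha256_round(uint32_t s[8], uint32_t kt, uint32_t wt)
{
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
    uint32_t e = s[4], f = s[5], g = s[6], h = s[7];

    /* non-linear feedback terms */
    uint32_t S1  = ROTR(e, 6) ^ ROTR(e, 11) ^ ROTR(e, 25);
    uint32_t ch  = (e & f) ^ (~e & g);
    uint32_t t1  = h + S1 + ch + kt + wt;
    uint32_t S0  = ROTR(a, 2) ^ ROTR(a, 13) ^ ROTR(a, 22);
    uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
    uint32_t t2  = S0 + maj;

    /* the "shift register": every variable moves down one slot per round */
    s[7] = g; s[6] = f; s[5] = e; s[4] = d + t1;
    s[3] = c; s[2] = b; s[1] = a; s[0] = t1 + t2;
}
```

Unroll that 64 times, bolt on the message schedule (the second shift register) and the final feed-forward addition, and that is essentially the entire hasher; there is no hidden architecture to get clever about.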
There are some implementation tricks that minimize the critical timing paths, but they have all been published in the open literature, or at most sit behind the ACM/IEEE paywalls. Other designers have already taken advantage of them; bitfury even sort-of republished the information from behind the paywall.
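One commonly published example of those tricks is to keep the wide additions in redundant carry-save form, so that only a single carry-propagate adder sits on the critical path. A rough C model of the idea (the function names are mine; in silicon these are rows of 3:2 compressors, not function calls):

```c
#include <stdint.h>

/* Carry-save addition: three operands are reduced to a sum/carry pair with
 * no carry propagation at all; only the final add needs a real
 * carry-propagate adder, which is what sets the critical path. */
static void csa(uint32_t x, uint32_t y, uint32_t z,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = x ^ y ^ z;                           /* per-bit sum              */
    *carry = ((x & y) | (x & z) | (y & z)) << 1;  /* per-bit carries, shifted */
}

/* t1 = h + S1 + ch + Kt + Wt computed with one carry-propagate add
 * instead of four chained ones. */
static uint32_t add5_csa(uint32_t h, uint32_t s1, uint32_t ch,
                         uint32_t kt, uint32_t wt)
{
    uint32_t s, c;
    csa(h, s1, ch, &s, &c);
    csa(s, c, kt, &s, &c);
    csa(s, c, wt, &s, &c);
    return s + c;                                 /* the only slow adder      */
}
```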
Anyone who has implemented SHA-256, even in software, knows that it is self-testing logic, so please, no further bullshit about design for testability.
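To make the "self-testing" point concrete: a mining chip computes exactly the same function as any software SHA-256, and a single stuck-at fault corrupts essentially every output, so comparing a handful of headers against a software reference already exercises the whole datapath. A sketch of such a check, where asic_hash() is a hypothetical stand-in for whatever interface a real chip exposes (the OpenSSL calls are real):

```c
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>                  /* software reference for comparison */

/* Hypothetical driver call: push an 80-byte block header through the chip
 * and read back the 32-byte double-SHA-256 result. */
extern void asic_hash(const uint8_t header[80], uint8_t out[32]);

/* Returns 1 if the hasher agrees with the software reference. */
int self_test(const uint8_t header[80])
{
    uint8_t mid[32], want[32], got[32];

    SHA256(header, 80, mid);              /* Bitcoin's double SHA-256, */
    SHA256(mid, 32, want);                /* done in plain software    */

    asic_hash(header, got);               /* same header through the chip */
    return memcmp(want, got, 32) == 0;
}
```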
The last remaining avenue for significant gains is to somehow exploit the high toggle rate in the circuit. The gates and flip-flops change state at a rate very close to the maximum possible, which normally happens only in a test structure called a ring oscillator. My personal bet is that the progress will come from designers who take advantage of that and, instead of using bang-bang static logic, use something out of the ordinary: maybe some relatively obscure dynamic logic, or some low-noise current-mode logic (a.k.a. source-coupled logic). Or a combination of the above: DyCML. Or something which I haven't even heard of.
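That toggle rate is easy to check in software. Reusing the sha256_round() sketch from above, the little program below counts how many register bits flip per round when fed random message words; it settles very close to 0.5, i.e. ring-oscillator-like activity, and that is before counting the combinatorial nodes inside the round, which a gate-level simulator would add on top.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

void sha256_round(uint32_t s[8], uint32_t kt, uint32_t wt);  /* sketch above */

int main(void)
{
    uint32_t s[8], prev[8];
    uint64_t flips = 0, bits = 0;

    for (int i = 0; i < 8; i++)
        s[i] = (uint32_t)rand();                   /* arbitrary start state */

    for (int r = 0; r < 1000000; r++) {
        for (int i = 0; i < 8; i++)
            prev[i] = s[i];
        sha256_round(s, (uint32_t)rand(), (uint32_t)rand());  /* random Kt, Wt */
        for (int i = 0; i < 8; i++) {
            /* __builtin_popcount is a GCC/Clang builtin */
            flips += (uint64_t)__builtin_popcount(prev[i] ^ s[i]);
            bits  += 32;
        }
    }
    printf("average toggle probability per register bit: %.3f\n",
           (double)flips / (double)bits);
    return 0;
}
```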
Anyway, if any of you are going to visit Cointerra: ask them about their BSIM4 models and which of the 5 process corners they've simulated thus far, and pay close attention to whether they have anyone working in analog simulators to optimize the metal thickness and the transistor/gate geometry to facilitate the best power-noise bypass.
Bitfury's 5GH chip currently hashes slower primarily because nobody has spent any time on the problem of heat-sinking the QFN package. Also, QFN is wire-bonded, which is basically a collection of half-turn induction coils. Those require very careful analog resonant pin/pad connection design, which again nobody has spent time on.
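To put a number on why those half-turn coils matter, here is a back-of-the-envelope LC tank estimate. The values are assumptions of mine (roughly 1 nH per mm of bond wire, a guess at the decoupling capacitance), not anything taken from bitfury's package:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double L = 2e-9;   /* ~2 mm of bond wire on a supply pin, henries (assumed) */
    double C = 10e-9;  /* on-die + on-package decoupling, farads (assumed)      */

    /* classic LC resonance: f = 1 / (2*pi*sqrt(L*C)) */
    double f = 1.0 / (2.0 * M_PI * sqrt(L * C));
    printf("supply resonance around %.0f MHz\n", f / 1e6);   /* ~36 MHz here */
    return 0;
}
```

If that resonance sits anywhere near the clock or its strong harmonics, the supply rings exactly when the hashers draw their biggest current spikes, which is why the pin/pad network has to be treated as the resonant analog structure it really is.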
The way I see it, the progress will come from another company, which has designers experienced in analog/mixed-signal ICs and who did high-power designs like cellular tower radios or synthetic aperture radars.
Edits: spelling and underlining.