I think you are vastly underestimating the potential of ASICs; a 130nm ASIC is ~40x as energy efficient as current 65nm FPGAs for SHA-256 hashing, according to papers I've seen and linked. A 40nm ASIC would therefore be closer to 100x more efficient than an FPGA.
From your link http://rijndael.ece.vt.edu/sha3/publications/DSD11SHA3.pdf I compute 400 Mhash/J for a 130nm ASIC. So probably ~1000 Mhash/J at 40nm. This is 50x better than a 45nm FPGA (Spartan6 = 20 Mhash/J).
But this still does not change my mind: the first mining ASICs will likely be manufactured on the 130nm node, so their 400 Mhash/J figure would give them a 20x efficiency advantage over 45nm FPGAs. Not much different from past 10x technological leaps.
The 400 Mhash/J figure was based on a chip that was designed to let all the SHA-3 candidate algorithms run at their optimal clock rates. It shouldn't be considered the pinnacle of SHA-256 performance, since including SHA-256 wasn't the main purpose of the chip. It should be taken as the bare minimum of what an underclocked and unoptimized design can achieve, and as you point out, that is already 20x.
Take a look at figure 9. Notice the line for SHA-256 is nearly vertical. This means the clock could be ramped up significantly without increasing the transistor count. The chip as evaluated ran at 200 MHz, but you can see that 300 MHz requires an insignificant increase in kGE. Now 300 MHz is already 50% higher performance, yet even that shouldn't be taken as a maximum. The "sweet spot" is likely 600 MHz or higher, maybe even 1 GHz. So why didn't they test it at 1 GHz? That wasn't the point of the chip. The clock speed chosen was a compromise meant to neither hinder nor help any of the SHA-3 candidates.
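To make the clock scaling concrete, here is a toy Python sketch of how throughput scales with frequency for a fixed core; the one-hash-per-cycle assumption (a fully pipelined engine) is mine, not the paper's:

```python
# Toy model: a fixed hashing core retires one hash every `cycles_per_hash` cycles,
# so throughput scales linearly with clock frequency (area held constant).
def mhash_per_s(clock_mhz: float, cycles_per_hash: int = 1) -> float:
    return clock_mhz / cycles_per_hash

for clock_mhz in (200, 300, 600, 1000):
    print(f"{clock_mhz} MHz -> {mhash_per_s(clock_mhz):.0f} Mhash/s per core")
# 200 -> 300 MHz is the ~50% gain mentioned above; 1 GHz would be 5x the evaluated rate.
```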
Then you have to consider that the algorithm used was inefficient from Bitcoin's perspective. It is a streaming hash design where the output of one block is combined with the input of the next block. Bitcoin is a static, single-block operation, which significantly cuts the resources required. No communication lines back to the input are needed, because it is simply data in -> data out, then new data in -> new data out (see the sketch below).
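To illustrate that fixed-input, one-shot structure, here is a minimal Python sketch of the double SHA-256 that Bitcoin applies to its 80-byte block header; the field values below are placeholders, and a real miner would implement this datapath in silicon rather than call a library:

```python
import hashlib
import struct

def double_sha256(data: bytes) -> bytes:
    """Bitcoin's proof-of-work hash: SHA-256 applied twice to a fixed 80-byte header."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

# Illustrative 80-byte header: version, previous block hash, merkle root,
# timestamp, difficulty bits, nonce (all values here are placeholders).
header = struct.pack(
    "<L32s32sLLL",
    2,                 # version
    b"\x00" * 32,      # previous block hash
    b"\x00" * 32,      # merkle root
    1355000000,        # timestamp
    0x1a05db8b,        # compressed difficulty target ("bits")
    0,                 # nonce -- the only field the mining engine iterates
)

# A miner simply sweeps the nonce: data in -> data out,
# with no streaming state carried between candidates.
for nonce in range(4):
    candidate = header[:-4] + struct.pack("<L", nonce)
    print(nonce, double_sha256(candidate)[::-1].hex())
```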
So 20x is the performance of an unoptimized design running at an inefficient clock speed. A Bitcoin-optimized single-pass double hasher (with all the shortcuts) running at 1 GHz on a 130nm process is probably more like 40x to 50x. The move to smaller processes will be combined with increased experience and optimization, so the jumps in performance beyond that are going to happen faster than Moore's law. The end game of 100x FPGA (and 500x GPU) performance in Mhash/J is likely conservative at the same manufacturing process (i.e. 28nm GPU, FPGA, and ASIC).
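In back-of-envelope form (only the 400 and 20 Mhash/J figures come from the sources above; the optimization multipliers are assumptions):

```python
# Rough efficiency stack-up; the optimization multipliers are guesses, not measured data.
fpga_mhash_per_j = 20          # Spartan6-class 45nm FPGA
asic_mhash_per_j = 400         # unoptimized 130nm SHA-3 test chip
baseline_gain = asic_mhash_per_j / fpga_mhash_per_j        # ~20x

for opt in (2.0, 2.5):         # assumed gain from a Bitcoin-specific single-pass double hasher
    print(f"optimization x{opt}: ~{baseline_gain * opt:.0f}x vs FPGA")   # ~40x to ~50x
```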
However, I do agree that Mhash/dollar will be a more interesting metric to watch than Mhash/J. I wonder why you think ASICs will contribute a 1000x improvement in this area (going from $1 per MH/s to $1 per GH/s)?
Based on the square mm and a clock speed of 1 GHz, the raw manufacturing cost would be closer to $0.10 per GH/s. Now granted, you have the NRE, the capital cost, the profit margins, yield losses, salaries, etc., but even with a 1000% markup, <$1 per GH/s would be possible.
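As a quick sketch of that markup arithmetic (only the $0.10/GH/s raw-silicon figure comes from the estimate above; the overhead multiplier is illustrative):

```python
# Raw die cost vs. delivered price per GH/s; "1000% markup" read loosely as ~10x raw cost.
raw_cost_per_ghs = 0.10        # estimated manufacturing cost in $ per GH/s
overhead_multiplier = 10       # NRE, capital, yield losses, salaries, margin, etc.

price_per_ghs = raw_cost_per_ghs * overhead_multiplier
print(f"~${price_per_ghs:.2f} per GH/s")   # ~$1, i.e. ~1000x better than $1 per MH/s
```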
One way to look at it: the SHA-256 hasher only took 20 kGE. Let's say scaling it to 1 GHz required twice as many gates, and you want it to perform a double hash; so 80 kGE. Obviously you wouldn't make a chip that small, but hashing is perfectly parallel. Instead of a single hashing engine running at 1 GHz, you could lay down 20 parallel engines, so that on each clock 20 nonces are calculated simultaneously (20 GH/s @ 1 GHz). Even that would only be ~1.6M GE. Tiny by modern chip standards (where transistor counts run into the billions). The $20 CPU in your smartphone likely has a higher transistor count.
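Putting this paragraph's rough numbers together (all figures are the estimates above, not measurements from the paper):

```python
# Gate-count sketch for a small multi-engine die.
base_core_kge = 20       # single SHA-256 core from the SHA-3 test chip
clock_penalty = 2        # assumed area cost of pushing the core to ~1 GHz
double_hash   = 2        # Bitcoin hashes each header twice
engine_kge    = base_core_kge * clock_penalty * double_hash   # 80 kGE per engine

engines   = 20
total_kge = engines * engine_kge          # ~1,600 kGE, i.e. ~1.6M gate equivalents
ghash_s   = engines * 1.0                 # one nonce per engine per cycle at 1 GHz

print(f"{total_kge} kGE total for ~{ghash_s:.0f} GH/s")
```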