The L3 cache by itself is almost half of the chip.
I looked at an image of the Haswell die and the L3 appears to be less than 20%. The integrated GPU is taking up more space on the consumer models. On the server models there is no GPU, so the cache is probably a higher percentage of the die.
There is also a 64-bit multiply, which I'm told is non-trivial. Once you combine that with your observation about Intel having a (likely persistent) process advantage (and the inherent unit-cost advantage of a widely used general-purpose device), there just isn't much, if anything, left for an ASIC-maker to work with.
So no, I don't think the point is really valid. You won't be able to get thousands of times anything with a straightforward ASIC design here. There may be back doors, though; we don't know. The point about the lack of a clear writeup and peer review is valid.
The CPU has an inherent disadvantage in that it is designed to be a general purpose computing device so it can't be as specialized at any one computation as an ASIC can be.
This is obviously going to be true, but the scope of the task here is very different. Because every instance needs its own large working set, thousands of parallel copies will not work.
I believe that is wrong. I suspect an ASIC can be designed that vastly outperforms the CPU (at least on a power-efficiency basis), and one of the reasons is that the algorithm is so complex; it probably has many ways to be optimized with specific circuitry instead of generalized circuitry. My point is that isolating a simpler ("enveloped") instruction such as aesenc would be a superior strategy (and embracing USB-pluggable ASICs and getting them spread out to consumers).
Also, I had noted (see my post in my thread from a couple of months ago) that because the AES is incorrectly employed as a random oracle (as the index for lookups into the memory table), the algorithm is very likely subject to some reduced solution space. This is perhaps Claymore's advantage (I could probably figure it out if I were inclined to spend sufficient time on it).
There is no cryptographic analysis of the hash. It might have impossible images, collisions, etc.
I strongly disagree.
The algorithm is *not* complex, it's very simple. Grab a random-indexed 128-bit value from the big lookup table. Mix it using a single round of AES. Store part of the result back. Use that to index the next item. Mix that with a 64-bit multiply. Store back. Repeat. It's intellectually very close to scrypt, with a few tweaks to take advantage of things that are fast on modern CPUs.
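To make that concrete, here is a minimal C sketch of the loop just described, using AES-NI intrinsics. The state variables (a, b), their initialization, and exactly which words get stored back are simplifying assumptions on my part, not the byte-exact reference algorithm; only the overall read / AES-round / store / read / multiply / store pattern is the point:

```c
#include <stdint.h>
#include <string.h>
#include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

#define SCRATCHPAD_BYTES (2u * 1024 * 1024)  /* the 2MB lookup table */
#define ITERATIONS       (1u << 19)          /* main-loop iteration count */

/* Simplified sketch of the loop described above; 'a' and 'b' are 128-bit
 * state words assumed to be initialized from the block-hashing state. */
static void cryptonight_loop_sketch(uint8_t *scratchpad, __m128i a, __m128i b)
{
    for (uint32_t i = 0; i < ITERATIONS; i++) {
        /* Grab a random-indexed 128-bit value; the index comes from 'a',
         * masked to a 16-byte-aligned offset inside the 2MB table. */
        size_t idx = (uint64_t)_mm_cvtsi128_si64(a) & (SCRATCHPAD_BYTES - 16);
        __m128i c = _mm_loadu_si128((const __m128i *)(scratchpad + idx));

        /* Mix it with a single AES round keyed by 'a'. */
        c = _mm_aesenc_si128(c, a);

        /* Store part of the result back into the table. */
        _mm_storeu_si128((__m128i *)(scratchpad + idx), _mm_xor_si128(b, c));

        /* Use the mixed value to index the next item. */
        idx = (uint64_t)_mm_cvtsi128_si64(c) & (SCRATCHPAD_BYTES - 16);
        uint64_t d;
        memcpy(&d, scratchpad + idx, sizeof d);

        /* Mix with a widening 64-bit multiply and fold the product into 'a'
         * (unsigned __int128 is a GCC/Clang extension). */
        unsigned __int128 prod = (unsigned __int128)(uint64_t)_mm_cvtsi128_si64(c) * d;
        a = _mm_add_epi64(a, _mm_set_epi64x((int64_t)(uint64_t)(prod >> 64),
                                            (int64_t)(uint64_t)prod));

        /* Store back, then carry the AES result into the next iteration. */
        _mm_storeu_si128((__m128i *)(scratchpad + idx), a);
        b = c;
    }
}
```

Note that each table index depends on the value produced in the previous step, so the accesses are serial and unpredictable -- which is exactly why the 2MB working set, not the arithmetic, is the bottleneck.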
Claymore has no fundamental advantage beyond lots of memory bandwidth and compute. His results are actually slightly slower than what is achievable on a GPU with no algorithmic magic -- compare Claymore's speeds to tsiv's for nvidia and extrapolate another 10%-20% due to slightly better code.
Remember that there are two ways to implement the CryptoNight algorithm:
(1) Try to fit a few copies in cache and pound the hell out of them;
(2) Fit a lot of copies in DRAM and use a lot of bandwidth.
Approach (1) is what's being done on CPUs. Approach (2) is what's being done on GPUs. I tried implementing (2) on a CPU and couldn't get it to perform as well as my back-of-the-envelope analysis suggests it should, but it's possible it could outperform the current CPU implementations by about 20%. (I believe yvg1900 tried something similar and came to the same conclusion I did.) An ASIC approach might well be better off with (2), but that simply moves the bottleneck to the memory controller, and it's a hard engineering job compared to building an AES unit, a 64-bit multiplier, and 2MB of DRAM. But that 2MB of DRAM area limits you in a big way.
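For what it's worth, here is the rough shape of that back-of-the-envelope bound for approach (2). The iteration count and the two 16-byte read/write pairs per iteration follow from the loop structure above; the DRAM bandwidth figure is just an illustrative assumption, so treat the output as an order-of-magnitude ceiling, not a prediction:

```c
#include <stdio.h>

int main(void)
{
    /* Approach (2): every copy streams its scratchpad through DRAM, so the
     * hash rate is bounded by memory bandwidth rather than compute. */
    const double iterations     = 1 << 19;        /* main-loop iterations per hash */
    const double bytes_per_iter = 2 * (16 + 16);  /* two 16-byte read+write pairs */
    const double traffic        = iterations * bytes_per_iter;  /* ~32 MiB per hash */

    /* Illustrative assumption: ~25 GB/s, roughly dual-channel DDR3 territory. */
    const double dram_bandwidth = 25e9;

    printf("memory traffic per hash: %.1f MiB\n", traffic / (1024 * 1024));
    printf("bandwidth-limited ceiling: ~%.0f hashes/s\n", dram_bandwidth / traffic);
    return 0;
}
```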
In my best professional opinion, barring funky weaknesses lingering within the single round of AES, CryptoNight is a very solid PoW. Its only real disadvantage is comparatively slow verification time, which really hurts the time to download and verify the blockchain.