I don't disagree with any of this in principle. There's just no thorough analysis of exactly how it will end up on GPU. Sure, maybe the multiplication is 20% of the time on CPU, but what does that become on GPU, 60-80%? We know it will slow down there, and a lot depends on how things are implemented. XPM has big-integer math too, but those numbers almost always fit in registers. These are too large, and I think 2048-bit × 32×256-bit is way too large for local memory. So where the CPU is doing fast 64-bit multiplies in registers with a nice data cache behind it, the GPU is executing a handful of operations and having to hit global memory to do the same work. Implementing this on GPU is quite difficult if you do it the GMP way, which is Karatsuba, I believe.
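To make the register-pressure point concrete, here's a minimal CUDA sketch of a single-thread 2048-bit × 256-bit schoolbook multiply with 32-bit limbs. The limb counts are my assumptions from the sizes above, not from any actual miner; the point is just that operands plus product are ~144 live 32-bit words per thread, already past a comfortable register budget, so the compiler will spill to local (i.e. global) memory.

```cuda
// Hypothetical sketch: one thread doing a 2048-bit x 256-bit schoolbook
// multiply with 32-bit limbs. Sizes are assumptions from the discussion
// above, not from a real miner.
#include <cstdio>
#include <cstdint>

#define LIMBS_A 64   // 2048 bits / 32
#define LIMBS_B 8    // 256 bits / 32

__device__ void mul_2048_256(const uint32_t *a, const uint32_t *b,
                             uint32_t *r /* LIMBS_A + LIMBS_B limbs */)
{
    for (int i = 0; i < LIMBS_A + LIMBS_B; ++i) r[i] = 0;
    for (int i = 0; i < LIMBS_B; ++i) {
        uint64_t carry = 0;
        for (int j = 0; j < LIMBS_A; ++j) {
            // 32x32 -> 64-bit partial product plus running column sum
            uint64_t t = (uint64_t)a[j] * b[i] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + LIMBS_A] = (uint32_t)carry;
    }
}

__global__ void demo(const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    // 64 + 8 + 72 = 144 live 32-bit values per thread: this is the
    // register/local-memory pressure point described above.
    mul_2048_256(a, b, r);
}

int main()
{
    uint32_t *a, *b, *r;
    cudaMallocManaged(&a, LIMBS_A * sizeof(uint32_t));
    cudaMallocManaged(&b, LIMBS_B * sizeof(uint32_t));
    cudaMallocManaged(&r, (LIMBS_A + LIMBS_B) * sizeof(uint32_t));
    for (int i = 0; i < LIMBS_A; ++i) a[i] = 0xffffffffu;
    for (int i = 0; i < LIMBS_B; ++i) b[i] = 0xffffffffu;
    demo<<<1, 1>>>(a, b, r);
    cudaDeviceSynchronize();
    printf("top limb: %08x\n", r[LIMBS_A + LIMBS_B - 1]);
    return 0;
}
```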
Keep in mind X11 is not that much faster on GPU. Maybe 3x. Throw a wrench into the works that eats 60% of the time and I think that metric will suffer. I can't guarantee it; maybe not so much, maybe a little. Not even sure it matters. It's just a PoW. People will work it with whatever they have.
X11 on Nvidia is 6x CPU, just to keep the record straight, though it's always possible that some of C&C's groestl optimizations could be applied to AVX2 as well. That's not Nvidia magic; it's good programming and optimization.
There's no way a good implementation of this would hit global memory in a non-streaming way; such a design would be silly and doing it wrong. Yes, it might be optimal to stream all of the result hashes to memory, read them back in a different kernel with fewer threads per block, do the 2048-bit math, and then stuff them back into global for the final sha256 (sketched below), but two full streams of the hashes are well within the bandwidth capacity of the GPU for the speed of hashing and math we're talking about.
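Here's a structural sketch of that three-kernel streaming pipeline, with the hash cores stubbed out since only the memory-traffic pattern matters. Kernel names, buffer sizes, and block shapes are all hypothetical; the point is that each intermediate hash is written to global once and read back once, never random-accessed.

```cuda
// Structural sketch of the streaming pipeline described above. Hash and
// sha256 cores are stubbed with placeholders; all names/sizes assumed.
#include <cstdint>

#define NONCES      (1u << 18)
#define HASH_WORDS  16          // 512-bit intermediate hash
#define PROD_WORDS  72          // 2048 + 256 bits of product

__global__ void hash_kernel(uint32_t *hashes)
{
    uint32_t nonce = blockIdx.x * blockDim.x + threadIdx.x;
    // ... compute the 512-bit hash for this nonce (core elided) ...
    for (int i = 0; i < HASH_WORDS; ++i)
        hashes[nonce * HASH_WORDS + i] = nonce ^ i;  // placeholder
}

__global__ void bigmul_kernel(const uint32_t *hashes, uint32_t *products)
{
    // Fewer threads per block here, so each thread can afford the
    // register/local footprint of the wide multiply.
    uint32_t nonce = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t acc[PROD_WORDS] = {0};
    // ... 2048-bit math using this nonce's hash (elided); the
    // placeholder just touches the data so the stream is real.
    for (int i = 0; i < HASH_WORDS; ++i)
        acc[i] ^= hashes[nonce * HASH_WORDS + i];
    for (int i = 0; i < PROD_WORDS; ++i)
        products[nonce * PROD_WORDS + i] = acc[i];
}

__global__ void final_sha256_kernel(const uint32_t *products, uint32_t *out)
{
    uint32_t nonce = blockIdx.x * blockDim.x + threadIdx.x;
    // ... sha256 over products[nonce] (core elided) ...
    out[nonce] = products[nonce * PROD_WORDS];  // placeholder
}

int main()
{
    uint32_t *hashes, *products, *out;
    cudaMalloc(&hashes,   NONCES * HASH_WORDS * sizeof(uint32_t));
    cudaMalloc(&products, NONCES * PROD_WORDS * sizeof(uint32_t));
    cudaMalloc(&out,      NONCES * sizeof(uint32_t));

    hash_kernel        <<<NONCES / 256, 256>>>(hashes);
    bigmul_kernel      <<<NONCES / 64,  64>>>(hashes, products);
    final_sha256_kernel<<<NONCES / 256, 256>>>(products, out);
    cudaDeviceSynchronize();

    cudaFree(hashes); cudaFree(products); cudaFree(out);
    return 0;
}
```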
You could imagine some fun ways of parallelizing it, given that the multiplication is commutative and associative: sha256 and haval would combine nicely in one big parallel multiply across nonces to produce a 512-bit next operand. Fun thought experiment. :-)
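A toy sketch of that thought experiment: since multiplication is commutative and associative, n operands can be combined in log2(n) rounds of pairwise multiplies. Single 64-bit words stand in for the real multi-limb hashes here, purely to show the reduction shape; everything about it is hypothetical.

```cuda
// Product-tree toy: combine N operands in log2(N) pairwise-multiply
// rounds. Real code would carry multi-limb operands; 64-bit words
// (wrapping mod 2^64) keep the structure visible.
#include <cstdio>
#include <cstdint>

__global__ void mul_pairs(const uint64_t *in, uint64_t *out, int n_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_out)
        out[i] = in[2 * i] * in[2 * i + 1];  // wraps; real code carries limbs
}

int main()
{
    const int N = 1 << 10;
    uint64_t *buf_a, *buf_b;
    cudaMallocManaged(&buf_a, N * sizeof(uint64_t));
    cudaMallocManaged(&buf_b, N * sizeof(uint64_t));
    for (int i = 0; i < N; ++i) buf_a[i] = 2 * i + 1;  // stand-in "hashes"

    for (int n = N / 2; n >= 1; n /= 2) {              // log2(N) rounds
        mul_pairs<<<(n + 255) / 256, 256>>>(buf_a, buf_b, n);
        cudaDeviceSynchronize();
        uint64_t *t = buf_a; buf_a = buf_b; buf_b = t;  // ping-pong buffers
    }
    printf("product mod 2^64: %llu\n", (unsigned long long)buf_a[0]);
    return 0;
}
```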
A GPU implementation would likely just use schoolbook. The multiplication algorithm GMP uses depends a little on the bit sizes involved, but it's probably Toom-Cook at these sizes. (It's a little tricky because you know that one of the operands is always <= 512 bits, and I don't know how GMP optimizes that case.) Karatsuba and Toom-Cook are tricky, but they're not magic, just code; see the sketch below.
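As a tiny illustration that Karatsuba is "just code": one level of it on a 64×64 → 128-bit multiply, using the standard subtractive form so every partial product stays within 32×32 → 64 bits. This is host-side code (it compiles under nvcc, and the check assumes a compiler with __int128); GMP's Karatsuba/Toom-Cook recurse the same idea over limb arrays.

```cuda
// One Karatsuba level: three 32x32 multiplies plus adds instead of four.
#include <cstdio>
#include <cstdint>

static void karatsuba64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t z0 = (uint64_t)a0 * b0;           // low  product
    uint64_t z2 = (uint64_t)a1 * b1;           // high product

    // Subtractive middle term keeps operands within 32 bits:
    // (a0-a1)*(b0-b1) = z0 + z2 - (a0*b1 + a1*b0)
    uint32_t ta = a0 >= a1 ? a0 - a1 : a1 - a0;
    uint32_t tb = b0 >= b1 ? b0 - b1 : b1 - b0;
    int neg = (a0 >= a1) != (b0 >= b1);        // sign of (a0-a1)(b0-b1)
    uint64_t m = (uint64_t)ta * tb;

    // z1 = a0*b1 + a1*b0 can be 65 bits, so track its top bit by hand.
    uint64_t z1 = z0 + z2;
    uint64_t z1hi = z1 < z0;                   // carry out of the add
    if (neg) { z1 += m; z1hi += (z1 < m); }
    else     { z1hi -= (z1 < m); z1 -= m; }

    *lo = z0 + (z1 << 32);
    *hi = z2 + (z1 >> 32) + (z1hi << 32) + (*lo < z0);
}

int main()
{
    uint64_t hi, lo;
    uint64_t a = 0xDEADBEEFCAFEBABEull, b = 0x0123456789ABCDEFull;
    karatsuba64(a, b, &hi, &lo);
    unsigned __int128 ref = (unsigned __int128)a * b;  // reference check
    printf("%s\n", (hi == (uint64_t)(ref >> 64) && lo == (uint64_t)ref)
                       ? "match" : "MISMATCH");
    return 0;
}
```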
(I have no philosophical or financial horse in this race. I think the excessive pursuit of CPU-only coins is silly, and I do agree with you that the large bit size is a deterrent; I haven't seen, e.g., any public GPU miners for Riecoin, and the bit size is one of the big reasons. But if the coin goes moonward, a GPU miner will get written, and I think it'll put the CPUs out to pasture.)