I'm rather a fan of the PoW, seeing as how I wrote it and all. I think nobody has really sat down to think about how hard that "silly multiplication" is for GPUs and ASICs, and there is no evidence that a GPU miner has been created. I'm working on one in my spare time to prove how much it will suck. I mean, it will work, but it won't be massively faster than CPU, especially if someone sits down to optimize the CPU side for AVX. The ultimate combination is probably GPU for hashing and CPU for multiply. You might get somewhere doing that, but it's a whole new kind of rig to have that kind of balance.
Lemme give you some numbers:
750ti = 160 32x32-bit multipliers @ 1 GHz = 160B 32-bit muls per second
4770k = 16 64x64-bit multiplications per clock @ 3.6 GHz = 57.6B 64-bit muls ~= 230.4B 32-bit muls per second (counting each 64x64 as four 32x32)
There's no fighting that. It is raw ALU power.
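For anyone who wants to check the arithmetic behind those two lines, here is a rough sketch. The unit counts and clocks are the ones quoted above; the function name and the rule that one 64x64 multiply counts as four 32x32 multiplies are assumptions made for the comparison, not measurements.

```python
# Back-of-envelope peak multiply throughput, using the figures quoted above.
# Assumption (not from a datasheet): one 64x64 multiply is worth four
# 32x32 multiplies, as in a schoolbook split into partial products.

def peak_mul32_per_sec(units_per_clock, clock_hz, mul32_per_op=1):
    """Peak 32x32-equivalent multiplies per second (hypothetical helper)."""
    return units_per_clock * clock_hz * mul32_per_op

# 750ti: 160 32x32 multipliers at 1 GHz
gpu_mul32 = peak_mul32_per_sec(160, 1_000_000_000)

# 4770k: 16 64x64 multiplies per clock at 3.6 GHz
cpu_mul64 = peak_mul32_per_sec(16, 3_600_000_000)
cpu_mul32 = peak_mul32_per_sec(16, 3_600_000_000, mul32_per_op=4)

print(f"GPU: {gpu_mul32 / 1e9:.1f}B 32-bit muls/s")
print(f"CPU: {cpu_mul64 / 1e9:.1f}B 64-bit muls/s, "
      f"{cpu_mul32 / 1e9:.1f}B 32-bit-equivalent muls/s")
print(f"CPU/GPU ratio: {cpu_mul32 / gpu_mul32:.2f}x")
```

These are peak ALU figures only. They say nothing about how much of the miner's runtime is actually spent multiplying.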
You're underestimating the fraction of time spent in the hashes vs the multiply. Less than 12% of the current execution time is in gmp bignum functions with your current code, with 88% in hashing (22% of that in one hash function -- easy target for optimization, which I'm sure wolf or someone else has already done). The one saving grace there is that gmp is avx2 optimized on my platform already, and none of your hashes are yet. But that still leaves bignum at < 30% of the time once everything's on an equal optimization footing. And it's likely that there's some low-hanging optimization possible, given that 6.7% of the time is spent in __gmpz_export and that no effort is made to avoid unnecessary allocation and deallocation of the mpzs. Hint: there are faster ways to get the data out of gmp if you don't care about portability.
Taking that all together, I'd guess that at most 15-20% of the eventual optimized runtime will be in the multiplication. The GPUs will win. You've created a *great* target for Claymore, though, with his pre-built library of bignum routines from writing the XPM miner.
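To put the 15-20% guess in perspective, here is a small Amdahl's-law sketch. It assumes the worst case for the GPU: the multiply gets no speedup at all and only the hashing gets faster. The function name and the 10x hashing speedup are made up for illustration and are not measured numbers.

```python
# Amdahl's-law estimate: the multiply share of runtime caps the overall
# speedup a GPU can get from faster hashing alone.
# Assumption: fractions are shares of the optimized CPU runtime, and the
# multiply runs no faster on the GPU (mul_speedup = 1) unless stated.

def overall_speedup(mul_fraction, hash_speedup, mul_speedup=1.0):
    """Overall speedup given per-part speedups (hypothetical helper)."""
    hash_fraction = 1.0 - mul_fraction
    return 1.0 / (mul_fraction / mul_speedup + hash_fraction / hash_speedup)

for mul_fraction in (0.15, 0.20):
    # Illustrative 10x hashing speedup, not a benchmark result.
    at_10x = overall_speedup(mul_fraction, hash_speedup=10.0)
    # Upper bound as the hashing speedup grows without limit.
    ceiling = overall_speedup(mul_fraction, hash_speedup=float("inf"))
    print(f"multiply = {mul_fraction:.0%} of runtime: "
          f"{at_10x:.2f}x at 10x hashing, ceiling {ceiling:.2f}x")
```

Even with the multiply stuck at CPU speed, a 15-20% share still leaves room for a roughly 5x to 6.7x overall gain in this model. The multiply only protects the CPU if it dominates the runtime.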
(And to those reading this, no, I don't have an optimized miner - I ran it through a profiler to see what it was, but decided I was bored of miner tweaking this week.)
((p.p.s - no, GPU for hashing and CPU for multiply would be horrible. You'll just saturate your PCIe bandwidth.))