I'm rather a fan of the PoW, seeing as how I wrote it and all. I think nobody has really sat down to think about how hard that "silly multiplication"
Have you actually profiled it? I have, and most of the time is currently being spent in the hash functions, not the multiply. Granted, that is CPU, not GPU, but I'm skeptical that the multiply will be a major bottleneck. Someone may prove me wrong, we'll see.
On CPU, multiplication is highly optimized; on GPU, even the oh-so-hard-to-do methods will have big-time problems. I'm willing to believe that most of the time is spent in hashes on CPU, but that can change if somebody optimizes the difficult hash functions, and it's only true because the CPU is so efficient at multiplication. Like I said, I think doing the hashes on GPU and the multiplying on CPU will turn out to be the setup. How much that buys you, I don't know. Maybe you triple the performance of a 4770k by adding a 750ti? Whatever it is, it won't be any sempron with 1x pcie. Maybe a 750ti goes the same speed as a 4770k by itself. These are all possibilities that don't seem highly advantageous to me. Still going to get mopped up by a dual xeon on EC2 without the capital costs.
Pretty sure nobody has it, although Wolf has been quiet for the past day. I'm working on it.
AMD GCN, for instance, performs any VALU operation in 4 clocks. mul_hi and mul_lo are two distinct operations though, so it takes 8 clocks to multiply two 32-bit ints in all running threads. An addition or a bitwise operation takes 4. Multiplication is obviously not the bottleneck in this design.
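To make that count concrete, here's a rough Python sketch of what the two operations compute. The function names mirror mul_hi and mul_lo from the post; `mul_wide` and the clock constants are my own names for illustration, and the 4-clock figure is just the one quoted above, not something measured here.

```python
MASK32 = 0xFFFFFFFF

def mul_lo(a, b):
    """Low 32 bits of the product of two unsigned 32-bit ints."""
    return (a * b) & MASK32

def mul_hi(a, b):
    """High 32 bits of the product of two unsigned 32-bit ints."""
    return ((a * b) >> 32) & MASK32

def mul_wide(a, b):
    """Full 64-bit product, assembled from the two halves (hypothetical helper)."""
    return (mul_hi(a, b) << 32) | mul_lo(a, b)

# Cost model taken from the post: every VALU op is 4 clocks,
# and a full 32x32 multiply needs both mul_hi and mul_lo.
CLOCKS_PER_VALU_OP = 4
CLOCKS_PER_FULL_MUL = 2 * CLOCKS_PER_VALU_OP

a, b = 0xFFFFFFFF, 0xFFFFFFFF
print(hex(mul_hi(a, b)), hex(mul_lo(a, b)), CLOCKS_PER_FULL_MUL)
```

So under that model a full-width multiply costs twice what an add or a bitwise op does, nothing more.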
This is true, but it depends on what you think of as threads. The thing is a barrel processor, and a "clock" really takes as many cycles as it takes to get through the barrel. Really you have to figure the instruction issuance rate, which is SIMD width * number of cores * clock rate. The number of resident threads in the barrel does not increase issuance, only the probability that an instruction is able to issue on every clock.
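The issuance formula above is easy to play with in a few lines of Python. This is only a back-of-envelope sketch: the function name and the example figures (16 lanes, 40 cores, 1 GHz) are made up for illustration and don't describe any particular card.

```python
def peak_issue_rate(simd_width, num_cores, clock_hz):
    """Peak lane-operations issued per second: SIMD width * cores * clock.

    Note that the number of resident threads is not a parameter.
    More threads in the barrel only make it more likely that
    something is ready to issue on a given clock.
    """
    return simd_width * num_cores * clock_hz

# Hypothetical device, numbers chosen only to show the arithmetic.
rate = peak_issue_rate(simd_width=16, num_cores=40, clock_hz=1_000_000_000)
print(rate)  # lane-operations per second at 100% issue
```

Anything you actually measure will come in under this number, since it assumes an instruction issues on every single clock.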
Also, GCN has only a 24-bit multiplier, and there are some extra instructions for combining even 32-bit multiplies into 64. You need 4 muls + some adds. And then you have mul_hi and mul_lo. So it's a whole mess of instructions. Not to mention the amount of memory required. Probably too big for lmem, so you have lots of load/store traffic going on.
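For anyone wondering where "4 mul + some adds" comes from, it's the usual schoolbook split: cut each operand into a high and a low half, form the four partial products, then shift and add. Here's a minimal Python sketch of that, assuming a 32-bit half width by default. Which exact widths the post has in mind is my guess, and `schoolbook_mul` is a name I made up.

```python
def schoolbook_mul(a, b, half_bits=32):
    """Multiply two unsigned (2 * half_bits)-bit ints using four
    half_bits x half_bits partial products plus shifts and adds."""
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits

    # The four multiplies.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # The adds. Python ints carry automatically; on real hardware each
    # add here needs explicit carry handling, which is where the extra
    # instructions come from.
    return ll + ((lh + hl) << half_bits) + (hh << (2 * half_bits))

a = 0xFFFFFFFFFFFFFFFF
b = 0x123456789ABCDEF0
print(schoolbook_mul(a, b) == a * b)
```

Every doubling of the operand width multiplies the number of partial products by four with this method, so the instruction count adds up quickly for big operands.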