1) Montgomery Reduction is used.
2) The size of the multiprecision arithmetic is fixed.
3) An optimized sieve is running on the GPU.
4) An optimized primorial search is running on the GPU (double SHA-256).
Ok, here's my take on these claims.
1) There's no question that GPUs are faster at modular multiplication. Supercomputing has already linked plenty of papers on that.
The missing piece of information about his implementation is how he goes from multiplication to exponentiation. The Fermat test in Primecoin is all about modular exponentiation. The well-known algorithm for that is square-and-multiply: one modular squaring per exponent bit, plus a modular multiplication whenever that bit is set. I think this forces some data-dependent branching on the GPU, which slows it down slightly.
2) Fixing the precision definitely gives a minor speedup and makes the implementation easier. The only caveat is that it may not be future-proof when we move to longer chains.
3) Yes, I think it's possible to implement a much more efficient sieve if you exploit the shared memory on the GPU.
4) I'm guessing that primorial search refers to finding header hashes divisible by a primorial. The CPU implementation searches for hashes that are divisible by 7# (= 2 * 3 * 5 * 7 = 210). On the CPU this takes only a tiny fraction of the total mining time. If you have a fast GPU implementation of it, you can search for hashes divisible by much larger primorials. This might get him a minor speedup.
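To make point 4 concrete, here's a rough Python sketch of what I understand a primorial search to be. It is not the actual miner code. The header layout (`header_prefix` plus a 4-byte nonce), the little-endian interpretation of the hash, and the names `primorial`, `double_sha256` and `find_nonce` are all my own assumptions for illustration, and it ignores any other validity rules on the header hash.

```python
import hashlib


def primorial(p):
    """Product of all primes <= p, e.g. primorial(7) == 2 * 3 * 5 * 7 == 210."""
    product = 1
    for n in range(2, p + 1):
        # Trial division is fine here: p is tiny.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            product *= n
    return product


def double_sha256(data):
    """SHA-256 applied twice, returned as an integer.

    Assumption: the digest is read as a little-endian 256-bit number.
    """
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    return int.from_bytes(digest, "little")


def find_nonce(header_prefix, p, max_nonce=1_000_000):
    """Try nonces until the double SHA-256 hash is divisible by p#.

    header_prefix is a hypothetical stand-in for the real block header bytes.
    Returns (nonce, hash) or None if nothing was found within max_nonce tries.
    """
    target = primorial(p)
    for nonce in range(max_nonce):
        h = double_sha256(header_prefix + nonce.to_bytes(4, "little"))
        if h % target == 0:
            return nonce, h
    return None
```

A random hash is divisible by p# with probability 1/p#, so going from 7# to 11# makes the search about 11 times more expensive (1 in 2310 instead of 1 in 210). That is why a larger primorial only makes sense if hashing is very cheap, which is where a GPU would help.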
Note that his faster primorial search will be obsolete once mining protocol v0.2 is enforced. Link: http://www.peercointalk.org/index.php?topic=453.0
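Going back to point 1, here's a minimal Python sketch of the squaring-and-multiplication approach to modular exponentiation and the Fermat test built on it, just to show where the branch comes from. This is the textbook left-to-right binary method, not his GPU code. The function names `mod_pow` and `fermat_test` are mine, and a real implementation would use Montgomery multiplication on fixed-size multiprecision integers instead of Python's big ints and `%`.

```python
def mod_pow(base, exponent, modulus):
    """Left-to-right binary exponentiation: base**exponent mod modulus."""
    result = 1 % modulus
    base %= modulus
    for bit in bin(exponent)[2:]:          # exponent bits, most significant first
        result = (result * result) % modulus   # always square
        if bit == "1":                         # branch depends on the exponent bit
            result = (result * base) % modulus
    return result


def fermat_test(n, base=2):
    """Fermat probable-prime test: True if base**(n-1) == 1 (mod n)."""
    if n < 3:
        return n == 2
    return mod_pow(base, n - 1, n) == 1
```

The `if bit == "1"` line is the branching I mean. The exponent is n - 1, so it differs for every candidate, and GPU threads testing different candidates side by side would want to take different paths at that point. One way around it, I assume, is to always compute the multiplication and then select the result based on the bit, at the cost of some wasted work.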