Well, not exactly 2, thanks to a few possible optimizations:
you can drop the last 3 rounds completely (they don't change H), and skip part of the round before them (you only need the E output of the 4th-to-last round).
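To make the "last 3 rounds don't change H" point concrete, here is a rough Python sketch (not miner code, just a check of the idea). The helper names (`icbrt`, `pad_single_block`, `last_word_full`, `last_word_early`) are made up for illustration, and the round constants are derived from the primes rather than typed in by hand. It assumes a single already-padded 64-byte block hashed from the standard initial state.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def icbrt(n):
    """Exact integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# First 64 primes (the 64th is 311); SHA-256 constants are built from them.
PRIMES = [p for p in range(2, 312) if all(p % q for q in range(2, p))]
K = [icbrt(p << 96) & MASK for p in PRIMES]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    """Expand a 64-byte block into the 64-word message schedule W."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def rounds(state, w, n_rounds):
    """Run the first n_rounds rounds and return registers a..h."""
    a, b, c, d, e, f, g, h = state
    for i in range(n_rounds):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

def last_word_full(block):
    """Last output word (H) after all 64 rounds."""
    return (H0[7] + rounds(H0, schedule(block), 64)[7]) & MASK

def last_word_early(block):
    """Same word after only 61 rounds: E just shifts down e -> f -> g -> h."""
    # A real kernel would also skip t2 and the new A in round 61,
    # and would not bother expanding W[61..63]; this sketch does not.
    return (H0[7] + rounds(H0, schedule(block), 61)[4]) & MASK

def pad_single_block(msg):
    """SHA-256 padding for a message short enough to fit one block (<= 55 bytes)."""
    return msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", len(msg) * 8)
```

Both functions return the same word for any block, since rounds 62 to 64 only move that E value through F and G into H.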
The initial rounds can be optimized as well: the last DWORD of hMerkleRoot and nTime/nBits don't change between loop iterations, so you can drop the equivalent of ~3 rounds there too.
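A rough Python sketch of the front-end saving, for illustration only. It assumes the usual 80-byte header layout where the second 64-byte block starts with the last DWORD of hMerkleRoot, then nTime, nBits and the nonce, so W[0..2] are fixed and only W[3] changes. The function names (`hash_nonces`, `second_block`, `icbrt`) and the dummy header bytes are hypothetical, and only the first of the two SHA-256 passes is shown.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def icbrt(n):
    """Exact integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# First 64 primes (the 64th is 311); SHA-256 constants are built from them.
PRIMES = [p for p in range(2, 312) if all(p % q for q in range(2, p))]
K = [icbrt(p << 96) & MASK for p in PRIMES]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    """Expand a 64-byte block into the 64-word message schedule W."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def rounds(state, w, start, stop):
    """Run rounds start..stop-1 on the given registers a..h."""
    a, b, c, d, e, f, g, h = state
    for i in range(start, stop):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

def add_state(x, y):
    return [(p + q) & MASK for p, q in zip(x, y)]

def second_block(prefix76, nonce):
    """Bytes 64..79 of the header plus SHA-256 padding for an 80-byte message."""
    tail = prefix76[64:] + struct.pack("<I", nonce)
    return tail + b"\x80" + b"\x00" * 39 + struct.pack(">Q", 80 * 8)

def hash_nonces(prefix76, nonces):
    """First SHA-256 pass over the header for each nonce, reusing constant work."""
    # Done once per work unit: midstate of the first block ...
    mid = add_state(H0, rounds(H0, schedule(prefix76[:64]), 0, 64))
    # ... and rounds 0..2 of the second block, which only see W[0..2].
    w_const = struct.unpack(">3I", prefix76[64:76])
    pre = rounds(mid, w_const, 0, 3)
    digests = {}
    for nonce in nonces:
        w = schedule(second_block(prefix76, nonce))
        out = rounds(pre, w, 3, 64)  # per-nonce work starts at round 3
        digests[nonce] = struct.pack(">8I", *add_state(mid, out))
    return digests
```

The schedule is still recomputed in full per nonce here; the W terms that only depend on the constant words could be hoisted out in the same way.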
Same thing goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a few constants.
Register access is basically free on GPUs (they hide register read/write latency by pipelining 4 "threads" through the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
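One way to read those cycle counts, as a sketch: assume the hardware has a single-instruction bit-select (take bits from one operand where a mask is 1, from another where it is 0). Then Ch is exactly one select, and Maj is one XOR plus one select. The `bitselect` helper below is a hypothetical stand-in for such an instruction, written in Python just to check the identities against the textbook definitions.

```python
MASK = 0xFFFFFFFF

def bitselect(mask, x, y):
    """Per bit: take x where mask is 1, y where mask is 0 (stand-in for a HW op)."""
    return ((mask & x) | (~mask & y)) & MASK

def ch(e, f, g):
    """Ch as a single select: e chooses between f and g."""
    return bitselect(e, f, g)

def maj(a, b, c):
    """Maj as XOR + select: where a and b differ, c decides; otherwise take b."""
    return bitselect(a ^ b, c, b)

def ch_ref(e, f, g):
    """Textbook definition of Ch."""
    return ((e & f) ^ (~e & g)) & MASK

def maj_ref(a, b, c):
    """Textbook definition of Maj."""
    return (a & b) ^ (a & c) ^ (b & c)
```

Both forms are bitwise, so agreeing on all 8 combinations of single-bit inputs is enough to show they agree on full 32-bit words.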
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.
So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.
ArtForz, you are quite the legend on these forums... Glad to see you here!