Only if you use 256-bit operations, which should be avoided until AVX2. But you're only using XMM, not YMM, so none of that comes into play. And sorry, I'm about to hit the wall; I've been up all night. Basically, if you want to use the full YMM registers, you can't do much for integers as of yet. It's coming. However, you can use the lower half of a YMM register by specifying the XMM equivalent. Via AVX, the XMM registers can handle your ADDs, shifts, XORs and ANDs. So,
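To make the point above concrete, here is a minimal C sketch using compiler intrinsics. The function below is a hypothetical helper (not from the miner's source): it computes the SHA-256 "small sigma 0" function, sigma0(x) = rotr(x,7) ^ rotr(x,18) ^ (x >> 3), on four 32-bit lanes at once, using only the shifts, ORs and XORs that AVX already provides at XMM width for integers. Compiled with -mavx, these intrinsics emit the VEX-encoded forms (vpsrld, vpslld, vpxor, etc.); without it, the plain SSE2 forms.

```c
#include <emmintrin.h> /* SSE2 intrinsics; with -mavx the compiler emits VEX (AVX) encodings */
#include <stdint.h>

/* Hypothetical helper: SHA-256 small sigma 0 on four lanes at once.
   sigma0(x) = rotr(x,7) ^ rotr(x,18) ^ (x >> 3)
   Built only from XMM-width integer shifts/ORs/XORs, which AVX supports. */
static __m128i sigma0_x4(__m128i x)
{
    /* rotr(x, n) expressed as (x >> n) | (x << (32 - n)) */
    __m128i r7  = _mm_or_si128(_mm_srli_epi32(x, 7),  _mm_slli_epi32(x, 25));
    __m128i r18 = _mm_or_si128(_mm_srli_epi32(x, 18), _mm_slli_epi32(x, 14));
    __m128i s3  = _mm_srli_epi32(x, 3);
    return _mm_xor_si128(_mm_xor_si128(r7, r18), s3);
}
```

This processes four message words per instruction, which is exactly how the 4-way SSE2 path of a miner gets its speedup.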
Hm, the miner already uses XMM registers (SSE2) in the current implementation.
As for MOVNTDQA, it combines smaller accesses into a single stream via a buffer. It's mainly useful for moving multiple EAX-sized (4-byte) values into XMM registers by turning them into one 64-byte transfer, which increases bandwidth roughly 10x.
It is not applicable to SHA256 calculation.
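For reference, MOVNTDQA is exposed in C as the SSE4.1 intrinsic _mm_stream_load_si128. The sketch below (hypothetical code, not from the miner) issues it over a 16-byte-aligned block; note that on ordinary write-back memory it behaves like a plain aligned load, and its streaming buffer only pays off on write-combining (uncached) memory, which supports the "not applicable here" argument.

```c
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 -> MOVNTDQA */
#include <stdint.h>
#include <stddef.h>

/* Sketch: sum 32-bit lanes across a 16-byte-aligned block using MOVNTDQA
   loads. On normal write-back memory MOVNTDQA acts like a regular MOVDQA
   load; the 64-byte streaming buffer only helps on WC memory. */
__attribute__((target("sse4.1")))
static __m128i sum_block_nt(const uint32_t *buf, size_t n /* multiple of 4 */)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_stream_load_si128((__m128i *)(buf + i)));
    return acc;
}
```

The target attribute lets this compile without -msse4.1 on GCC/Clang; the caller is responsible for checking CPU support before invoking it.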
So, you're telling me we're not moving 16-byte data into XMM registers via 64-byte line fetches? Because I see a few places here where we are. Like here:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]
paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]
movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5
Changed to:
ELSE
movntdqa xmm3, [zsp+5*16] ; fetches the full 64-byte line at zsp into the streaming buffer; reads an extra 16 bytes up front, but transfers at ~7.5x
movntdqa xmm4, [zsp+6*16] ; this part of the line is already buffered, so memory doesn't need to be read again
movntdqa xmm5, [zsp+7*16] ; also served from the buffer, so this load is faster too
paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]
movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5
Note that MOVNTDQA requires SSE4.1, so an IF/ELSE statement (a CPU-feature check) is required around these changes.
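The before/after above, including the required feature check, can be sketched in C intrinsics. The names sp/si/bx are stand-ins for the zsp/zsi/zbx pointers in the assembly, and the dispatch uses a runtime CPUID query in place of the assembler-time IF/ELSE; this is an illustration, not the miner's actual code.

```c
#include <emmintrin.h> /* SSE2: _mm_add_epi32 (PADDD), aligned loads/stores */
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* SSE4.1 path: MOVNTDQA loads, then PADDD and an aligned MOVDQA store. */
__attribute__((target("sse4.1")))
static void add_store_nt(const __m128i *sp, const __m128i *si, __m128i *bx)
{
    for (int i = 5; i <= 7; ++i) {
        __m128i v = _mm_stream_load_si128((__m128i *)&sp[i]); /* movntdqa */
        bx[i] = _mm_add_epi32(v, si[i]);                      /* paddd + movdqa */
    }
}

/* SSE2 fallback: plain MOVDQA loads, same PADDD and store. */
static void add_store_sse2(const __m128i *sp, const __m128i *si, __m128i *bx)
{
    for (int i = 5; i <= 7; ++i)
        bx[i] = _mm_add_epi32(sp[i], si[i]);
}

/* The IF/ELSE from the assembly becomes a runtime feature check here. */
static void add_store(const __m128i *sp, const __m128i *si, __m128i *bx)
{
    if (__builtin_cpu_supports("sse4.1"))
        add_store_nt(sp, si, bx);
    else
        add_store_sse2(sp, si, bx);
}
```

Both paths compute the same result; only the load encoding differs, so a correctness test can't tell them apart and any difference shows up purely in bandwidth.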