400734: f2 0f 51 d6 sqrtsd %xmm6,%xmm2
400738: 66 0f 2e d2 ucomisd %xmm2,%xmm2
40073c: 0f 8a 63 02 00 00 jp 4009a5
400742: 66 0f 28 f2 movapd %xmm2,%xmm6
400746: f2 0f 51 cd sqrtsd %xmm5,%xmm1
40074a: 66 0f 2e c9 ucomisd %xmm1,%xmm1
40074e: 0f 8a d9 01 00 00 jp 40092d
400754: 66 0f 28 e9 movapd %xmm1,%xmm5
400758: f2 0f 51 c7 sqrtsd %xmm7,%xmm0
40075c: 66 0f 2e c0 ucomisd %xmm0,%xmm0
400760: 0f 8a 47 01 00 00 jp 4008ad
400766: 66 0f 28 f8 movapd %xmm0,%xmm7
40076a: f2 0f 51 c3 sqrtsd %xmm3,%xmm0
40076e: 66 0f 2e c0 ucomisd %xmm0,%xmm0
400772: 0f 8a b5 00 00 00 jp 40082d
...which is SQRTSD - Square root - S(calar)-Double.
4 instructions / 4 math operations: each sqrtsd computes one square root.
What could be done differently (Intel syntax follows; the disassembly above is in AT&T syntax):
movlpd xmm1, b //loading the first variable "b" to the lower part of xmm1
movhpd xmm1, bb //loading the second variable "bb" to the higher part of xmm1
SQRTPD xmm1, xmm1 //batch processing both variables for their square root, with one SIMD command
movlpd xmm2, bbb //loading the third variable "bbb" to the lower part of xmm2
movhpd xmm2, bbbb //loading the fourth variable "bbbb" to the higher part of xmm2
SQRTPD xmm2, xmm2 //batch processing their square roots
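movlpd b, xmm1 //storing the lower part of xmm1 back to "b"
movhpd bb, xmm1 //storing the higher part of xmm1 back to "bb"
movlpd bbb, xmm2 //storing the lower part of xmm2 back to "bbb"
movhpd bbbb, xmm2 //storing the higher part of xmm2 back to "bbbb"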
SQRTPD - Square root - P(acked)-Double.
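For anyone who'd rather not write the asm by hand, roughly the same thing can be sketched in C with SSE2 intrinsics. This is only a sketch under my own assumptions: the function name sqrt_packed is made up, the four values are assumed to sit in an array instead of four separate variables (so one unaligned load replaces each movlpd/movhpd pair), and the count is assumed to be even.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_sqrt_pd, ... */

/* Hypothetical helper: replace each of the n doubles in v with its
   square root, two values per SQRTPD. n must be even in this sketch. */
void sqrt_packed(double *v, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        __m128d x = _mm_loadu_pd(&v[i]);  /* load two doubles, no alignment needed */
        x = _mm_sqrt_pd(x);               /* one SQRTPD covers both lanes */
        _mm_storeu_pd(&v[i], x);          /* write both results back */
    }
}
```

Calling sqrt_packed on an array holding b, bb, bbb, bbbb should issue two packed square roots, same as the asm above. Whether a given compiler already does this on its own depends on the compiler and flags, so it's worth checking the disassembly.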
So now 4 math instructions became 2 and the time dropped to about half (I've actually benchmarked the above and it comes out near half). But in order to pack instructions (math or logical) you need the same operation to be applied to several values at once. You can't have that in a scenario where a single hash goes like
sqrt
add
shift
xor
and the operation keeps changing from one step to the next...
But if you loaded 4x hashes together, you'd be looking at
sqrt(of the first) sqrt (of the second) sqrt (third) sqrt (fourth) (<=pack them)
add add add add (<=pack them)
shift shift shift shift (<=pack them)
xor xor xor xor (<=pack them)
...etc
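To make the "4x hashes together" idea concrete, here is a toy sketch in C with SSE2 intrinsics. Big caveat: mix_scalar is a made-up add/shift/xor step, not any real hash function, and all the names here are hypothetical. The point is only that four independent states can go through the same step in lockstep, one packed instruction per operation.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical toy mixing step for ONE state (not a real hash). */
uint32_t mix_scalar(uint32_t h, uint32_t k)
{
    h += k;         /* add   */
    h ^= h << 5;    /* shift, then xor */
    return h;
}

/* The same step applied to FOUR independent states at once. */
void mix_x4(uint32_t h[4], const uint32_t k[4])
{
    __m128i vh = _mm_loadu_si128((const __m128i *)h);
    __m128i vk = _mm_loadu_si128((const __m128i *)k);
    vh = _mm_add_epi32(vh, vk);                     /* add add add add         */
    vh = _mm_xor_si128(vh, _mm_slli_epi32(vh, 5));  /* shift x4, then xor x4   */
    _mm_storeu_si128((__m128i *)h, vh);
}

/* Sanity check: every packed lane must match the scalar result. */
void check_mix_x4(void)
{
    uint32_t h[4] = {1u, 2u, 3u, 0xffffffffu};
    const uint32_t k[4] = {10u, 20u, 30u, 40u};
    uint32_t expect[4];
    for (int i = 0; i < 4; i++)
        expect[i] = mix_scalar(h[i], k[i]);
    mix_x4(h, k);
    for (int i = 0; i < 4; i++)
        assert(h[i] == expect[i]);
}
```

The four states never depend on each other, which is exactly why they pack. Inside a single hash each step usually needs the result of the previous one, so there is nothing to put side by side.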
I wasn't even aware of the above until a couple of weeks ago, when I got down to asm level to see what happens and why some Pascal output was slower than C output. Then I ran into http://x86.renejeschke.de, which I used as a reference while trying to understand the instructions and what they do, and I rewrote some instructions myself, like the packed version above (I thought it was pretty easy really). More recently, I went over the code of the asm hash functions of altcoins and bitcoin, and it was full of serial operations, despite "SSE/AVX use" / "SSE/AVX enhanced". And I'm like WHAT THE F***? This is all crippled.