I've taken care of most of the redundant multiplications and register accesses that I could find.
Now I'm trying to comment out some of the redundant instructions in the CPU hash asm files and could use a hand:
LAB_NEXT_NONCE:
mov rcx, 256 ; 256 - rcx is # of SHA-2 rounds
; mov rax, 64 ; 64 - rax is where we expand to
LAB_SHA:
push rcx
lea rcx, qword [data+(1024)] ; + 1024
lea r11, qword [data+(256)] ; + 256
I want to get rid of that rcx move: it only holds a constant, yet it costs three instructions in total (the mov, the push, and the matching pop further down). I know it's not much, but it's a start at weeding out redundant code.
Also, is it just me, or is rax being set to 0 and then multiplied by 4 before being added to data? And then multiplied by 4 again for no apparent reason?
Edit: I figured out it's part of the macro I overlooked. Haven't slept yet; probably should.
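For anyone else tripping over the same thing, here is a rough Python model of how I read the `[data+rax*4]` addressing together with `add rax, 4` in the macro. The helper name `byte_offsets` and the reading that rax counts 32-bit words (so each step moves one 16-byte XMM vector forward) are my assumptions, not something stated in the source.

```python
# Rough model of the effective address in `movntdqa xmm6, [data+rax*4]`
# followed by `add rax, 4`.
# Assumption: rax counts 32-bit words, and each iteration consumes one
# 16-byte XMM vector (4 lanes x 4 bytes).

def byte_offsets(iterations, start=0, step=4, scale=4):
    """Return the byte offset from `data` used on each iteration."""
    offsets = []
    rax = start                      # mov rax, 0
    for _ in range(iterations):
        offsets.append(rax * scale)  # [data + rax*4]
        rax += step                  # add rax, 4
    return offsets


if __name__ == "__main__":
    print(byte_offsets(4))
```

If that reading is right, the `*4` is just the word-to-byte scale factor in the addressing mode, so it costs no extra multiply instruction.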
%endrep
add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC
pop rcx
mov rax, 0
; Load the init values of the message into the hash.
movntdqa xmm7, [init]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm7, xmm7, 0 ; xmm7 == a
movntdqa xmm0, [init+16]
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm0, xmm0, 0 ; xmm0 == e
LAB_LOOP:
;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32(g_sha256_k[j]) + w[j]
%macro lab_loop_blk 0
movntdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4
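To have something to check the asm against, here is a small scalar reference in Python for two pieces of the listing above: the `pshufd` broadcasts used to load a through h, and the t1 formula from the comment. The function names are mine, `AndNot(e, g)` is assumed to mean `(~e) & g` (the PANDN convention), and the model handles one 32-bit lane at a time rather than four in parallel.

```python
MASK32 = 0xFFFFFFFF


def pshufd(src, imm8):
    """Model of PSHUFD: destination lane i takes source lane (imm8 >> 2*i) & 3.

    `src` is a list of four 32-bit lanes, lane 0 first.
    """
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def t1(e, f, g, h, k, w):
    """t1 = h + Sigma1(e) + Ch(e, f, g) + k + w, modulo 2**32."""
    big_sigma1 = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)
    ch = (e & f) ^ (~e & g)  # assumed meaning of AndNot(e, g): (~e) & g
    return (h + big_sigma1 + ch + k + w) & MASK32


if __name__ == "__main__":
    init = [0xA, 0xB, 0xC, 0xD]      # placeholder lanes, not real hash state
    print(pshufd(init, 0x00))        # lane 0 copied to all lanes ("a")
    print(pshufd(init, 0x55))        # lane 1 copied to all lanes ("b")
    print(pshufd(init, 0xAA))        # lane 2 copied to all lanes ("c")
    print(pshufd(init, 0xFF))        # lane 3 copied to all lanes ("d")
```

The immediates 0x00, 0x55, 0xAA and 0xFF each repeat one 2-bit lane index four times, which is why they act as broadcasts; any other immediate gives a general permutation.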
As a tangent, I found this and wonder if we might be able to code something from it. "There are two meet-in-the-middle preimage attacks against SHA-2 with a reduced number of rounds. The first one attacks 41-round SHA-256 out of 64 rounds with time complexity of 2^253.5 and space complexity of 2^16, and 46-round SHA-512 out of 80 rounds with time 2^511.5 and space 2^3. The second one attacks 42-round SHA-256 with time complexity of 2^251.7 and space complexity of 2^12, and 42-round SHA-512 with time 2^502 and space 2^22."
So basically, if we store parts of the intermediate hash state in an in-memory lookup table as we compute, is there a good chance we could speed up hashing significantly for the first 42 rounds? Is that what you've already taken advantage of, as you mentioned before?