When I finish I will upload it.
The modulo is the problem for me :)
Try something like:
%define arg1f XMM0
%define arg2f XMM1
%define arg3f XMM2
%define arg4f XMM3
%define arg1 RDI
%define arg2 RSI
%define arg3 RDX
%define arg4 RCX
%define arg5 R8
%define arg6 R9
%macro MULT 3
; Dot the 4-float vector at [%1+%3] with the four rows held in XMM0, XMM1, XMM4, XMM5
; (arg1f, arg2f, xmm4, xmm5); XMM2, XMM3, XMM6, XMM7 are used as scratch.
; The four scalar results are stored at [%2+%3] .. [%2+12+%3].
movups arg3f,[%1+%3]    ; load the source vector
movaps xmm7,arg3f       ; keep a copy for the remaining three rows
mulps arg3f,arg1f       ; element-wise product with row 0
movshdup arg4f, arg3f   ; horizontal sum: duplicate the odd elements...
addps arg3f, arg4f      ; ...add to get pairwise sums...
movaps xmm6,xmm7        ; (copy of the vector for row 1)
movhlps arg4f, arg3f    ; ...bring the upper pair down...
addss arg3f, arg4f      ; ...low scalar now holds the dot product
movss [%2+%3], arg3f    ; store result 0
; the same multiply/horizontal-add pattern repeats for rows 1-3 below
mulps xmm6,arg2f
movshdup arg4f, xmm6
addps xmm6, arg4f
movaps arg3f,xmm7
movhlps arg4f, xmm6
addss xmm6, arg4f
movss [%2+4+%3], xmm6
mulps arg3f,xmm4
movshdup arg4f, arg3f
addps arg3f, arg4f
movaps xmm6,xmm7
movhlps arg4f, arg3f
addss arg3f, arg4f
movss [%2+8+%3], arg3f
mulps xmm6,xmm5
movshdup arg4f, xmm6
addps xmm6, arg4f
movhlps arg4f, xmm6
addss xmm6, arg4f
movss [%2+8+4+%3], xmm6
%endmacro
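If I read the macro right, it dots the vector at [%1+%3] with the four rows kept in XMM0, XMM1, XMM4 and XMM5, so a full 4x4 product could be driven roughly like this (matA, matB and matC are just placeholder labels for your own buffers, not part of the snippet above):

movups xmm0,[matA]      ; row 0 of the left-hand matrix
movups xmm1,[matA+16]   ; row 1
movups xmm4,[matA+32]   ; row 2
movups xmm5,[matA+48]   ; row 3
MULT matB,matC,0        ; first column vector
MULT matB,matC,16       ; second
MULT matB,matC,32       ; third
MULT matB,matC,48       ; fourth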
Are you reserving some bits at the top of each dword for the carry? Otherwise you're going to lose accuracy in the final answer, because the SIMD adds will just wrap around instead of carrying.
I was thinking of applying the excellent 5x52-bit representation used in libsecp256k1, where the 256-bit number is split across 5 quadwords that each hold 52 bits of the real number (except the most significant qword, which only holds the remaining 48 bits). There's also a 10x26-bit variant for 32-bit machines.
With that technique the 12 spare bits per limb let us defer the carry and add the numbers hundreds of times without corruption.
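For concreteness, the layout I have in mind is just five consecutive qwords per number (names are mine, and libsecp256k1's actual field element struct differs in detail):

section .bss
num_a: resq 5   ; value = limb0 + limb1*2^52 + limb2*2^104 + limb3*2^156 + limb4*2^208
num_b: resq 5   ; each limb normally holds <= 52 significant bits of its 64-bit qword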
We would only need 3 SSE adds, or 2 AVX adds, per addition in that case. But I have heard that using the AVX instructions incurs some kind of downclocking penalty, like AVX-512 does?
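A minimal sketch of the 3-add SSE2 version, reusing the arg1/arg2 defines from the snippet above and assuming the plain 5-qword layout I described (pointers to a in arg1 and b in arg2):

; load the five limbs of a
movdqu xmm0,[arg1]       ; limbs 0-1
movdqu xmm1,[arg1+16]    ; limbs 2-3
movq   xmm2,[arg1+32]    ; limb 4
; load the five limbs of b
movdqu xmm3,[arg2]
movdqu xmm4,[arg2+16]
movq   xmm5,[arg2+32]
; the three lazy adds: no carry propagation needed yet
paddq xmm0,xmm3
paddq xmm1,xmm4
paddq xmm2,xmm5
; clobber a with the sum
movdqu [arg1],xmm0
movdqu [arg1+16],xmm1
movq   [arg1+32],xmm2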
An alternative is just keeping the 5 quadwords of each operand in the R8-R15 registers plus RAX/RBX and then clobbering one of the operands with the sum.
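Something like this, with a's limbs in R8-R12 and b's in R13-R15/RAX/RBX (register assignment is just an example); thanks to the spare bits per limb there is no ADC chain to worry about:

add r8,r13     ; limb 0
add r9,r14     ; limb 1
add r10,r15    ; limb 2
add r11,rax    ; limb 3
add r12,rbx    ; limb 4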