The verification angle also has two other issues you aren't considering. One is consensus consistency: use of GMP in validation would make GMP's exact behavior consensus-critical, which is a problem because different systems run different code. (The other: GMP's license is more restrictive than the Bitcoin software's.)
I would never have thought consistency could be a problem with integer values until I learned it the hard way, losing a day to two multiplications at the end of field 10x26 while converting the whole thing to SSE2 instructions run on MMX registers for my 12-year-old Pentium-M laptop...
where R1 >> 4 = 64 and R0 >> 4 = 977... there was just no way to get c * 64 or c * 977 to compute like this: PMULUDQ with a register holding 64 or 977 as the source and a register holding c as the destination.
I still don't know why a left shift by 6 didn't behave the same as a multiplication by 64 - if overflow and wraparound aren't at play (or maybe they are, I don't know - but from what I see the bit counts should be 53 and 56, far from overflowing 64 bits).
Anyway, my mind is still perplexed about this, but I had to work around it until the tests stopped breaking.
For the *64 I used a left shift by 6.
For the *977 I used a monstrosity: c gets copied into several registers, each copy is shifted appropriately, and all the shifted copies are added together.
Anyway, despite this expensive workaround, native 64-bit arithmetic on MMX registers trounced the gcc/clang/icc-compiled versions for a -m32 build. Opcode size was also reduced (1900 vs 2800+ bytes for field_mul, 1300 vs 2200+ bytes for field_sqr).
If you don't mean using GMP but just using different operations for non-sidechannel-sensitive paths -- the library already does that in many places -- though not for FE mul/add. FE normalizes do take variable time in verification. If you have a speedup based on that, feel free to submit it!
Right now the only massive speedup I have achieved is in the 32-bit version, where compilers use 32-bit registers (instead of 64-bit MMX/SSE registers). For the 64-bit version, without AVX, I'm at ~10% gains in uncommented code (which is bad for review), of which 3-5% was a recent gain from reducing the clobber list on the asm and manually interleaving the initial pushes and final pops at convenient points (such as multiplication and addition stalls).
The faster 32-bit version is below. (I did add comments on what each line does, in case it's useful to others, but I wasn't considering it for "submission" because of the array use, which is not sidechannel-resistant (and I thought that was a requirement). Essentially, an array is used at the start to store the results of the multiplications and additions, which are non-linear and can therefore be computed up front.)
On my laptop (Pentium-M 2.13GHz) bench_verify (with endomorphism) is down to 350us from 570us.
On my desktop (Q8200 @1.86GHz) it's at 404us, down from 652us. It could probably drop a further 5-10% if code readability suffers a lot more (it's already suffering from some manual interleaving of memory operations with muls and adds that gained ~7-10%).
...
...
...
* On Core 2, it makes no difference whether XMM or MM registers are used (except a 2-3% speedup from dropping the final EMMS). On the Pentium-M, the XMM register operations are very slow to begin with - I suspect the 128-bit width is emulated, doubling the operations internally - while the MMX registers are mapped onto the 80-bit FPU registers, which have real width and proper speed.
SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint32_t *r, const uint32_t *a, const uint32_t * SECP256K1_RESTRICT b) {
/* uint64_t c, d;*/
uint64_t result[19]; /*temp storage array*/
uint32_t tempstor[2]; /*temp storage array*/
const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL /* ,R1 = 0x400UL */ ;
tempstor[0]=M;
tempstor[1]=R0;
/*tempstor[2] for R1 isn't needed. It's 1024, so shifting left by 10 instead*/
__asm__ __volatile__(
/* Part #1: The multiplications and additions of
*
 * d = (uint64_t)a[0] * b[9]
+ (uint64_t)a[1] * b[8]
+ (uint64_t)a[2] * b[7]
+ (uint64_t)a[3] * b[6]
+ (uint64_t)a[4] * b[5]
+ (uint64_t)a[5] * b[4]
+ (uint64_t)a[6] * b[3]
+ (uint64_t)a[7] * b[2]
+ (uint64_t)a[8] * b[1]
+ (uint64_t)a[9] * b[0]; */
"MOVD 0(%0), %%MM0\n" /*a0 */
"MOVD 36(%1), %%MM2\n" /*b9 */
"MOVD 4(%0), %%MM1\n" /* a1 */
"MOVD 32(%1), %%MM3\n" /* b8 */
"PMULUDQ %%MM0, %%MM2\n" /*a0 * b9*/
"PMULUDQ %%MM1, %%MM3\n" /*a1 * b8*/
"MOVD 8(%0), %%MM4\n" /*a2 */
"MOVD 28(%1), %%MM6\n" /*b7 */
"MOVD 12(%0), %%MM5\n" /* a3 */
"MOVD 24(%1), %%MM7\n" /* b6 */
"PMULUDQ %%MM4, %%MM6\n" /*a2 * b7*/
"PMULUDQ %%MM5, %%MM7\n" /*a3 * b6*/
"MOVD 16(%0), %%MM0\n" /*a4 */
"MOVD 20(%0), %%MM1\n" /* a5 */
"PADDQ %%MM2, %%MM3\n"
"PADDQ %%MM6, %%MM7\n"
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 20(%1), %%MM2\n" /*b5 */
"MOVD 16(%1), %%MM3\n" /* b4 */
"PMULUDQ %%MM0, %%MM2\n" /*a4 * b5*/
"PMULUDQ %%MM1, %%MM3\n" /*a5 * b4*/
"MOVD 24(%0), %%MM0\n" /*a6 */
"PADDQ %%MM2, %%MM3\n"
"MOVD 28(%0), %%MM1\n" /* a7 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 12(%1), %%MM2\n" /*b3 */
"MOVD 8(%1), %%MM3\n" /* b2 */
"PMULUDQ %%MM0, %%MM2\n" /*a6 * b3*/
"PMULUDQ %%MM1, %%MM3\n" /*a7 * b2*/
"PADDQ %%MM2, %%MM3\n"
"MOVD 32(%0), %%MM0\n" /*a8 */
"MOVD 36(%0), %%MM1\n" /* a9 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 4(%1), %%MM2\n" /*b1 */
"MOVD 0(%1), %%MM3\n" /* b0 */
"PMULUDQ %%MM0, %%MM2\n" /*a8 * b1*/
"PMULUDQ %%MM1, %%MM3\n" /*a9 * b0*/
"PADDQ %%MM2, %%MM3\n"
"MOVD 4(%1), %%MM2\n" /*b1 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 8(%1), %%MM3\n" /* b2 */
"MOVQ %%MM7, 0(%2)\n" /* extract result[0] */
/* Part #2: The multiplications and additions of
*
*
d += (uint64_t)a[1] * b[9]
+ (uint64_t)a[2] * b[8]
+ (uint64_t)a[3] * b[7]
+ (uint64_t)a[4] * b[6]
+ (uint64_t)a[5] * b[5]
+ (uint64_t)a[6] * b[4]
+ (uint64_t)a[7] * b[3]
+ (uint64_t)a[8] * b[2]
+ (uint64_t)a[9] * b[1]; */
"PMULUDQ %%MM1, %%MM2\n" /*a9 * b1*/
"PMULUDQ %%MM0, %%MM3\n" /*a8 * b2*/
"MOVD 28(%1), %%MM6\n" /*b7 */
"MOVD 32(%1), %%MM7\n" /* b8 */
"PMULUDQ %%MM5, %%MM6\n" /*a3 * b7*/
"PMULUDQ %%MM4, %%MM7\n" /*a2 * b8*/
"PADDQ %%MM2, %%MM3\n"
"MOVD 4(%0), %%MM0\n" /*a1 */
"MOVD 36(%1), %%MM2\n" /*b9 */
"PADDQ %%MM3, %%MM6\n"
"MOVD 16(%0), %%MM1\n" /* a4 */
"PADDQ %%MM6, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 24(%1), %%MM3\n" /* b6 */
"PMULUDQ %%MM0, %%MM2\n" /*a1 * b9*/
"PMULUDQ %%MM1, %%MM3\n" /*a4 * b6*/
"MOVD 28(%0), %%MM4\n" /* a7 */
"MOVD 12(%1), %%MM5\n" /* b3 */
"PADDQ %%MM2, %%MM7\n"
"MOVD 20(%0), %%MM0\n" /*a5 */
"MOVD 24(%0), %%MM1\n" /* a6 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 20(%1), %%MM2\n" /*b5 */
"MOVD 16(%1), %%MM3\n" /* b4 */
"PMULUDQ %%MM0, %%MM2\n" /*a5 * b5*/
"PMULUDQ %%MM1, %%MM3\n" /*a6 * b4*/
"PMULUDQ %%MM4, %%MM5\n" /*a7 * b3*/
"MOVD 20(%1), %%MM6\n" /* b5 */
"PADDQ %%MM2, %%MM7\n"
"MOVD 16(%0), %%MM2\n" /*a4 */
"PADDQ %%MM5, %%MM7\n"
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 24(%1), %%MM5\n" /*b6 */
"MOVQ %%MM7, 8(%2)\n" /* extract result[1] */
/* Part #3: The multiplications and additions of
*
* d += (uint64_t)a[2] * b[9]
+ (uint64_t)a[3] * b[8]
+ (uint64_t)a[4] * b[7]
+ (uint64_t)a[5] * b[6]
+ (uint64_t)a[6] * b[5]
+ (uint64_t)a[7] * b[4]
+ (uint64_t)a[8] * b[3]
+ (uint64_t)a[9] * b[2];*/
"PMULUDQ %%MM1, %%MM6\n" /*a6 * b5*/
"MOVD 16(%1), %%MM7\n" /*b4 */
"PMULUDQ %%MM0, %%MM5\n" /*a5 * b6*/
"PMULUDQ %%MM4, %%MM7\n" /*a7 * b4*/
"MOVD 36(%1), %%MM3\n" /*b9 */
"MOVD 12(%0), %%MM1\n" /* a3 */
"PADDQ %%MM6, %%MM5\n"
"MOVD 32(%1), %%MM4\n" /* b8 */
"PADDQ %%MM5, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 8(%0), %%MM0\n" /* a2 */
"MOVD 28(%1), %%MM5\n" /*b7 */
"PMULUDQ %%MM1, %%MM4\n" /*a3 * b8*/
"PMULUDQ %%MM2, %%MM5\n" /*a4 * b7*/
"PMULUDQ %%MM0, %%MM3\n" /*a2 * b9*/
"MOVD 12(%1), %%MM6\n" /*b3 */
"PADDQ %%MM4, %%MM7\n"
"PADDQ %%MM5, %%MM7\n"
"MOVD 36(%0), %%MM4\n" /* a9 */
"MOVD 32(%0), %%MM5\n" /*a8 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 8(%1), %%MM3\n" /*b2 */
"PMULUDQ %%MM5, %%MM6\n" /*a8 * b3*/
"PMULUDQ %%MM4, %%MM3\n" /*a9 * b2 - (order is b2 * a9) */
"MOVD 12(%1), %%MM0\n" /*b3 */
"PADDQ %%MM6, %%MM7\n"
"MOVD 32(%1), %%MM6\n" /*b8 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 16(%1), %%MM3\n" /*b4 */
"MOVQ %%MM7, 16(%2)\n" /* extract result[2] */
/* Part #4: The multiplications and additions of
*
*
* d += (uint64_t)a[3] * b[9]
+ (uint64_t)a[4] * b[8]
+ (uint64_t)a[5] * b[7]
+ (uint64_t)a[6] * b[6]
+ (uint64_t)a[7] * b[5]
+ (uint64_t)a[8] * b[4]
+ (uint64_t)a[9] * b[3]; */
"PMULUDQ %%MM4, %%MM0\n" /*a9 * b3*/
"MOVD 36(%1), %%MM7\n" /* b9 */
"PMULUDQ %%MM5, %%MM3\n" /*a8 * b4*/
"PMULUDQ %%MM2, %%MM6\n" /*a4 * b8*/
"PMULUDQ %%MM1, %%MM7\n" /*a3 * b9*/
"PADDQ %%MM0, %%MM3\n"
"MOVD 24(%1), %%MM4\n" /*b6 */
"MOVD 20(%1), %%MM5\n" /* b5 */
"PADDQ %%MM3, %%MM6\n"
"MOVD 20(%0), %%MM0\n" /*a5 */
"MOVD 28(%0), %%MM2\n" /*a7 */
"PADDQ %%MM6, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 24(%0), %%MM6\n" /*a6 */
"MOVD 28(%1), %%MM3\n" /*b7 */
"PMULUDQ %%MM2, %%MM5\n" /*a7 * b5*/
"PMULUDQ %%MM6, %%MM4\n" /*a6 * b6 */
"PMULUDQ %%MM0, %%MM3\n" /*a5 * b7*/
"PADDQ %%MM5, %%MM7\n"
"MOVD 16(%0), %%MM1\n" /*a4 */
"PADDQ %%MM4, %%MM7\n"
"MOVD 32(%1), %%MM5\n" /*b8 */
"PADDQ %%MM3, %%MM7\n"
"MOVD 28(%1), %%MM4\n" /*b7 */
"MOVQ %%MM7, 24(%2)\n" /* extract result[3] */
/* Part #5: The multiplications and additions of
*
d += (uint64_t)a[4] * b[9]
+ (uint64_t)a[5] * b[8]
+ (uint64_t)a[6] * b[7]
+ (uint64_t)a[7] * b[6]
+ (uint64_t)a[8] * b[5]
+ (uint64_t)a[9] * b[4]; */
"PMULUDQ %%MM6, %%MM4\n" /*a6 * b7 */
"MOVD 24(%1), %%MM3\n" /* b6 */
"MOVD 36(%1), %%MM7\n" /*b9 */
"PMULUDQ %%MM0, %%MM5\n" /*a5 * b8*/
"PMULUDQ %%MM2, %%MM3\n" /*a7 * b6*/
"PMULUDQ %%MM1, %%MM7\n" /*a4 * b9*/
"PADDQ %%MM4, %%MM5\n"
"MOVD 36(%0), %%MM4\n" /*a9 */
"PADDQ %%MM5, %%MM3\n"
"MOVD 20(%1), %%MM5\n" /*b5 */
"PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
"MOVD 32(%0), %%MM3\n" /* a8 */
"MOVD 16(%1), %%MM1\n" /*b4 */
"PMULUDQ %%MM3, %%MM5\n" /*a8 * b5 */
"PMULUDQ %%MM4, %%MM1\n" /*a9 * b4 */
"PADDQ %%MM5, %%MM7\n"
"MOVD 28(%1), %%MM3\n" /*b7 */
"MOVD 32(%1), %%MM5\n" /*b8 */
"PADDQ %%MM1, %%MM7\n"
"MOVQ %%MM7, 32(%2)\n" /* extract result[4] */
/* Part #6: The multiplications and additions of
*
*
* d += (uint64_t)a[5] * b[9]
+ (uint64_t)a[6] * b[8]
+ (uint64_t)a[7] * b[7]
+ (uint64_t)a[8] * b[6]
+ (uint64_t)a[9] * b[5];
*/
"PMULUDQ %%MM2, %%MM3\n" /*a7 * b7*/
"MOVD 20(%1), %%MM1\n" /* b5 */
"MOVD 36(%1), %%MM7\n" /*b9 */
"PMULUDQ %%MM6, %%MM5\n" /*a6 * b8*/
"PMULUDQ %%MM4, %%MM1\n" /*a9 * b5 */
"PMULUDQ %%MM0, %%MM7\n" /*a5 * b9*/
"PADDQ %%MM3, %%MM5\n"
"MOVD 24(%1), %%MM3\n" /*b6 */
"PADDQ %%MM1, %%MM5\n"
"MOVD 32(%0), %%MM1\n" /* a8 */
"PADDQ %%MM5, %%MM7\n" /*keeping result additions in mm7*/
"PMULUDQ %%MM1, %%MM3\n" /*a8 * b6 */
"MOVD 24(%1), %%MM0\n" /* b6 */
"MOVD 32(%1), %%MM5\n" /*b8 */
"PADDQ %%MM3, %%MM7\n"
"MOVQ %%MM7, 40(%2)\n" /* extract result[5] */
/* Part #7: The multiplications and additions of
*
*
* d += (uint64_t)a[6] * b[9]
+ (uint64_t)a[7] * b[8]
+ (uint64_t)a[8] * b[7]
+ (uint64_t)a[9] * b[6]; */
"PMULUDQ %%MM4, %%MM0\n" /*a9 * b6 */
"MOVD 28(%1), %%MM3\n" /*b7 */
"MOVD 36(%1), %%MM7\n" /*b9 */
"PMULUDQ %%MM2, %%MM5\n" /*a7 * b8*/
"PMULUDQ %%MM1, %%MM3\n" /*a8 * b7*/
"PMULUDQ %%MM6, %%MM7\n" /*a6 * b9*/
"PADDQ %%MM0, %%MM5\n"
"PADDQ %%MM3, %%MM7\n"
"MOVD 24(%1), %%MM0\n" /* b6 */
"PADDQ %%MM5, %%MM7\n" /*adding results to mm7 */
"MOVQ %%MM7, 48(%2)\n" /* extract result[6] */
/* Part #8: The multiplications and additions of 3 separate results
*
*
d += (uint64_t)a[7] * b[9]
+ (uint64_t)a[8] * b[8]
+ (uint64_t)a[9] * b[7]; result 7
d += (uint64_t)a[8] * b[9]
+ (uint64_t)a[9] * b[8]; result 8
d += (uint64_t)a[9] * b[9]; result 9 */
"MOVD 28(%1), %%MM3\n" /*b7 */
"MOVD 32(%1), %%MM5\n" /*b8 */
"MOVD 36(%1), %%MM7\n" /*b9 */
"PMULUDQ %%MM4, %%MM3\n" /*a9 * b7 */
"PMULUDQ %%MM1, %%MM5\n" /*a8 * b8*/
"MOVQ %%MM7, %%MM6\n" /*b9 */
"PMULUDQ %%MM2, %%MM7\n" /*a7 * b9*/
"PADDQ %%MM3, %%MM5\n"
"PADDQ %%MM5, %%MM7\n"
"MOVQ %%MM6, %%MM3\n" /*b9 */
"MOVD 32(%1), %%MM5\n" /*b8 */
"MOVQ %%MM7, 56(%2)\n" /* extract result[7] */
"PMULUDQ %%MM1, %%MM6\n" /*a8 * b9*/
"PMULUDQ %%MM4, %%MM5\n" /*a9 * b8*/
"PMULUDQ %%MM4, %%MM3\n" /*a9 * b9*/
"MOVD 8(%0), %%MM7\n" /*a2*/
"PADDQ %%MM5, %%MM6\n"
"MOVQ %%MM3, 72(%2)\n" /* extract result[9] */
"MOVQ %%MM6, 64(%2)\n" /* extract result[8] */
/* Part #9: The multiplications and additions of
*
* c += (uint64_t)a[0] * b[8]
+ (uint64_t)a[1] * b[7]
+ (uint64_t)a[2] * b[6]
+ (uint64_t)a[3] * b[5]
+ (uint64_t)a[4] * b[4]
+ (uint64_t)a[5] * b[3]
+ (uint64_t)a[6] * b[2]
+ (uint64_t)a[7] * b[1]
+ (uint64_t)a[8] * b[0]; */
"PMULUDQ %%MM7, %%MM0\n" /*a2 * b6 */
"MOVD 0(%1), %%MM3\n" /* b0 */
"PMULUDQ %%MM1, %%MM3\n" /*a8 * b0*/
"MOVD 4(%1), %%MM7\n" /* b1 */
"PMULUDQ %%MM2, %%MM7\n" /*a7 * b1*/
"MOVD 4(%0), %%MM1\n" /*a1*/
"PADDQ %%MM0, %%MM3\n"
"MOVD 20(%1), %%MM4\n" /*b5*/
"PADDQ %%MM3, %%MM7\n"
"MOVD 12(%0), %%MM2\n" /*a3*/
"MOVD 28(%1), %%MM5\n" /*b7*/
"PMULUDQ %%MM2, %%MM4\n" /*a3 * b5*/
"MOVD 16(%0), %%MM3\n" /*a4*/
"MOVD 16(%1), %%MM6\n" /*b4*/
"PMULUDQ %%MM1, %%MM5\n" /*a1 * b7*/
"MOVD 0(%0), %%MM0\n" /*a0*/
"PMULUDQ %%MM6, %%MM3\n" /*b4 * a4*/
"MOVD 32(%1), %%MM6\n" /*b8*/
"PMULUDQ %%MM0, %%MM6\n" /*a0 * b8*/
"PADDQ %%MM4, %%MM5\n"
"MOVD 24(%0), %%MM4\n" /*a6*/
"PADDQ %%MM6, %%MM3\n"
"PADDQ %%MM5, %%MM7\n"
"MOVD 8(%1), %%MM6\n" /*b2*/
"PADDQ %%MM3, %%MM7\n"
"MOVD 12(%1), %%MM5\n" /*b3*/
"MOVD 20(%0), %%MM3\n" /*a5*/
"PMULUDQ %%MM4, %%MM6\n" /*a6 * b2*/
"PMULUDQ %%MM3, %%MM5\n" /*a5 * b3*/
"PADDQ %%MM6, %%MM7\n"
"MOVD 0(%1), %%MM4\n" /*b0*/
"PADDQ %%MM5, %%MM7\n" /*addition results on mm7*/
"MOVD 8(%0), %%MM3\n" /*a2*/
"MOVD 4(%1), %%MM5\n" /*b1*/
"MOVQ %%MM7, 80(%2)\n" /* extract result[10] */
/* Part #10: The multiplications and additions of
*
* c += (uint64_t)a[0] * b[3]
+ (uint64_t)a[1] * b[2]
+ (uint64_t)a[2] * b[1]
+ (uint64_t)a[3] * b[0]; result11
c += (uint64_t)a[0] * b[1]
+ (uint64_t)a[1] * b[0]; result12
c = (uint64_t)a[0] * b[0]; result13
c += (uint64_t)a[0] * b[2]
+ (uint64_t)a[1] * b[1]
+ (uint64_t)a[2] * b[0]; result14 */
"PMULUDQ %%MM4, %%MM2\n" /*b0 * a3*/
"PMULUDQ %%MM5, %%MM3\n" /*b1 * a2*/
"MOVD 8(%1), %%MM7\n" /*b2*/
"MOVD 12(%1), %%MM6\n" /*b3*/
"PMULUDQ %%MM7, %%MM1\n" /*b2 * a1*/
"PMULUDQ %%MM6, %%MM0\n" /*b3 * a0*/
"PADDQ %%MM2, %%MM3\n"
"MOVD 8(%1), %%MM2\n" /*b2*/
"PADDQ %%MM1, %%MM0\n"
"MOVD 4(%1), %%MM1\n" /*b1*/
"PADDQ %%MM0, %%MM3\n"
"MOVD 0(%1), %%MM0\n" /*b0*/
"MOVQ %%MM3, 88(%2)\n" /* extract result[11] */
"MOVD 0(%0), %%MM4\n" /*a0*/
"MOVQ %%MM0, %%MM3\n" /*b0*/
"MOVD 4(%0), %%MM5\n" /*a1*/
"MOVD 8(%0), %%MM7\n" /*a2*/
"MOVQ %%MM1, %%MM6\n" /*b1*/
"PMULUDQ %%MM5, %%MM3\n" /*a1 * b0*/
"PMULUDQ %%MM4, %%MM6\n" /*a0 * b1*/
"PADDQ %%MM3, %%MM6\n"
"MOVQ %%MM0, %%MM3\n" /*b0*/
"MOVQ %%MM6, 96(%2)\n" /* extract result[12] */
"PMULUDQ %%MM4, %%MM3\n" /*a0 * b0*/
"PMULUDQ %%MM4, %%MM2\n" /*a0 * b2*/
"MOVQ %%MM3, 104(%2)\n" /* extract result[13] */
"MOVQ %%MM1, %%MM6\n" /*b1*/
"PMULUDQ %%MM5, %%MM6\n" /*a1 * b1*/
"MOVQ %%MM0, %%MM3\n" /*b0*/
"PMULUDQ %%MM7, %%MM3\n" /*a2 * b0*/
"PADDQ %%MM2, %%MM6\n"
"PADDQ %%MM6, %%MM3\n"
"MOVD 16(%1), %%MM2\n" /*b4*/
"MOVQ %%MM3, 112(%2)\n" /* extract result[14] */
/* Part #11: The multiplications and additions of
*
*
c += (uint64_t)a[0] * b[4]
+ (uint64_t)a[1] * b[3]
+ (uint64_t)a[2] * b[2]
+ (uint64_t)a[3] * b[1]
+ (uint64_t)a[4] * b[0] */
"PMULUDQ %%MM4, %%MM2\n" /*a0 * b4 */
"MOVD 16(%0), %%MM3\n" /*a4*/
"MOVD 12(%0), %%MM6\n" /* a3*/
"PMULUDQ %%MM0, %%MM3\n" /*b0 * a4 */
"PMULUDQ %%MM1, %%MM6\n" /*b1 * a3 */
"PADDQ %%MM2, %%MM3\n"
"MOVD 12(%1), %%MM2\n" /*b3*/
"PADDQ %%MM3, %%MM6\n"
"MOVD 8(%1), %%MM3\n" /*b2*/
"PMULUDQ %%MM5, %%MM2\n" /*a1 * b3 */
"PMULUDQ %%MM7, %%MM3\n" /*a2 * b2 */
"PADDQ %%MM2, %%MM6\n"
"MOVD 20(%1), %%MM2\n" /*b5*/
"PADDQ %%MM3, %%MM6\n"
"MOVD 20(%0), %%MM3\n" /*a5*/
"MOVQ %%MM6, 120(%2)\n" /* extract result[15] */
/* Part #12: The multiplications and additions of
*
* c += (uint64_t)a[0] * b[5]
+ (uint64_t)a[1] * b[4]
+ (uint64_t)a[2] * b[3]
+ (uint64_t)a[3] * b[2]
+ (uint64_t)a[4] * b[1]
+ (uint64_t)a[5] * b[0] */
"PMULUDQ %%MM4, %%MM2\n" /*a0 * b5 */
"MOVD 16(%0), %%MM6\n" /* a4*/
"PMULUDQ %%MM0, %%MM3\n" /*b0 * a5 */
"PMULUDQ %%MM1, %%MM6\n" /*b1 * a4*/
"PADDQ %%MM2, %%MM3\n"
"MOVD 16(%1), %%MM2\n" /*b4*/
"PADDQ %%MM3, %%MM6\n" /*adding results to mm6*/
"MOVD 12(%1), %%MM3\n" /*b3*/
"PMULUDQ %%MM5, %%MM2\n" /*a1 * b4 */
"PMULUDQ %%MM7, %%MM3\n" /*a2 * b3 */
"MOVD 8(%1), %%MM0\n" /*b2*/
"PADDQ %%MM2, %%MM6\n"
"MOVD 12(%0), %%MM2\n" /* a3*/
"PADDQ %%MM3, %%MM6\n"
"PMULUDQ %%MM0, %%MM2\n" /*b2 * a3 */
"MOVD 24(%0), %%MM3\n" /*a6*/
"MOVD 0(%1), %%MM0\n" /*b0*/
"PADDQ %%MM2, %%MM6\n" /* all additions end up in mm6*/
"MOVD 24(%1), %%MM2\n" /*b6*/
"MOVQ %%MM6, 128(%2)\n" /* extract result[16] */
/* Part #13: The multiplications and additions of
*
* c += (uint64_t)a[0] * b[6]
+ (uint64_t)a[1] * b[5]
+ (uint64_t)a[2] * b[4]
+ (uint64_t)a[3] * b[3]
+ (uint64_t)a[4] * b[2]
+ (uint64_t)a[5] * b[1]
+ (uint64_t)a[6] * b[0]; */
"PMULUDQ %%MM0, %%MM3\n" /*a6 * b0 */
"MOVD 20(%0), %%MM6\n" /* a5*/
"PMULUDQ %%MM4, %%MM2\n" /*b6 * a0 */
"PMULUDQ %%MM1, %%MM6\n" /*a5 * b1*/
"PADDQ %%MM2, %%MM3\n"
"PADDQ %%MM3, %%MM6\n" /*adding all results on mm6*/
"MOVD 20(%1), %%MM2\n" /*b5*/
"MOVD 16(%1), %%MM3\n" /*b4*/
"PMULUDQ %%MM5, %%MM2\n" /*a1 * b5*/
"PMULUDQ %%MM7, %%MM3\n" /*a2 * b4*/
"MOVD 8(%1), %%MM4\n" /*b2*/
"MOVD 12(%1), %%MM1\n" /*b3*/
"PADDQ %%MM2, %%MM6\n"
"MOVD 12(%0), %%MM2\n" /*a3*/
"PADDQ %%MM3, %%MM6\n"
"MOVD 16(%0), %%MM3\n" /*a4*/
"PMULUDQ %%MM1, %%MM2\n" /*b3 * a3 */
"PMULUDQ %%MM4, %%MM3\n" /*b2 * a4 */
"PADDQ %%MM2, %%MM6\n"
"MOVD 4(%1), %%MM1\n" /*b1*/
"MOVD 24(%0), %%MM2\n" /*a6*/
"PADDQ %%MM3, %%MM6\n"
"MOVD 0(%0), %%MM4\n" /*a0*/
"MOVQ %%MM6, 136(%2)\n" /* extract result[17] */
/* Part #14: The multiplications and additions of
*
* c += (uint64_t)a[0] * b[7]
+ (uint64_t)a[1] * b[6]
+ (uint64_t)a[2] * b[5]
+ (uint64_t)a[3] * b[4]
+ (uint64_t)a[4] * b[3]
+ (uint64_t)a[5] * b[2]
+ (uint64_t)a[6] * b[1]
+ (uint64_t)a[7] * b[0]; */
"PMULUDQ %%MM2, %%MM1\n" /*a6 * b1 */
"MOVD 28(%0), %%MM6\n" /*a7*/
"MOVD 28(%1), %%MM3\n" /*b7*/
"PMULUDQ %%MM6, %%MM0\n" /*a7 * b0 */
"PMULUDQ %%MM3, %%MM4\n" /*b7 * a0*/
"MOVD 24(%1), %%MM6\n" /*b6*/
"MOVD 20(%1), %%MM2\n" /*b5*/
"PMULUDQ %%MM6, %%MM5\n" /*b6 * a1 */
"PMULUDQ %%MM2, %%MM7\n" /*b5 * a2 */
"PADDQ %%MM0, %%MM1\n"
"MOVQ 8(%2), %%MM3\n"/*prefetch result1*/
"PADDQ %%MM4, %%MM5\n"
"MOVD 12(%0), %%MM0\n" /*a3*/
"MOVD 8(%1), %%MM6\n" /*b2*/
"PADDQ %%MM1, %%MM5\n"
"MOVD 20(%0), %%MM2\n" /*a5*/
"MOVD 12(%1), %%MM4\n" /*b3*/
"PADDQ %%MM5, %%MM7\n"
"PMULUDQ %%MM2, %%MM6\n" /*a5 * b2 */
"MOVD 16(%0), %%MM1\n" /*a4*/
"MOVD 16(%1), %%MM5\n" /*b4*/
"PMULUDQ %%MM1, %%MM4\n" /*a4 * b3 */
"PMULUDQ %%MM0, %%MM5\n" /*a3 * b4*/
"PADDQ %%MM6, %%MM4\n"
"MOVD 0(%4), %%MM2\n" /*prefetch M to MM2 */
"PADDQ %%MM4, %%MM5\n"
"PADDQ %%MM7, %%MM5\n"
"MOVQ 0(%2), %%MM6\n" /* prefetch d in from result[0]*/
"MOVQ %%MM2, %%MM0\n" /*M secondary storage */
"MOVQ %%MM5, 144(%2)\n" /* extract result[18] */
"MOVQ 104(%2), %%MM7\n" /* c in from result[13] */
"PAND %%MM6, %%MM2\n" /* r[9] = d & M; */
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"MOVD 4(%4), %%MM4\n" /*R0 to MM4 */
"MOVD %%MM2, 36(%3)\n" /* extract r[9] = d & M; */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[1] */
"MOVQ %%MM0, %%MM2\n" /* M back to mm2 */
"MOVQ 16(%2), %%MM3\n" /*prefetch result2*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[2] */
"MOVQ 96(%2), %%MM5\n" /* prefetch result12 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 0(%3)\n" /* exporting t0/r[0] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[12] */
"MOVQ 24(%2), %%MM3\n"/*prefetch result3*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[3] */
"MOVQ 112(%2), %%MM5\n" /* prefetch result14 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 4(%3)\n" /* exporting t1/r[1] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[14] */
"MOVQ 32(%2), %%MM3\n"/*prefetch result4*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[4] */
"MOVQ 88(%2), %%MM5\n" /* prefetch result11 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 8(%3)\n" /* exporting t2/r[2] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[11] */
"MOVQ 40(%2), %%MM3\n"/*prefetch result5*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[5] */
"MOVQ 120(%2), %%MM5\n" /* prefetch result15 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 12(%3)\n" /* exporting t3/r[3] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[15] */
"MOVQ 48(%2), %%MM3\n"/*prefetch result6*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[6] */
"MOVQ 128(%2), %%MM5\n" /* prefetch result16 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 16(%3)\n" /* exporting t4/r[4] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[16] */
"MOVQ 56(%2), %%MM3\n"/*prefetch result7*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[7] */
"MOVQ 136(%2), %%MM5\n" /* prefetch result17 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 20(%3)\n" /* exporting t5/r[5] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[17] */
"MOVQ 64(%2), %%MM3\n"/*prefetch result8*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[8] */
"MOVQ 144(%2), %%MM5\n" /* prefetch result18 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 24(%3)\n" /* exporting t6/r[6] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[18] */
"MOVQ 72(%2), %%MM3\n"/*prefetch result9*/
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM3, %%MM6\n" /* d += (uint64_t)result[9] */
"MOVQ 80(%2), %%MM5\n" /* prefetch result10 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 28(%3)\n" /* exporting t7/r[7] = c & M */
"MOVQ %%MM0, %%MM2\n" /* M */
"PADDQ %%MM5, %%MM7\n" /* c += (uint64_t)result[10] */
"PAND %%MM6, %%MM2\n" /* u0 = d & M; */
"MOVQ %%MM2, %%MM1\n" /*u0 out to temp mm1*/
"PSRLQ $26, %%MM6\n" /* d >>= 26; */
"PSLLQ $10, %%MM1\n" /* R1 * u0 - since R1 equals 1024, it's a shift left by 10 for u0*/
"PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
"PADDQ %%MM2, %%MM7\n" /* c = (result from R0*u0) + c */
"MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
"PAND %%MM0, %%MM3\n" /*c & M*/
"PSRLQ $26, %%MM7\n" /* c >>= 26; */
"MOVQ %%MM0, %%MM2\n" /*cloning M to mm2*/
"PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
"MOVD %%MM3, 32(%3)\n" /* exporting t8/r[8] = c & M */
"PMULUDQ %%MM6, %%MM4\n" /* d * R0 */
"PSRLQ $4, %%MM0\n" /* M >>= 4 */
"MOVD 36(%3), %%MM1\n" /* R[9] in */
"PADDQ %%MM4, %%MM7\n" /* c+= d * R0 */
"PADDQ %%MM1, %%MM7\n" /* c+= r[9] ===> c+= d * R0 + r[9]*/
"PAND %%MM7, %%MM0\n" /* c & (M >> 4) */
"MOVD %%MM0, 36(%3)\n" /* r[9] = c & (M >> 4) */
"PSRLQ $22, %%MM7\n" /* c >>= 22 */
"PSLLQ $14, %%MM6\n" /* d * (R1 << 4). Since (R1 << 4) equals 16384, it's essentially a left shift by 14 */
"PADDQ %%MM6, %%MM7\n" /* c += d * (R1 << 4) */
"MOVQ %%MM7, %%MM3\n" /*cloning c*/
"PSLLQ $6, %%MM7\n" /* result of c * (R1 >> 4) which equals c shifted left by 6, since (R1 >> 4) = 64 */
"MOVQ %%MM3, %%MM0\n" /*this is a manual attempt at multiplying c by 0x3D1 (977 decimal), by shifting and adding copies of c ...*/
"MOVQ %%MM3, %%MM1\n" /*all this segment is, in reality, just a single (c * 977) multiplication */
"MOVQ %%MM3, %%MM6\n" /* which for some reason doesn't want to work as a plain PMULUDQ with a 977 constant (likely because PMULUDQ only multiplies the low 32 bits of c) */
"MOVQ %%MM3, %%MM4\n"
"MOVQ %%MM3, %%MM5\n"
"PSLLQ $9, %%MM0\n" /* x512 */
"PSLLQ $8, %%MM1\n" /* x256 */
"PSLLQ $7, %%MM6\n" /* x128 */
"PSLLQ $6, %%MM4\n" /* x64 */
"PSLLQ $4, %%MM5\n" /* x16 */ /*512+256+128+64+16 = 976x, so +1 add of c on top = 977 or 0x3D1 */
"PADDQ %%MM3, %%MM0\n"
"PADDQ %%MM1, %%MM6\n"
"MOVD 0(%3), %%MM3\n" /*prefetch r[0] to MM3 */
"PADDQ %%MM4, %%MM0\n"
"PADDQ %%MM6, %%MM0\n"
"PADDQ %%MM0, %%MM5\n" /* result of c * (R0 >> 4) */
"PADDQ %%MM3, %%MM5\n" /* d = r[0] + c (R0 >> 4) */
"MOVD 4(%3), %%MM4\n" /*r[1] to MM4 */
"MOVD 8(%3), %%MM0\n" /*r[2] to MM0 */
"MOVQ %%MM5, %%MM3\n" /*cloning d */
"PAND %%MM2, %%MM5\n" /*d&M*/
"PSRLQ $26, %%MM3\n" /* d >>= 26 */
"PADDQ %%MM7, %%MM4\n" /* c * (R1 >> 4) + r[1] */
"PADDQ %%MM4, %%MM3\n" /*d += c * (R1 >> 4) + r[1]; */
"MOVD %%MM5, 0(%3)\n" /* export d to r[0] */
"MOVQ %%MM3, %%MM7\n" /*cloning d */
"PAND %%MM2, %%MM7\n" /*d&M*/
"PSRLQ $26, %%MM3\n" /* d >>= 26 */
"PADDQ %%MM0, %%MM3\n" /*d += r[2];*/
"MOVD %%MM7, 4(%3)\n" /*r[1] = d & M; */
"MOVD %%MM3, 8(%3)\n" /*r[2]=d;*/
"EMMS\n"
:
: "q"(a), "q"(b), "q"(result), "q"(r), "S"(tempstor)
: "memory", "%mm0", "%mm1", "%mm2", "%mm3", "%mm4", "%mm5", "%mm6", "%mm7"
);
}