Author

Topic: How to use properly secp256k1 library (Read 2497 times)

legendary
Activity: 1932
Merit: 2077
December 09, 2017, 05:09:17 PM
#18
The overcomplete representation reduces the number of carries needed between limbs, so it makes the operations faster.  The fe_add should already be more or less optimal when written in C-- it's a trivial function in part because it doesn't need to handle any carrying at all.

It's possible that some other representation might be faster yet, esp if using SIMD.

In the case of fe_add, I understand that with the overcomplete representation there are no carries (just one extra limb addition, since there are 5 limbs instead of 4).
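As a minimal sketch (hypothetical limb code with my own naming, not the library's), the no-carry property looks like this: each 52-bit limb sits in a uint64_t with about 12 bits of headroom, so a handful of element-wise additions cannot overflow 64 bits and no normalization is needed inside the add itself.

```c
#include <stdint.h>

/* Hypothetical 5x52 field element: limbs held in uint64_t with ~12 bits
   of headroom each, so addition is five independent adds, no carries. */
typedef struct { uint64_t n[5]; } fe52;

static void fe52_add(fe52 *r, const fe52 *a) {
    r->n[0] += a->n[0];
    r->n[1] += a->n[1];
    r->n[2] += a->n[2];
    r->n[3] += a->n[3];
    r->n[4] += a->n[4];
}
```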

What about fe_mul? A 64x64-bit product is 128 bits, and a*b + c + d always fits below 2^128 already.
With the 5x52 representation, a 52x52-bit product is 104 bits (still two uint64s); is the only advantage that a longer chain like a*b + c + d + f stays below 2^128? But then there are 5x5 = 25 multiplications against 4x4 = 16.

Besides, there is the reduction mod p, and since p is almost equal to 2^256 it seems to me that it should be faster with the 4x64-bit representation: it is enough to multiply the high 4 limbs of the product by 0x1000003D1 and add the result to the low 4 limbs.
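A rough sketch of that folding step (hypothetical code, not any library's implementation), assuming an 8-limb double-wide product t[] and using the identity 2^256 ≡ 0x1000003D1 (mod p), since p = 2^256 - 2^32 - 977. A full reduction would need one more small fold of the leftover carry:

```c
#include <stdint.h>

/* Fold the high 4 limbs of an 8x64 product into the low 4 using
   2^256 ≡ 0x1000003D1 (mod p). r[4] receives a small leftover that
   one further fold would absorb to finish the reduction. */
static void fold_high(uint64_t r[5], const uint64_t t[8]) {
    const uint64_t R = 0x1000003D1ULL;   /* 2^32 + 977 */
    unsigned __int128 c = 0;
    int i;
    for (i = 0; i < 4; i++) {
        c += (unsigned __int128)t[i] + (unsigned __int128)t[i + 4] * R;
        r[i] = (uint64_t)c;
        c >>= 64;
    }
    r[4] = (uint64_t)c;
}
```

For example, folding the value 2^256 (t[4] = 1, all other limbs zero) should leave exactly 0x1000003D1 in the low limbs.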
staff
Activity: 4284
Merit: 8808
December 09, 2017, 04:13:11 PM
#17
Reviewing libsecp256k1, I now understand the code better. I noticed that the scalar uses a 4x64-bit representation, while the field element uses only 5x52 (or 10x26). Is there a safety reason, or do these representations actually make the operations faster? Why is there an asm version only of fe_mul and not of fe_add too?

The overcomplete representation reduces the number of carries needed between limbs, so it makes the operations faster.  The fe_add should already be more or less optimal when written in C-- it's a trivial function in part because it doesn't need to handle any carrying at all.

It's possible that some other representation might be faster yet, esp if using SIMD.
legendary
Activity: 1708
Merit: 1049
December 09, 2017, 03:55:33 PM
#16
Why is there an asm version only of fe_mul and not of fe_add too?

From what I've seen in profiling, fe_add is rarely the bottleneck, so that's probably the reason; it's too small. It should also be able to compile, depending on the available hardware, into SSE/AVX code that does multiple adds per instruction.

Quote
I'd also like to know whether it is possible to exploit AVX instructions to speed up operations with the 64-bit representation.

The main issue with avx is the lack of 128 bit multiplication and 128 bit addition.

That's not a problem for the 32-bit field, though, because 64-bit muls/adds exist (which can be handled). In theory you could compute up to 8/16 32-bit field results in parallel on the same thread with AVX2/AVX-512 and go much faster than the 64-bit code, but that requires a rewrite of all the peripheral code: you need to feed it multiple "a" and "b" inputs to get multiple "r" outputs, which of course are then consumed by other functions beyond mul/sqr, and those would also need to work in a stream-like fashion on multiple inputs/outputs instead of one.

(My focus has been on using AVX2/AVX-512 on a single 32-bit field result (one a, one b, one r) to make it faster, which avoids the rewrite; but since the code after the multiplications and additions is more sequential, it fails to produce much of a speedup. Proper AVX2/AVX-512 speedup depends on using multiple "streams" of inputs and outputs.)
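To make the "streams" idea concrete, here is a hypothetical limb-major layout (my sketch, not the library's code) where one vector add covers the same limb of several independent elements; this data-layout change is exactly the kind of rewrite the peripheral code would need:

```c
#include <stdint.h>

#define LANES 4

/* Limb-major ("stream") layout: limb i of all LANES field elements is
   contiguous, so the inner loop maps directly onto one vector add that
   a compiler can emit as SSE/AVX code. */
typedef struct { uint32_t limb[10][LANES]; } fe_stream;

static void fe_stream_add(fe_stream *r, const fe_stream *a) {
    int i, l;
    for (i = 0; i < 10; i++)
        for (l = 0; l < LANES; l++)
            r->limb[i][l] += a->limb[i][l];
}
```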
legendary
Activity: 1932
Merit: 2077
December 09, 2017, 02:47:40 PM
#15
After some months of work, I eventually wrote my own library implementing some ECC functions. I no longer use GMP.

My purpose is only to generate public keys from private keys as fast as possible; therefore I use consecutive private keys.

About the field operations, I wrote my own versions of fe_mul and fe_sqr. Each field element is simply an array of 4 uint64. The performance is pretty good: I don't have a benchmark as precise as AlexGR's, but with my mobile Intel Xeon E3-1505M v6 Kaby Lake I get (using only 1 core) about 16.5 M public keys / s.

Reviewing libsecp256k1, I now understand the code better. I noticed that the scalar uses a 4x64-bit representation, while the field element uses only 5x52 (or 10x26). Is there a safety reason, or do these representations actually make the operations faster? Why is there an asm version only of fe_mul and not of fe_add too?

I'd also like to know whether it is possible to exploit AVX instructions to speed up operations with the 64-bit representation.
legendary
Activity: 1708
Merit: 1049
August 07, 2017, 03:27:35 PM
#14
I may try this again in the future with a packed version on ymm/zmm registers. In a packed manner it might even trounce the 5x52 field. We'll see how it goes...

I implemented fe_mul_inner for 10x26 on AVX-512 to check it out. The machine used was a Google Cloud Skylake Xeon @ 2 GHz.

Baseline performance for fe_mul in ./bench_internal is around

~30us for the asm shipped (5x52)
~27us with no asm used (5x52) and the compiler (gcc) issuing MULXs - which is more efficient than the embedded asm. (That's true only for fe_mul and fe_sqr, the scalar asm is still better.)
~51us for the 10x26 field written in C
~49us for the 10x26 field written in C but using SSE2 to pack the 100 muls into 50 muls.
~38us for the 10x26 field written in C except if I do all the multiplications in an AVX2 array, packed by 4.
~45us for the 10x26 field written in ASM, using all AVX-512.
~38us for the 10x26 field written in ASM, using AVX-512 for muls+adds and then doing the rest in the scalar part (this is what is posted below)
~29us for the 10x26 field written in ASM, except a few lines in c (which don't change much - as 90%+ is in asm), but then doing a -fprofile-generate and -fprofile-use, which somehow gets it +35% faster Roll Eyes

There's definitely something holding back AVX-512 from going faster than the 27us mulx standard, but I'll have to see how it goes on other architectures and non-VM machines as well.

There are 3 factors:

-VM performance, and VM use on overbooked machines, tends to create abnormal performance scenarios. I've seen instructions in the profiler stalling for 10-15% of the whole function time, and I'm like wtf is this Roll Eyes Roll Eyes Roll Eyes

-Scalar code gets a CPU boost (2.1-2.7 GHz instead of the 2 GHz baseline) while AVX-512 code gets an underclock (1.8 GHz). From 1.8 GHz (AVX-512) to 2.7 GHz (scalar code running in the integer unit with turbo), that's a +50% clock difference. That kind of perf hit wasn't part of the plan when thinking about packing stuff together Undecided

-The presence of 1 or 2 AVX-512 units per core (??). From what I've read there are differences between i7/i9, Xeon Gold/Platinum, etc. (cat /proc/cpuinfo is a bit elusive).



The following code (asm 10x26) is in the 38us range in the VM, which is slower than the shipped ASM 5x52 (30us) or the gcc-mulx 5x52 version (27us). I've left a lot of comments intact instead of clearing them out, as they were things I was phasing out by replacing them with asm. I initially started with an array (aaa[10][10]) to store the results, then gradually replaced memory references with the registers themselves, did the additions from the registers, and eventually stored all the results in another array. It's not written very cleanly, nor was it meant to be, since I haven't found the performance I expected.

In total, 100 muls, 81 additions, loading everything from RAM, storing the results and exporting the initial C + D values were managed in something like ...75 instructions (!), of which very few were memory operations. And 20 mul instructions are used instead of the optimum 13, but the 13-mul scenario involves more packing and unpacking, so I left it as is, seeing in the profiler that the muls are cheap (2 issued per cycle).

VALIGNs were slow, but they were better than writing everything to memory and then loading from the right addresses. The two addition vectors were interleaved to do the additions in parallel (again, it can take 2 adds per cycle). Readability suffers a bit in this part as a consequence. Interleaving can also be done with broadcasts and alignments, but I left it as is.

The code needs clobbers for xmm16-xmm31, but it played OK with gcc. Clang crashes (it already uses xmm16-xmm31 elsewhere and doesn't expect them to be overwritten). Clobbering these requires special treatment, with #if definitions that detect AVX-512 presence in the preprocessor.

I'm curious to see what it'll do in, say, an i9 running on a desktop, or some future AVX-512 machine that doesn't underclock AVX-512 action / overclock scalar action for a 30-50% diff. Anyway, I guess I had too much time on my hands Tongue (I think I've written around 130 variations of this by now trying to find the elusive AVX2 and AVX-512 performance, lol).

Code:
SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint32_t *r, const uint32_t *a, const uint32_t * SECP256K1_RESTRICT b) {
/*    uint64_t c, d;*/
/*    uint64_t u0;*/
/*    uint32_t M, R0;*/
/*  uint64_t aaa[10][10] __attribute__ ((aligned (64)));  */    /*the initial "pass" was every mul result getting stored in the 10x10 array. It was replaced with the addition result array - keeping everything in registers*/
    uint64_t aaa[18] __attribute__ ((aligned (64)));  
/*    int ii;*/
/*    const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL, R1 = 0x400UL;*/

  

 __asm__ (  
 

"VPSHUFD $0b00010000, 32(%1), %%XMM5\n" /*b8 to b9 loaded*/
"VPMOVZXDQ 0(%1), %%ZMM1\n" /*b0 to b7 expanded to quadwords and placed accordingly*/

"VPBROADCASTD 0(%0), %%ZMM0\n" /*a0 loaded into zmm*/
"VPBROADCASTD 4(%0), %%ZMM2\n" /*a1 loaded into zmm*/
"VPBROADCASTD 8(%0), %%ZMM8\n" /*a2 loaded into zmm*/
"VPBROADCASTD 12(%0), %%ZMM9\n" /*a3 loaded into zmm*/
"VPBROADCASTD 16(%0), %%ZMM10\n" /*a4 loaded into zmm*/
"VPBROADCASTD 20(%0), %%ZMM20\n" /*a5 loaded into zmm*/
"VPBROADCASTD 24(%0), %%ZMM22\n" /*a6 loaded into zmm*/
"VPBROADCASTD 28(%0), %%ZMM28\n" /*a7 loaded into zmm*/
"VPBROADCASTD 32(%0), %%ZMM29\n" /*a8 loaded into zmm*/
"VPBROADCASTD 36(%0), %%ZMM30\n" /*a9 loaded into zmm*/

"VPMULUDQ %%ZMM0, %%ZMM1, %%ZMM3\n" /*a[0][0] to a[0][7]*/
"VPMULUDQ %%ZMM2, %%ZMM1, %%ZMM4\n"  /*a[1][0] to a[1][7]*/
"VPMULUDQ %%ZMM8, %%ZMM1, %%ZMM11\n"  /*a[2][0] to a[2][7]*/
"VPMULUDQ %%ZMM9, %%ZMM1, %%ZMM13\n"  /*a[3][0] to a[3][7]*/
"VPMULUDQ %%ZMM10, %%ZMM1, %%ZMM15\n"  /*a[4][0] to a[4][7]*/

"VPMULUDQ %%XMM0, %%XMM5, %%XMM6\n"   /*a[0][8] to a[0][9]*/
"VPMULUDQ %%XMM2, %%XMM5, %%XMM7\n"   /*a[1][8] to a[1][9]*/
"VPMULUDQ %%XMM8, %%XMM5, %%XMM12\n"   /*a[2][8] to a[2][9]*/
"VPMULUDQ %%XMM9, %%XMM5, %%XMM14\n"   /*a[3][8] to a[3][9]*/
"VPMULUDQ %%XMM10, %%XMM5, %%XMM16\n"  /*a[4][8] to a[4][9]*/

"VPMULUDQ %%XMM20, %%XMM5, %%XMM24\n"    /*a[5][8] to a[5][9]*/
"VPMULUDQ %%XMM22, %%XMM5, %%XMM25\n"    /*a[6][8] to a[6][9]*/
"VPMULUDQ %%XMM28, %%XMM5, %%XMM26\n"  /*a[7][8] to a[7][9]*/
"VPMULUDQ %%XMM29, %%XMM5, %%XMM27\n"  /*a[8][8] to a[8][9]*/
"VPMULUDQ %%XMM30, %%XMM5, %%XMM31\n"  /*a[9][8] to a[9][9]*/

"VPMULUDQ %%ZMM20, %%ZMM1, %%ZMM17\n"   /*a[5][0] to a[5][7]*/
"VPMULUDQ %%ZMM22, %%ZMM1, %%ZMM18\n"   /*a[6][0] to a[6][7]*/
"VPMULUDQ %%ZMM28, %%ZMM1, %%ZMM19\n" /*a[7][0] to a[7][7]*/
"VPMULUDQ %%ZMM29, %%ZMM1, %%ZMM21\n" /*a[8][0] to a[8][7]*/
"VPMULUDQ %%ZMM30, %%ZMM1, %%ZMM23\n" /*a[9][0] to a[9][7]*/



/* There are two main addition vector sequences.

The load is vertical (e.g. vector 1 loads aaa90 to aaa97, representing a[9]*b[0] to a[9]*b[7]) and the addition is horizontal to the right. The horizontal adding requires
zeroes in the register for the elements that need no further adding, or a k mask in the register to limit the fields added (in this case we're pulling zeroes from zmm28).
 

90+81 72 63 54 45 36 27 18 9   (result stored in aaa[9][0] if using a mem matrix)
91+82 73 64 55 46 37 28 19     (result stored in aaa[9][1] if using a mem matrix)
92+83 74 65 56 47 38 29     (result stored in aaa[9][2] if using a mem matrix)
93+84 75 66 57 48 39     (result stored in aaa[9][3] if using a mem matrix)  
94+85 76 67 58 49     (result stored in aaa[9][4] if using a mem matrix)
95+86 77 68 59     (result stored in aaa[9][5] if using a mem matrix)
96+87 78 69     (result stored in aaa[9][6] if using a mem matrix)
97+88 79     (result stored in aaa[9][7] if using a mem matrix)

Second sequence is of this form:

1+10                         (result stored in aaa[0][1] if using a mem matrix)
2+11       20                     (result stored in aaa[0][2] if using a mem matrix)
3+12       21 30                     (result stored in aaa[0][3] if using a mem matrix)
4+13       22 31 40                     (result stored in aaa[0][4] if using a mem matrix)
5+14       23 32 41 50                     (result stored in aaa[0][5] if using a mem matrix)
6+15       24 33 42 51 60                     (result stored in aaa[0][6] if using a mem matrix)
7+16       25 34 43 52 61 70                     (result stored in aaa[0][7] if using a mem matrix)
8+17       26 35 44 53 62 71 80                  (result stored in aaa[0][8] if using a mem matrix)

Again adding is to the right. The later results overwrite the 1st vertical vector (if they are non-zero-adds) - which keeps all the results.

The only result which is left out of the vector additions is the addition of 89+98 which has to be done unpacked.
00 and 99 will be used directly and aren't added.

In all there are
1) 44 adds in the first vector done as 9 vector adds.
2) 36 adds in the second vector done as 8 vector adds.

Total 80 additions with a packing rate of 4.7x. Plus an unpacked one (89+98), we end up with 18 adds for 81 additions (4.5x packing rate).*/

"VPSRLDQ $8, %%XMM27, %%XMM8\n"
"VPXORQ %%YMM28, %%YMM28, %%YMM28\n" /*zero in order to pull zeroes in the empty fields of the vector*/
"VALIGNQ $1, %%ZMM3, %%ZMM6, %%ZMM30\n" /*1 to 8*/
"VPADDQ %%XMM31, %%XMM8, %%XMM8\n" /*a[9][8] to a[9][9]   with 98 being 98+89*/
"VALIGNQ $7, %%ZMM28, %%ZMM11, %%ZMM1\n" /*20 to 26 - want to add 20-21-22-23-24-25-26 from second element onwards*/

"VMOVDQA64 %%XMM8, 128(%2)\n"  /*a[9][8] to a[9][9]   with 98 being 98+89*/

"VALIGNQ $6, %%ZMM28, %%ZMM13, %%ZMM2\n" /*30 to 35 - want to add 30-31-32-33-34-35 from third element onwards*/
"VALIGNQ $5, %%ZMM28, %%ZMM15, %%ZMM0\n" /*40 to 44 - want to add 40-41-42-43-44 from fourth element onwards*/
"VALIGNQ $4, %%ZMM28, %%ZMM17, %%ZMM8\n" /*50 to 53 - want to add 50-51-52-53 from fifth element onwards*/
"VALIGNQ $3, %%ZMM28, %%ZMM18, %%ZMM10\n" /*60 to 62 - want to add 60-61-62 from sixth element onwards*/
"VALIGNQ $2, %%ZMM28, %%ZMM19, %%ZMM20\n" /*70 to 71 - want to add 70-71 from seventh element onwards*/
"VALIGNQ $1, %%ZMM28, %%ZMM21, %%ZMM22\n" /*80 - want to add 80 only on the eighth element*/

"VALIGNQ $1, %%ZMM21, %%ZMM27, %%ZMM27\n" /*81 to 88*/
"VALIGNQ $2, %%ZMM19, %%ZMM26, %%ZMM26\n" /*72 to 79*/
"VALIGNQ $3, %%ZMM18, %%ZMM25, %%ZMM25\n" /*63 to 69*/
"VALIGNQ $4, %%ZMM17, %%ZMM24, %%ZMM29\n"  /*54 to 59*/
"VALIGNQ $5, %%ZMM15, %%ZMM16, %%ZMM16\n" /*45 to 49*/
"VALIGNQ $6, %%ZMM13, %%ZMM14, %%ZMM14\n" /*36 to 39*/
"VALIGNQ $7, %%ZMM11, %%ZMM12, %%ZMM12\n" /*27 to 29*/
/*18 and 19 left as is on zmm7*/
/*"VALIGNQ $1, %%ZMM6, %%ZMM6, %%ZMM24\n" */  /*9*/
"VPSRLDQ $8, %%YMM6, %%YMM24\n"     /*9*/


"VPADDQ %%ZMM27, %%ZMM23, %%ZMM5\n"  /*add on them aaa81 to aaa88*/
"VPADDQ %%ZMM26, %%ZMM25, %%ZMM25\n"  /*add on them aaa72 to aaa79*/
"VPADDQ %%ZMM29, %%ZMM16, %%ZMM16\n"  /*add on them aaa54 to aaa59*/
"VPADDQ %%ZMM25, %%ZMM5, %%ZMM5\n" /*add on them aaa63 to aaa69*/
"VPADDQ %%ZMM4, %%ZMM30, %%ZMM30\n"  /*add on them aaa10 to aaa17*/
"VPADDQ %%ZMM16, %%ZMM5, %%ZMM5\n"  /*add on them aaa45 to aaa49*/
"VPADDQ %%ZMM12, %%ZMM7, %%ZMM7\n"  /*add on them aaa27 to aaa29*/
"VPADDQ %%ZMM1, %%ZMM30, %%ZMM30\n"  /*add on them aaa20 to aaa26*/
"VPADDQ %%ZMM14, %%ZMM5, %%ZMM5\n"  /*add on them aaa36 to aaa39*/
"VPADDQ %%ZMM2, %%ZMM30, %%ZMM30\n"  /*add on them aaa30 to aaa35*/
"VPADDQ %%ZMM7, %%ZMM5, %%ZMM5\n"  /*add on them aaa18 to aaa19*/
"VPADDQ %%ZMM0, %%ZMM8, %%ZMM8\n"  /*add on them aaa40 to aaa44*/
"VPADDQ %%ZMM10, %%ZMM20, %%ZMM20\n"  /*add on them aaa60 to aaa62*/
"VPADDQ %%ZMM24, %%ZMM5, %%ZMM5\n"  /*add on them aaa9*/
"VPADDQ %%ZMM8, %%ZMM30, %%ZMM30\n"  /*add on them aaa50 to aaa53*/
"VMOVDQA64 %%ZMM5, 64(%2)\n"  /*aaa90 - aaa97 on aaa[8] to aaa[15] */
"VPADDQ %%ZMM22, %%ZMM20, %%ZMM20\n"  /*add 22 on 20 instead, and then on 30, for less serial action */
"VPADDQ %%ZMM20, %%ZMM30, %%ZMM30\n"

"VMOVDQA64 %%ZMM30, 0(%2)\n"

"movq %%xmm5, %%rdx\n" /*d  =  (uint64_t)aaa[9][0]*/
"movq %%xmm3, %%rcx\n" /*  c  = (uint64_t)aaa[0][0]*/  

  "mov $0x3FFFFFF, %%r11d\n" /*M*/
  "movq %%rdx, %%r9\n"
  "xor %%ebx, %%ebx\n"    
  
  "shrq $26, %%rdx\n"
  "and %%r11d, %%r9d\n" /*t9=d&M*/
  "addq 72(%2), %%rdx\n"  /* d=d+(uint64_t)aaa[9]; */
  
  
    
  "loop_top_%=:\n"  
    
  "mov %%r11d, %%r10d\n" /*M*/
  "add $4, %%ebx\n" /*used for loop counter and memory addresses as a multiplier*/
  "and %%edx, %%r10d\n" /*   u0 = d & M; */
  "mov %%r10d, %%r8d\n" /*   u0 = d & M; */
  "imul $0x3D10, %%r10\n" /* u0 * R0*/
  "mov %%r11d, %%eax\n" /*M*/  
  "shrq $26, %%rdx\n" /* d >>= 26 */
  "shlq $10, %%r8\n" /* u0 * R1*/
  "addq %%r10, %%rcx\n" /* c += u0 * R0; */
  "and %%ecx, %%eax\n" /*   c & M; */
  "shrq $26, %%rcx\n" /*c >>= 26*/
  "mov %%eax, -4(%3, %%rbx, 1)\n" /* r[ii] = c & M; */
  "addq %%r8, %%rcx\n" /* c += u0 * R1 */
  "cmp $36, %%ebx\n"    /*from r[0] to [r7] there are some extra additions, while in r[8] there isn't - so when it reaches r[8] it skips them and ends*/
  
  "je loop_end_%=\n"
  "addq 72(%2, %%rbx, 2), %%rdx\n" /*     d +=(uint64_t)aaa[1][ii+2];     */
  "addq -8(%2, %%rbx, 2), %%rcx\n"     /*  c += (uint64_t)aaa[0][ii];     */
   "jmp loop_top_%=\n"

   "loop_end_%=:\n"  

  
   "imulq $0x3D10, %%rdx, %%r8\n"
   "addq %%r9, %%rcx\n"
   "mov $0x3FFFFF, %%r10d\n"
   "addq %%r8, %%rcx\n"    /* c   += d * R0 + r[9];*/
  
   "andq %%rcx, %%r10\n"
   "shlq $14, %%rdx\n"
   "mov %%r10d, 36(%3)\n"  /* r[9] = c & (M >> 4);  */
  
   "shrq $22, %%rcx\n"
   "addq %%rdx, %%rcx\n"  /* c >>= 22; c += d * (R1 << 4); */
  
   "imulq $0x3D1, %%rcx, %%rdx\n"
   "mov 0(%3), %%r10d\n"
   "mov %%r11d, %%r8d\n"
   "addq %%r10, %%rdx\n"   /*   d    = c * (R0 >> 4) + r[0];*/

   "shlq $6, %%rcx\n"  
   "andq %%rdx, %%r8\n"
   "mov 4(%3), %%r10d\n"
   "mov %%r8d, 0(%3)\n"    /*  r[0] = d & M; */
  
   "shrq $26, %%rdx\n"  /*d >>= 26;*/
   "addq %%r10, %%rcx\n"
   "addq %%rcx, %%rdx\n"    /*  d   += c * (R1 >> 4) + r[1];*/
  
   "and %%edx, %%r11d\n"
   "mov %%r11d, 4(%3)\n"   /*  r[1] = d & M;*/
  
   "shrq $26, %%rdx\n"  /* d >>= 26;*/
   "add %%edx, 8(%3)\n" /*    r[2] += d;*/


:
: "q"(a), "q"(b), "q"(aaa), "q"(r)
: "memory", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11","%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7", "%xmm8", "%xmm9", "%xmm10", "%xmm11", "%xmm12", "%xmm13", "%xmm14", "%xmm15"
);
 
/*        d  =  (uint64_t)aaa[0][9]*/
  /*     + (uint64_t)aaa[1][8]*/
   /*  + (uint64_t)aaa[2][7]*/
    /*   + (uint64_t)aaa[3][6]*/
    /*   + (uint64_t)aaa[4][5] */
    /*   + (uint64_t)aaa[5][4]*/
    /*   + (uint64_t)aaa[6][3]*/
    /*   + (uint64_t)aaa[7][2]*/
    /*   + (uint64_t)aaa[8][1]*/
     /*  +  (uint64_t)aaa[9][0];*/

    
 /*  r[9] = d & M; */ /*d >>= 26; */

 /*   c  = (uint64_t)aaa[0][0];*/

  /*  d += */ /*(uint64_t)aaa[1][9]*/
   /*  + (uint64_t)aaa[2][8]*/
   /*    + (uint64_t)aaa[3][7]*/
    /*   + (uint64_t)aaa[4][6] */
    /*   + (uint64_t)aaa[5][5]*/
   /*    + (uint64_t)aaa[6][4]*/
 /*      + (uint64_t)aaa[7][3]*/
  /*     + (uint64_t)aaa[8][2]*/
    /*   + */ /* (uint64_t)aaa[9]; */  /*aaa[9][1]=>aaa[1][1]=>aaa[9]*/

/*    for( ii = 0; ii < 8; ii = ii + 1 )

         {              
    u0 = d & M; d >>= 26; c += u0 * R0;
    r[ii] = c & M; c >>= 26; c += u0 * R1;
    c += (uint64_t)aaa[0][ii];    
    d +=(uint64_t)aaa[1][ii+2];        
           }    */
    
/*    u0 = d & M; d >>= 26; c += u0 * R0;
    r[8] = c & M; c >>= 26; c += u0 * R1;*/

    
/*    c   += d * R0 + r[9];*/
 /*    r[9] = c & (M >> 4);   c >>= 22; c += d * (R1 << 4); */
  

 /*   d    = c * (R0 >> 4) + r[0];*/

 /*   r[0] = d & M;  d >>= 26; */
    
  /*  d   += c * (R1 >> 4) + r[1];*/

  /*  r[1] = d & M;*/ /*d >>= 26;*/
/*    r[2] += d;*/
}


legendary
Activity: 2128
Merit: 1073
April 28, 2017, 01:31:30 PM
#13
Right, using an integer multiply is needlessly slow, but that's just the right idea.
Well, what I had in mind was not ADD/MUL instructions but addition and multiplication in a commutative field. From the way the posters in this thread use the word endomorphism, I presume that they were vaccinated for it and have immunity. Wink
 
(If you're curious, here is how we accomplish those conditional accesses, we loop over the entries with https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_impl.h#L419 ).
It is more or less a software re-implementation of what CPU hardware does when accessing cache lines, except that the hardware also does an ECC check after read and re-computes ECC before write, even on the cheapest processors. (ECC being an Error-Correcting Code.)

Also, this code looks like a good candidate for trying the new BMI (Bit Manipulation Instructions) available in newer x86 processors.

Well we wouldn't care about vectorizing signing in any case-- at least in Bitcoin land signing many messages at once is never a bottleneck (and really, if you're designing something new where it is, merkelize your data and do a single signature (perhaps with an aggregated key)).  Smiley
It is more of a view from your chair, where you don't see the necessity of easy vectorization. As an outside observer, not personally involved in the current battles, I can immediately think of two areas:

1) mining/pool server code is very much likely to benefit from vectorization (possibly variable-length vectors)
2) code embedded in portable/dedicated hardware or otherwise subject to either random hardware faults or intentional fault injection to extract secrets. In that case fixed-length vectors are the savior:
2a) 2-way SIMD where the payload consists of a one new signature and one known signature (from the list of benchmark/test problems). This is the best way to operate in the presence of random faults.
2b) 4-way SIMD where the payload is double the above: each signing operation duplicated and then verified for equality at the conclusion. This is the best way to detect intentional tampering via hardware faults.
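The duplicate-and-verify idea in 2b can be sketched generically (hypothetical code with a placeholder computation; a real implementation would duplicate the whole signing operation across SIMD lanes and compare the resulting signatures):

```c
#include <stdint.h>

/* Run the same (placeholder) computation in two duplicated lanes and
   compare the copies at the end; a mismatch signals an injected
   hardware fault. Returns 1 if the lanes agree, 0 on a detected fault. */
static int checked_square(uint32_t x, uint32_t *out) {
    uint32_t lane[2] = { x, x };            /* duplicated payload */
    int i;
    for (i = 0; i < 2; i++)
        lane[i] *= lane[i];
    *out = lane[0];
    return lane[0] == lane[1];
}
```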

I will not break too many NDAs by stating that many of those secret, highly secure processors internally look like "Z80 with SIMD" as a way to make the hardware and its programs hard to crack. For anyone working without NDAs, targeting ARM with NEON is an excellent benchmark for writing and testing their code.

But let's say verification: at least with x86-style SIMD, efficiently doing a fast table lookup (which is needed to exploit precomputation for state-of-the-art performance) is not easily accomplished. I haven't checked lately, FWIW, but my CPU vanity gen was previously just as fast as the GPU vanity gens due to exploiting optimizations like extensive precomputation. Though it's possible to use pshufb to access small tables, big tables are another issue. Regardless, there is no way the compiler is going to do anything too interesting there on its own, beyond perhaps optimizing some copying loops that would better be eliminated.

(Not that I would discourage someone from trying; it would be a learning experience even if it failed)
I don't disagree with you, but I've observed that there were many unexplored avenues for further hardware-dependent optimization. In particular, everyone seems to be stuck at essentially one representation, where all the words of a bignum are next to each other in memory. From my experience, doing two matrix transposes (one before and one after the actual EC computation) could possibly be a net gain; the intermediate representation would then have all the least significant words of the bignums next to each other in memory.
Our use of types already means that the compiler's existing assumption of the aliasing rule (no pointers of distinct types may alias each other except with char) already gets a lot of restrict's gain, as does the fact that everything gets built in a single compilation unit so the compiler can _see_ the actual non-aliasing of most things... I would love to use restrict, but until about a month ago the situation with GCC was that restrict was blindly heeded, even when the compiler could _directly_ see that the values aliased, and it produced no warnings and would reliably "miscompile" code. So restrict was something of a gratuitous safety footgun, and no static analysis tool I have could help us catch the introduction of restrict errors. I've been nagging GCC developers for a while about this, and the latest development versions of GCC now include limited static analysis of restrict usage.

We do document which internal inputs are/aren't permitted to alias, with a hope of using restrict in the future.
Thanks for a very interesting data point and its explanation. By necessity I mostly work with for-pay compilers, not with GNU tools.
legendary
Activity: 2128
Merit: 1073
April 24, 2017, 07:17:04 PM
#12
G***it.  For Bitcoin talk moderators there is an edit button on every post right next to the quote button.  The edit screen looks almost identical to what you get when you quote a message.

After years of moderating I finally managed to reply by instead editing someone's post.  Sorry 2112. I'm going to try to get theymos to un#$@ things.   I have my reply but I don't have your original message. If by chance you have it, you could just repost. Otherwise we're stuck waiting for it to get fixed.

I am very sorry for this screwup.
Okay, let me try to fetch it from drafts. I also got a PM from "Guest" with what looked like your reply during composition. Post it if you want to continue editing it, then I will delete the "Bitcoin Forum" quote below.

Edit: elided 2nd copy of my post now that Theymos had restored the accidentally deleted original.

Wrong:
x) r =a[ i ];
y) if (i) r = a[1]; else r = a[0];
z) if (i) r = a1; else r = a0;

Good:
a) r = i*a1 + (1-i)*a0;
b) r = i*a[1] + (1-i)*a[0];
c) for (j = 0; j <= 1; j++) r += (j == i)*a[j];

Right, using an integer multiply is needlessly slow, but that's just the right idea.   Though it is possible to construct optimizations that are only faster if you can get away with a which is what I thought I was talking about here.

(If you're curious, here is how we accomplish those conditional accesses, we loop over the entries with https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_impl.h#L419 ).

Quote
vs. easy for the compiler to vectorize:
     ECsign(int n,signature [],message [], secret_key [],nonce [])
     each argument points to a vector of n objects of the particular type.


Well we wouldn't care about vectorizing signing in any case-- at least in Bitcoin land signing many messages at once is never a bottleneck (and really, if you're designing something new where it is, merkelize your data and do a single signature (perhaps with an aggregated key)).  Smiley

But let's say verification: at least with x86-style SIMD, efficiently doing a fast table lookup (which is needed to exploit precomputation for state-of-the-art performance) is not easily accomplished. I haven't checked lately, FWIW, but my CPU vanity gen was previously just as fast as the GPU vanity gens due to exploiting optimizations like extensive precomputation. Though it's possible to use pshufb to access small tables, big tables are another issue. Regardless, there is no way the compiler is going to do anything too interesting there on its own, beyond perhaps optimizing some copying loops that would better be eliminated.

(Not that I would discourage someone from trying; it would be a learning experience even if it failed)

Quote
I just looked in the github at the actual header file and have one more question to ask: why not use "restrict" C pointers? In my experience that was one of the simplest and cheapest gains to achieve, switching the C pointers to non-overlapping FORTRAN semantics. It was standardized so long ago that nowadays everyone supports it.
Our use of types already means that the compiler's existing assumption of the aliasing rule (no pointers of distinct types may alias each other except with char) already gets a lot of restrict's gain, as does the fact that everything gets built in a single compilation unit so the compiler can _see_ the actual non-aliasing of most things... I would love to use restrict, but until about a month ago the situation with GCC was that restrict was blindly heeded, even when the compiler could _directly_ see that the values aliased, and it produced no warnings and would reliably "miscompile" code. So restrict was something of a gratuitous safety footgun, and no static analysis tool I have could help us catch the introduction of restrict errors. I've been nagging GCC developers for a while about this, and the latest development versions of GCC now include limited static analysis of restrict usage.

We do document which internal inputs are/aren't permitted to alias, with a hope of using restrict in the future.
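For readers unfamiliar with the footgun: restrict is a promise to the compiler that the annotated pointers never alias, and a caller who breaks the promise gets silent undefined behavior rather than a diagnostic. A minimal illustration (my example, not the library's code):

```c
#include <stdint.h>

/* With restrict the compiler may assume r, a and b never overlap;
   calling add4(x, x, y) would break that promise with no warning from
   older GCC -- the "gratuitous safety footgun" described above. */
static void add4(uint64_t *restrict r,
                 const uint64_t *restrict a,
                 const uint64_t *restrict b) {
    int i;
    for (i = 0; i < 4; i++)
        r[i] = a[i] + b[i];
}
```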


staff
Activity: 4284
Merit: 8808
April 24, 2017, 07:00:09 PM
#11
G***it.  For Bitcoin talk moderators there is an edit button on every post right next to the quote button.  The edit screen looks almost identical to what you get when you quote a message.

After years of moderating I finally managed to reply by instead editing someone's post.  Sorry 2112. I'm going to try to get theymos to un#$@ things.   I have my reply but I don't have your original message. If by chance you have it, you could just repost. Otherwise we're stuck waiting for it to get fixed.

I am very sorry for this screwup.
legendary
Activity: 2128
Merit: 1073
April 24, 2017, 01:03:41 PM
#10
I did say unless there is a very strong validation framework.  (We have machine verified code in libsecp256k1-- but the tooling does not scale especially well, so far)
I don't want to get into word games. I understand your concerns, but I mostly disagree with the emphasis in your statements:
a) stressing human/manual work over machine work
b) stressing verification/validation over automated/algorithmic generation of the high-level code

We are getting into the deep paranoia territory covered by Ken Thompson's "Reflections on trusting trust". If somebody wants to break your high-level code he can always sneak the required breakage into the build toolchain and build process.
 
On most modern platforms, including Intel's x86_64 parts, accesses to an array take measurably different time based on the value of the address mod cache_line_size-- because loads are always of a full cache line, but data at the beginning of the cache line is available first. There are actual demonstrations of using this side channel to recover keys.
I think a miscommunication occurred here. I was trying to say that one can use any abstraction, e.g.

a) ..... a1,a2,a3 .....
b) ..... a[1],a[2],a[3] .....
c) for (i = 1; i <= 3; ++i)
      ..... a[ i ], .....

so long as the compiled code has exactly same memory access patterns, regardless of the actual values stored in the a* variables. I am not claiming that the semantics cannot be changed by suitably abusing debugging or optimization flags in the compiler, dynamic linking or thread-local storage, or some other runtime mechanism like paging MMU, SMM, TSX, etc.

In the constant time parts of libsecp256k1 we use boolean operations to construct oblivious loads (which read all the data, and select the proper row with masking) whenever we need to access a table with a secret index for this reason.
Basically what I'm saying is this: assume i is in the range 0 to 1.

Wrong:
x) r =a[ i ];
y) if (i) r = a[1]; else r = a[0];
z) if (i) r = a1; else r = a0;

Good:
a) r = i*a1 + (1-i)*a0;
b) r = i*a[1] + (1-i)*a[0];
c) for (j = 0; j <= 1; j++) r += (j == i)*a[j];

Again, I'm assuming no build-time shenanigans like compiling C code with C++ compiler with operator[] overloaded, etc.

Intrinsics do not allow you to control register allocation, and in much code the entire performance advantage of using SIMD operations can be wiped out by unnecessary spills and loads. Compilers are getting better at it, but it's not necessarily a gain. We've used assembly elsewhere in particular to get access to the carry flag, which is not otherwise easily exposed. It's not really much of a maintenance burden, since we use inline assembly, which avoids having to deal with differences in calling conventions between platforms.
Well, you definitely have a point here, especially about ADD/ADC and SUB/SBB, which I used in the past when re-implementing RSA. I also learned very early, in school, about keeping register pressure low and keeping vector units busy, when porting Cray code to cheaper hardware and porting MATLAB prototypes into self-standing C. To me it is so natural that I keep forgetting that a library like secp256k1 was not designed for easy SIMD-ification.

eg. hard for the compiler to vectorize:
      ECsign(signature *,message *,secret_key *,nonce *)
      each argument points to a single object of particular type

vs. easy for the compiler to vectorize:
     ECsign(int n,signature [],message [], secret_key [],nonce [])
     each argument points to a vector of n objects of the particular type.

I just looked at the actual header file on GitHub and have one more question to ask: why not use "restrict" C pointers? In my experience that was one of the simplest and cheapest gains to achieve, switching C pointers to non-overlapping FORTRAN semantics. It was standardized so long ago that nowadays everyone supports it.
staff
Activity: 4284
Merit: 8808
April 23, 2017, 10:40:35 PM
#9
We are living in the XXI century. Machine-generated and machine-verifiable code is the way to go.
I did say unless there is a very strong validation framework.  (We have machine verified code in libsecp256k1-- but the tooling does not scale especially well, so far)

Quote
Why would use of an array preclude side-channel resistance? Use absolutely any high-level abstraction in your code,
On most modern platforms, including Intel's x86_64 parts, accesses to an array take measurably different time based on the value of the address mod cache_line_size, because loads always fetch a full cache line but data at the beginning of the cache line is available first. There are actual demonstrations of using this sidechannel to recover keys.

In the constant-time parts of libsecp256k1 we use boolean operations to construct oblivious loads (which read all the data and select the proper row with masking) whenever we need to access a table with a secret index, for this reason.

Quote
use processor-dependent intrinsics.
Intrinsics do not allow you to control register allocation, and in much code the entire performance advantage of using SIMD operations can be wiped out by unnecessary spills and loads. Compilers are getting better at it, but it's not necessarily a gain. We've used assembly elsewhere in particular to get access to the carry flag, which is not otherwise easily exposed. It's not really much of a maintenance burden, since we use inline assembly, which avoids having to deal with differences in calling conventions between platforms.
legendary
Activity: 2128
Merit: 1073
April 23, 2017, 09:52:52 AM
#8
FWIW, I would not expect any non-trivial program to be completely bug free unless it had a very vigorous validation framework around it that was substantially larger than the code itself; or unless it had something on the order of one man day of review per line of code by at least two different persons and at least a moderately good validation framework around it.

There has been a lot of broken crypto code out there used for generating keys, a lot of it due to underlying broken bignum implementations. There is a reason the tests in libsecp256k1 are as extensive as they are. Smiley
Unnecessarily grim assessment, IMHO.

We are living in the XXI century. Machine-generated and machine-verifiable code is the way to go. In the XXI century even the pure mathematicians are willing to accept machine-made proofs, provided that the "machines" involved are reasonably cleanly designed.

On a related issue:

but I wasn't considering it for "submission" due to the array use which is not sidechannel-resistant (and I thought it was a "requirement").
Why would use of an array preclude side-channel resistance? Use absolutely any high-level abstraction in your code, so long as the memory access patterns (both code fetch and data access) don't depend on the actual input data, only on its size (which in this case is constant). Ultimately, your code is executed by a von Neumann machine, which is itself contained in an array of memory cells.

I actually haven't read your code, just scrolled through it. Personally, I would avoid using raw assembly mnemonics and use processor-dependent intrinsics. It is more work initially (because you have to review the generated assembly), but less work in long-term maintenance.
staff
Activity: 4284
Merit: 8808
April 23, 2017, 05:24:09 AM
#7
I would have never thought consistency might have been a problem with integer values until I kind of learned this the hard way,
Very valuable lesson there.

FWIW, I would not expect any non-trivial program to be completely bug free unless it had a very vigorous validation framework around it that was substantially larger than the code itself; or unless it had something on the order of one man day of review per line of code by at least two different persons and at least a moderately good validation framework around it.

There has been a lot of broken crypto code out there used for generating keys, a lot of it due to underlying broken bignum implementations. There is a reason the tests in libsecp256k1 are as extensive as they are. Smiley
legendary
Activity: 1708
Merit: 1049
April 23, 2017, 04:16:38 AM
#6
Code:
SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint32_t *r, const uint32_t *a) {
  /*  uint64_t c, d;*/
    uint64_t result[17]; /*temp storage array*/
    uint32_t tempstor[2]; /*temp storage array*/
    const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL  /*, R1 = 0x400UL*/ ;

  tempstor[0]=M;
  tempstor[1]=R0;
/*tempstor[2] for R1 isn't needed. It's 1024, so shifting left by 10 instead*/

  __asm__ __volatile__(  
 /* Part #1: The multiplications and additions of
  *
  *  
  *   (uint64_t)(a[0]*2) * a[1]        =result0
  *
  *   (uint64_t)(a[0]*2) * a[2]
       + (uint64_t)a[1] * a[1]          =result1
            
      (uint64_t)(a[0]*2) * a[3]
       + (uint64_t)(a[1]*2) * a[2]    =result2
      
       (uint64_t)(a[0]*2) * a[4]       =result3
       + (uint64_t)(a[1]*2) * a[3]
       + (uint64_t)a[2] * a[2];        
      
      (uint64_t)(a[0]*2) * a[5]
       + (uint64_t)(a[1]*2) * a[4]    =result4
       + (uint64_t)(a[2]*2) * a[3];     */

 "MOVD 0(%0), %%MM0\n"  /*a0 */
 "MOVD 8(%0), %%MM2\n"  /*a2 */
 "MOVD 4(%0), %%MM1\n" /* a1 */
 "MOVQ %%MM0, %%MM6\n" /*cloning a0 to mm6*/
 "MOVQ %%MM0, %%MM4\n" /*cloning a0 to mm4*/
 "MOVQ %%MM1, %%MM5\n" /*cloning a1 to mm5*/
 "PMULUDQ %%MM2, %%MM6\n" /*a2 * a0*/
 "MOVQ %%MM1, %%MM7\n"  /*cloning a1 to mm7*/
 "PMULUDQ %%MM5, %%MM5\n" /*a1 * a1*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a1*/
 "MOVD 12(%0), %%MM3\n" /* a3 */
 "PADDQ %%MM6, %%MM6\n"  /* doubled a2 * a0 */
 "PADDQ %%MM5, %%MM6\n" /* (a2*a0*2) + (a1*a1) */
 "PADDQ %%MM7, %%MM7\n" /* doubled a0 * a1 */
 "MOVQ %%MM1, %%MM5\n" /*cloning a1 to mm5*/
 "MOVQ %%MM6, 8(%1)\n" /*result[1] */  
 "MOVQ %%MM7, 0(%1)\n" /*result[0] */
 
 "PMULUDQ %%MM3, %%MM4\n" /*a3 * a0*/
 "PMULUDQ %%MM2, %%MM5\n" /*a2 * a1*/  
 "MOVD 16(%0), %%MM7\n" /*a4 to mm7*/
 "PADDQ %%MM4, %%MM5\n" /* (a3 * a0) + (a2*a1) */
 "PADDQ %%MM5, %%MM5\n" /* (a3 * a0) + (a2*a1) doubled */
 "MOVQ %%MM5, 16(%1)\n" /*result[2] */  
 
 "MOVQ %%MM2, %%MM6\n" /*cloning a2 to mm6*/
 "MOVQ %%MM7, %%MM4\n" /*cloning a4 to mm4*/
 "MOVQ %%MM3, %%MM5\n" /*cloning a3 to mm5*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a4*/
 "PMULUDQ %%MM1, %%MM5\n" /*a1 * a3*/
 "PMULUDQ %%MM2, %%MM6\n" /*a2 * a2*/  
 "PADDQ %%MM7, %%MM5\n" /* (a0*a4) + (a1*a3) */
 "PADDQ %%MM5, %%MM5\n"/* doubling (a0*a4) + (a1*a3) */
 "MOVD 20(%0), %%MM7\n" /*prefetch a5 to mm7*/
 "PADDQ %%MM6, %%MM5\n" /*  (2* (a0*a4) + (a1*a3)) + (a2*a2) */
 "MOVQ %%MM5, 24(%1)\n" /*result[3] */  
 
 "MOVQ %%MM4, %%MM6\n" /*cloning a4 to mm6*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a5*/
 "MOVQ %%MM3, %%MM5\n" /*cloning a3 to mm5*/
 "PMULUDQ %%MM1, %%MM6\n" /*a1 * a4*/
 "PMULUDQ %%MM2, %%MM5\n" /*a2 * a3*/
 "PADDQ %%MM6, %%MM7\n"
 "MOVD 24(%0), %%MM6\n" /*prefetch a6 to mm6*/
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 20(%0), %%MM5\n" /*prefetch a5 to mm5*/
 "PADDQ %%MM7, %%MM7\n" /*adding and doubling a1*a4 + a0*a5 + a2*a3*/
 "MOVQ %%MM7, 32(%1)\n" /*result[4] */  
 
 /* Part #2: The multiplications and additions of
  *  
       (uint64_t)(a[0]*2) * a[6]
       + (uint64_t)(a[1]*2) * a[5]
       + (uint64_t)(a[2]*2) * a[4]        =result5
       + (uint64_t)a[3] * a[3]
      
       (uint64_t)(a[0]*2) * a[7]
       + (uint64_t)(a[1]*2) * a[6]
       + (uint64_t)(a[2]*2) * a[5]
       + (uint64_t)(a[3]*2) * a[4]       =result6
      
        (uint64_t)(a[0]*2) * a[8]
       + (uint64_t)(a[1]*2) * a[7]
       + (uint64_t)(a[2]*2) * a[6]
       + (uint64_t)(a[3]*2) * a[5]
       + (uint64_t)a[4] * a[4]           =result7      */

 "PMULUDQ %%MM3, %%MM3\n" /*a3 * a3*/
 "MOVQ %%MM4, %%MM7\n" /*a4 to mm7*/
 "PMULUDQ %%MM0, %%MM6\n" /*a0 * a6*/
 "PMULUDQ %%MM1, %%MM5\n" /*a1 * a5*/
 "PMULUDQ %%MM2, %%MM7\n" /*a2 * a4*/
 "PADDQ %%MM6, %%MM5\n"
 "MOVD 24(%0), %%MM6\n" /*prefetch a6 to mm6*/  
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 20(%0), %%MM5\n" /*prefetch a5 to mm5*/  
 "PADDQ %%MM7, %%MM7\n" /*adding and doubling a0*a6 + a1*a5 + a2*a4*/
 "PADDQ %%MM3, %%MM7\n" /* adding non-doubled a3*a3 */
 "MOVD 12(%0), %%MM3\n" /*prefetch a3 to mm3*/
 "MOVQ %%MM7, 40(%1)\n" /*result[5] */  
 
 "PMULUDQ %%MM2, %%MM5\n" /*a2 * a5*/
 "PMULUDQ %%MM1, %%MM6\n" /*a1 * a6*/
 "PMULUDQ %%MM3, %%MM4\n" /*a3 * a4*/
 "MOVD 28(%0), %%MM7\n" /*a7 to mm7*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a7*/
 "PADDQ %%MM5, %%MM6\n"
 "PADDQ %%MM6, %%MM4\n"   /* adding 4 prior multiplications */
 "MOVD 24(%0), %%MM5\n" /*prefetch a6 to mm5*/
 "PADDQ %%MM4, %%MM7\n"
 "MOVD 20(%0), %%MM4\n" /*prefetch a5 to mm4*/  
 "PADDQ %%MM7, %%MM7\n"  /* doubling the result */
 "MOVQ %%MM7, 48(%1)\n" /*result[6] */    

 "PMULUDQ %%MM4, %%MM3\n" /*a5 * a3*/
 "MOVD 28(%0), %%MM6\n" /*a7 to mm6*/
 "PMULUDQ %%MM5, %%MM2\n" /*a6 * a2*/
 "MOVD 32(%0), %%MM7\n" /*a8 to mm7*/  
 "PMULUDQ %%MM6, %%MM1\n" /*a7 * a1*/
 "MOVD 16(%0), %%MM4\n" /*a4 to mm4*/  
 "PMULUDQ %%MM7, %%MM0\n" /*a8 * a0*/
 "PMULUDQ %%MM4, %%MM4\n" /*a4 * a4*/
 "PADDQ %%MM3, %%MM2\n"
 "PADDQ %%MM2, %%MM1\n"   /* adding the first 4 multiplications */
 "PADDQ %%MM1, %%MM0\n"
 "MOVD 36(%0), %%MM2\n" /*a9 to mm2*/  
 "PADDQ %%MM0, %%MM0\n"  /* doubling the result of the first 4 multiplications */
 "MOVQ %%MM7, %%MM1\n"  /* cloning a8 to mm1*/
 "PADDQ %%MM4, %%MM0\n"  /* adding the non-doubled a4*a4 */
 "MOVQ %%MM2, %%MM3\n"  /* cloning a9 to mm3*/
 "MOVQ %%MM0, 56(%1)\n" /*result[7] */  
 
/* Part #3: The multiplications and additions of
  *  
  *   (uint64_t)a[9] * a[9]               = result8
      
       (uint64_t)(a[8]*2) * a[9]         = result9
          
       (uint64_t)(a[7]*2) * a[9]
       + (uint64_t)a[8] * a[8]            = result10
      
       (uint64_t)(a[6]*2) * a[9]
       + (uint64_t)(a[7]*2) * a[8]      = result11    
      
       (uint64_t)(a[5]*2) * a[9]
       + (uint64_t)(a[6]*2) * a[8]      = result12
       + (uint64_t)a[7] * a[7]
      
       (uint64_t)(a[4]*2) * a[9]
       + (uint64_t)(a[5]*2) * a[8]      = result13
       + (uint64_t)(a[6]*2) * a[7]
      
       (uint64_t)(a[3]*2) * a[9]
       + (uint64_t)(a[4]*2) * a[8]
       + (uint64_t)(a[5]*2) * a[7]
       + (uint64_t)a[6] * a[6]            = result14          */

 "PMULUDQ %%MM3, %%MM3\n" /*a9 * a9*/
 "MOVQ %%MM2, %%MM4\n"  /* cloning a9 to mm4*/  
 "PADDQ %%MM7, %%MM7\n" /*a8 doubled */
 "MOVQ %%MM1, %%MM0\n" /*cloning a8 to mm0*/
 "MOVQ %%MM3, 64(%1)\n" /*result[8] = a9*a9  */
 
 "PMULUDQ %%MM7, %%MM4\n" /*(a8*2) * a9*/
 "MOVQ %%MM2, %%MM3\n"  /* cloning a9 to mm3*/  
 "MOVQ %%MM4, 72(%1)\n" /*result[9]*/
 "PMULUDQ %%MM0, %%MM0\n" /*a8*a8*/  
 "PMULUDQ %%MM6, %%MM3\n" /*a7*a9*/
 "MOVQ %%MM2, %%MM4\n"  /* cloning a9 to mm4*/  
 "PMULUDQ %%MM5, %%MM4\n" /*a6*a9*/
 "PADDQ %%MM3, %%MM3\n" /*a7*a9 doubled*/  
 "PADDQ %%MM0, %%MM3\n" /*a7*a9 doubled + a8*a8  */
 "MOVQ %%MM1, %%MM0\n" /*cloning a8 to mm0*/
 "MOVQ %%MM2, %%MM7\n" /*cloning a9 to mm7*/
 "MOVQ %%MM3, 80(%1)\n" /*result[10]*/
 
 "PMULUDQ %%MM6, %%MM0\n" /*a7*a8*/  
 "MOVQ %%MM1, %%MM3\n" /*cloning a8 to mm3*/
 "PADDQ %%MM4, %%MM0\n" /*a7*a8 + a6*a9 */
 "MOVD 20(%0), %%MM4\n" /*a5 to mm4*/  
 "PADDQ %%MM0, %%MM0\n" /*a7*a8 + a6*a9 doubling*/
 "MOVQ %%MM0, 88(%1)\n" /*result[11]*/  

 "PMULUDQ %%MM5, %%MM3\n" /*a6*a8*/  
 "PMULUDQ %%MM4, %%MM7\n" /*a5*a9*/  
 "MOVQ %%MM6, %%MM0\n" /*cloning a7*/  
 "PMULUDQ %%MM0, %%MM0\n" /*a7*a7*/  
 "PADDQ %%MM3, %%MM7\n" /*a5*a9   +   a6*a8*/
 "PADDQ %%MM7, %%MM7\n"/*a5*a9   +   a6*a8   doubled*/
 "MOVD 16(%0), %%MM3\n" /*a4*/
 "PADDQ %%MM0, %%MM7\n"/*a5*a9   +   a6*a8   doubled     + a7*a7*/
 "MOVQ %%MM2, %%MM0\n" /*cloning a9*/
 "MOVQ %%MM7, 96(%1)\n" /*result[12]*/

 "PMULUDQ %%MM3, %%MM0\n" /*a4*a9*/    
 "MOVQ %%MM1, %%MM7\n" /*cloning a8*/
 "MOVQ %%MM6, %%MM3\n" /*cloning a7*/  
 "PMULUDQ %%MM4, %%MM7\n" /*a5*a8*/    
 "PMULUDQ %%MM5, %%MM3\n" /*a6*a7*/  
 "PADDQ %%MM0, %%MM7\n"
 "MOVD 12(%0), %%MM0\n" /* a3 in mm0*/
 "PADDQ %%MM3, %%MM7\n"  /*adding the 3 prior multiplications*/
 "PADDQ %%MM7, %%MM7\n"  /* doubling the result of the 3 multiplications*/
 "MOVQ %%MM7, 104(%1)\n" /*result[13]*/

 "PMULUDQ %%MM5, %%MM5\n" /*a6*a6*/    
 "MOVQ %%MM6, %%MM3\n" /*cloning a7*/
 "MOVD 16(%0), %%MM7\n" /* a4 in mm7*/    
 "PMULUDQ %%MM2, %%MM0\n" /*a3*a9*/  
 "PMULUDQ %%MM4, %%MM3\n" /*a5*a7*/  
 "PMULUDQ %%MM1, %%MM7\n" /*a4*a8*/  
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 8(%0), %%MM0\n" /*a2 in mm0*/
 "PADDQ %%MM3, %%MM7\n" /*adding the last 3 multiplications*/
 "MOVD 16(%0), %%MM3\n" /*a4 in mm3*/  
 "PADDQ %%MM7, %%MM7\n" /*doubling the sum of the 3 multiplications*/
 "PADDQ %%MM5, %%MM7\n" /*adding non-doubled a6*a6*/
 "MOVQ %%MM7, 112(%1)\n" /*result[14]*/

/* Part #4: The multiplications and additions of
  *  
  *   (uint64_t)(a[2]*2) * a[9]
       + (uint64_t)(a[3]*2) * a[8]
       + (uint64_t)(a[4]*2) * a[7]
       + (uint64_t)(a[5]*2) * a[6]        =result 15
      
       (uint64_t)(a[1]*2) * a[9]
       + (uint64_t)(a[2]*2) * a[8]
       + (uint64_t)(a[3]*2) * a[7]
       + (uint64_t)(a[4]*2) * a[6]
       + (uint64_t)a[5] * a[5]              =result16
      
        (uint64_t)(a[0]*2) * a[9]
       + (uint64_t)(a[1]*2) * a[8]
       + (uint64_t)(a[2]*2) * a[7]
       + (uint64_t)(a[3]*2) * a[6]
       + (uint64_t)(a[4]*2) * a[5]        =result17  / initial value of d  */

 "MOVD 24(%0), %%MM5\n" /*a6 in mm5*/
 "PMULUDQ %%MM2, %%MM0\n" /*a2*a9*/  
 "MOVD 12(%0), %%MM7\n" /*a3 in mm7*/
 "PMULUDQ %%MM6, %%MM3\n" /*a4*a7*/
 "PMULUDQ %%MM4, %%MM5\n" /*a5*a6*/
 "PMULUDQ %%MM1, %%MM7\n" /*a3*a8*/  
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 4(%0), %%MM0\n" /*a1 in mm0*/  
 "PADDQ %%MM3, %%MM5\n"
 "MOVD 16(%0), %%MM3\n" /*a4in mm3*/  
 "PADDQ %%MM5, %%MM7\n" /* ...adding all the resuts */
 "MOVD 24(%0), %%MM5\n" /*a6 in mm5*/
 "PADDQ %%MM7, %%MM7\n" /* doubling them */
 "MOVQ %%MM7, 120(%1)\n" /*result[15]*/

 "PMULUDQ %%MM3, %%MM5\n" /* a4*a6 */
 "PMULUDQ %%MM2, %%MM0\n" /* a1*a9 */
 "PMULUDQ %%MM4, %%MM4\n" /* a5*a5 */  
 "MOVD 8(%0), %%MM7\n" /*a2 in mm7*/
 "MOVD 12(%0), %%MM3\n" /*a3 in mm3*/  
 "PMULUDQ %%MM1, %%MM7\n" /* a2*a8 */  
 "PMULUDQ %%MM6, %%MM3\n" /* a3*a7 */
 "PADDQ %%MM5, %%MM0\n"
 "MOVD 24(%0), %%MM5\n" /*prefetch a6 in mm5*/
 "PADDQ %%MM0, %%MM7\n"
 "MOVD  0(%0), %%MM0\n" /*prefetch a0 in mm0*/
 "PADDQ %%MM3, %%MM7\n" /*adding the 4 multiplications except a5*a5*/
 "MOVD  4(%0), %%MM3\n" /*prefetch a1 in mm3*/
 "PADDQ %%MM7, %%MM7\n" /*...and doubling them*/
 "PADDQ %%MM4, %%MM7\n" /*adding the non-doubled a5*a5*/
 "MOVD 20(%0), %%MM4\n" /*prefetch a5 in mm4*/
 "MOVQ %%MM7, 128(%1)\n" /*result[16]*/  
 
 "MOVD  8(%0), %%MM7\n" /*a2 in mm7*/
 "PMULUDQ %%MM3, %%MM1\n" /* a1*a8 */
 "PMULUDQ %%MM0, %%MM2\n" /* a0*a9 */  
 "MOVD 12(%0), %%MM3\n" /*a3 in mm3*/
 "PMULUDQ %%MM7, %%MM6\n" /* a2*a7 */
 "MOVD 16(%0), %%MM7\n" /*a4 in mm7*/
 "PMULUDQ %%MM0, %%MM0\n" /* a0*a0 =  c initial value*/
 "PMULUDQ %%MM3, %%MM5\n" /* a3*a6 */  
 "PMULUDQ %%MM7, %%MM4\n" /* a4*a5 */    
 "PADDQ %%MM2, %%MM1\n"
 "MOVD 0(%3), %%MM2\n"  /*prefetch M to MM2 */
 "PADDQ %%MM1, %%MM6\n"
 "MOVQ %%MM0, %%MM7\n" /* move c to MM7 (a0*a0) */
 "MOVQ 128(%1), %%MM3\n" /*prefetch result16*/
 "PADDQ %%MM5, %%MM6\n"  
 "MOVQ %%MM2, %%MM0\n" /*cloning M to MM0 for secondary storage */
 "PADDQ %%MM4, %%MM6\n" /*adding all multiplications together except a0*a0 which is a different result*/
 "MOVD 4(%3), %%MM4\n"  /*R0 to MM4 */
 "PADDQ %%MM6, %%MM6\n" /*doubling the result  -- initial value for d*/
/* "MOVQ %%MM6, 144(%1)\n" */  /*result[17] - which is also the initial value for d so it can be used directly*/  
  
 
 "PAND %%MM6, %%MM2\n"   /* r[9] = d & M;  */
 "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */    
 "MOVD %%MM2, 36(%2)\n" /* r[9] = d & M;  */
 "MOVQ %%MM0, %%MM2\n"  /* M back to mm2 */
 "PADDQ %%MM3, %%MM6\n"  /*  d += (uint64_t)result[16]  */

  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ 120(%1), %%MM3\n" /*prefetch result15*/
  "MOVQ 0(%1), %%MM5\n"  /*prefetch result0 */    
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[15]  */  
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 0(%2)\n" /* exporting t0/r[0] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[0] */    

  "MOVQ 112(%1), %%MM3\n"   /*prefetch result14*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[14]  */  
  "MOVQ 8(%1), %%MM5\n"  /*prefetch result1 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 4(%2)\n" /* exporting t1/r[1] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[1] */    

  "MOVQ 104(%1), %%MM3\n"  /*prefetch result13*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[13]  */  
  "MOVQ 16(%1), %%MM5\n"  /*prefetch result2 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 8(%2)\n" /* exporting t2/r[2] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[2] */    

  "MOVQ 96(%1), %%MM3\n"  /*prefetch result12*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[12]  */  
  "MOVQ 24(%1), %%MM5\n"  /*prefetch result3 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 12(%2)\n" /* exporting t3/r[3] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[3] */      
  
  "MOVQ 88(%1), %%MM3\n"  /*prefetch result11*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[11]  */  
  "MOVQ 32(%1), %%MM5\n"  /*prefetch result4 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 16(%2)\n" /* exporting t4/r[4] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[4] */      

  "MOVQ 80(%1), %%MM3\n"  /*prefetch result10*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[10]  */
  "MOVQ 40(%1), %%MM5\n"  /*prefetch result5 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 20(%2)\n" /* exporting t5/r[5] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[5] */      

  "MOVQ 72(%1), %%MM3\n"  /*prefetch result9*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[9]  */  
  "MOVQ 48(%1), %%MM5\n"  /*prefetch result6 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 24(%2)\n" /* exporting t6/r[6] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[6] */        

  "MOVQ 64(%1), %%MM3\n"  /*prefetch result8*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[8]  */
  "MOVQ 56(%1), %%MM5\n"  /*prefetch result7 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 28(%2)\n" /* exporting t7/r[7] = c & M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[7] */    

  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "MOVD %%MM3, 32(%2)\n" /* exporting t8/r[8] = c & M */
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  
  "MOVQ %%MM0, %%MM2\n" /*cloning M*/
  "MOVD 36(%2), %%MM1\n" /* R[9] in */    
  "PMULUDQ %%MM6, %%MM4\n" /* d * R0 */
  "PADDQ %%MM1, %%MM4\n" /* d * R0 + r[9]*/  
  "PADDQ %%MM4, %%MM7\n" /* c+=  d * R0 + r[9]*/
  
  "PSRLQ $4, %%MM0\n"    /*  M >>= 4; */
  "PAND %%MM7, %%MM0\n" /*c & M >>4*/
  "MOVD %%MM0, 36(%2)\n"  /*  r[9] = c & (M >> 4); */
  
  "PSRLQ $22, %%MM7\n"    /*  c >>= 22 */  
  "PSLLQ $14, %%MM6\n" /* d * (R1 << 4). Since (R1 << 4) equals 16384, it's essentially a left shift by 14 */
  "PADDQ %%MM6, %%MM7\n" /* c += d * (R1 << 4); */

  "MOVQ %%MM7, %%MM3\n" /*cloning c*/
  "PSLLQ $6, %%MM7\n" /*  result of c * (R1 >> 4) which equals c shifted left by 6, since (R1 >> 4) = 64 */    

  "MOVQ %%MM3, %%MM0\n"     /*this is a manual attempt at multiplying c with x3D1 or 977 decimal, by shifting and adding copies of c ...*/
  "MOVQ %%MM3, %%MM1\n"     /*all this segment, is, in reality, just a (c*977) single line multiplication */
  "MOVQ %%MM3, %%MM6\n"     /* which for some reason doesn't want to work otherwise with a plain PMULUDQ c * 977 constant  */
  "MOVQ %%MM3, %%MM4\n"
  "MOVQ %%MM3, %%MM5\n"  
  "PSLLQ $9, %%MM0\n" /* x512 */    
  "PSLLQ $8, %%MM1\n" /* x256 */    
  "PSLLQ $7, %%MM6\n" /* x128 */      
  "PSLLQ $6, %%MM4\n" /* x64 */        
  "PSLLQ $4, %%MM5\n" /* x16 */        /*512+256+128+64 = 976 times c, so +1 add on top =977 times c or c * 0x3D1 */
  "PADDQ %%MM3, %%MM0\n"
  "PADDQ %%MM1, %%MM0\n"
  "MOVD 0(%2), %%MM3\n"          /*prefetch r[0] to MM3 */
  "PADDQ %%MM6, %%MM0\n"  
  "PADDQ %%MM4, %%MM0\n"  
  "PADDQ %%MM0, %%MM5\n"     /*  result of c * (R0 >> 4) */    
    
  "PADDQ %%MM3, %%MM5\n"  /* d = r[0] + c (R0 >> 4) */
  "MOVD 4(%2), %%MM6\n"  /*r[1] to MM6 */  
  "MOVQ %%MM5, %%MM3\n" /*cloning d */
  
  "PAND %%MM2, %%MM5\n" /*d&M*/
  "MOVD 8(%2), %%MM0\n"  /*r[2] to MM0 */      
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */    
  "PADDQ %%MM7, %%MM6\n" /* c * (R1 >> 4) + r[1] */
  "PADDQ %%MM6, %%MM3\n" /*d   += c * (R1 >> 4) + r[1];  */
  "MOVD %%MM5, 0(%2)\n" /* export d to r[0] */
  "MOVQ %%MM3, %%MM7\n" /*cloning d */
  
  "PAND %%MM2, %%MM7\n" /*d&M*/
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */      
  "PADDQ %%MM0, %%MM3\n"  /*d   += r[2];*/
  "MOVD %%MM7, 4(%2)\n"  /*r[1] = d & M; */
  "MOVD %%MM3, 8(%2)\n"  /*r[2] =d */
  "EMMS\n"
  
 :
 : "q"(a), "q"(result), "q"(r) , "q" (tempstor)
 : "memory", "%mm0", "%mm1", "%mm2", "%mm3", "%mm4", "%mm5", "%mm6", "%mm7"
 );  
}    

If anyone wants to use it, or integrate it somewhere, feel free (no credits required), but I can't claim it'll be 100% consistent when I don't even understand why 2 multiplications broke and required a workaround.

I may try this again in the future with a packed version on ymm/zmm registers. In a packed manner it might even trounce the 5x52 field. We'll see how it goes...


edit: Damn it, I just realized why it broke. I'm multiplying doublewords to get a quadword, yet c, at 46 bits, can't fit in a doubleword source. Doh...
legendary
Activity: 1708
Merit: 1049
April 23, 2017, 04:16:11 AM
#5
The verification angle also has two other issues you aren't considering: consensus consistency.  Use of GMP in validation would make the exact behavior of GMP consensus critical, which is a problem because different systems run different code. (GMP also has a license which is more restrictive than the Bitcoin software).

I would have never thought consistency might have been a problem with integer values until I kind of learned this the hard way, losing a day, getting beaten by two multiplications at the end of the 10x26 field code while converting the whole thing to SSE2 running on MMX registers for my 12-year-old Pentium M laptop...

d   += c * (R1 >> 4) + t1

and

d    = c * (R0 >> 4) + t0;

where R1 >> 4 = 64 and R0 >> 4 = 977... there was just no way to get it to compute c * 64 / c * 977 like this: PMULUDQ with (a register holding 64 or 977) as source and (the register holding c) as target.

I still don't know why a left shift by 6 is not the same as a multiplication by 64, if overflow and wraparound are not at play (or maybe they are, I don't know, but from what I see the bit counts should be 53 and 56, far from overflowing).

    d    = c * (R0 >> 4) + t0;
    VERIFY_BITS(d, 56);

    d   += c * (R1 >> 4) + t1;
    VERIFY_BITS(d, 53);


Anyway, my mind is still perplexed about this, but I had to work around it until the tests stopped breaking.

For the *64 I did it with shifting left by 6.

For the *977 I did it with a monstrosity where c gets copied into multiple registers, each copy is shifted appropriately, and then all the shifted copies are added together.

Anyway, despite this expensive workaround, native 64-bit arithmetic on MMX registers trounced the gcc/clang/icc-compiled versions for a -m32 build. Opcode size was also reduced (1900 vs 2800+ bytes for field_mul, 1300 vs 2200+ bytes for field_sqr).

Quote
If you don't mean using GMP but just using different operations for non-sidechannel sensitive paths-- the library already does that in many places-- though not for FE mul/add. FE normalizes do take variable time in verification. If you have a speedup based on that feel free to submit it!

Right now the only massive speedup I have achieved is in the 32-bit version, where compilers use 32-bit registers (instead of 64-bit MMX/SSE registers). For the 64-bit version, without AVX, I'm at ~10% gains in non-commented code (which seems bad for reviewing), of which 3-5% was a recent gain from reducing the clobber list on the asm and manually interleaving the initial pushes and final pops at convenient stages (like multiplication and addition stalls).


The faster 32-bit version is below. I did add comments on what each line does, in case it is useful to others, but I wasn't considering it for "submission" due to the array use which is not sidechannel-resistant (and I thought it was a "requirement"). Essentially, an array is used at the start to store the results of the multiplications and additions that are non-linear and can therefore be computed up front.

On my laptop (pentium-m 2.13ghz) bench_verify (with endomoprhism) is down to 350us from 570us.

On my desktop (q8200 @1.86ghz) it's at 404us down from 652us. It could probably go down a further 5-10% if code readability suffers a lot (it's already suffering from some manual interleaving of memory operations with muls and adds to gain ~7-10%).

c version (-m32 / field 10x26):
field_sqr: min 0.187us / avg 0.188us / max 0.189us
field_mul: min 0.275us / avg 0.277us / max 0.278us
field_inverse: min 55.4us / avg 55.6us / max 55.8us
field_inverse_var: min 55.4us / avg 55.6us / max 55.9us
field_sqrt: min 53.8us / avg 54.1us / max 54.5us
...
context_verify: min 77649us / avg 77741us / max 77891us
context_sign: min 267us / avg 268us / max 269us
...

asm version  (-m32 / field 10x26)  (sse2 on mmx* registers):
field_sqr: min 0.101us / avg 0.101us / max 0.102us
field_mul: min 0.135us / avg 0.135us / max 0.135us
field_inverse: min 28.3us / avg 28.3us / max 28.4us
field_inverse_var: min 28.3us / avg 28.3us / max 28.4us
field_sqrt: min 28.0us / avg 28.0us / max 28.0us
...
context_verify: min 42876us / avg 43099us / max 43391us
context_sign: min 170us / avg 170us / max 170us

* On Core 2, it makes no difference whether xmm or mm registers are used (apart from a 2-3% speedup from removing the EMMS at the end). On the Pentium M, the xmm register operations are very slow to begin with; I suspect they are emulated in terms of width, leading to twice the operations internally, while the mmx registers are mapped onto the 80-bit FPU registers, which have full actual width and proper speed.

Code:
SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint32_t *r, const uint32_t *a, const uint32_t * SECP256K1_RESTRICT b) {
/*  uint64_t c, d;*/
    uint64_t result[19]; /*temp storage array*/
    uint32_t tempstor[2]; /*temp storage array*/
    const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL /* ,R1 = 0x400UL */ ;
    
tempstor[0]=M;
tempstor[1]=R0;
/*tempstor[2] for R1 isn't needed. It's 1024, so shifting left by 10 instead*/
    
  __asm__ __volatile__(  
  
 /* Part #1: The multiplications and additions of
  *
  *   d  = (uint64_t)a[0] * b[9]
       + (uint64_t)a[1] * b[8]
       + (uint64_t)a[2] * b[7]
       + (uint64_t)a[3] * b[6]
       + (uint64_t)a[4] * b[5]
       + (uint64_t)a[5] * b[4]
       + (uint64_t)a[6] * b[3]
       + (uint64_t)a[7] * b[2]
       + (uint64_t)a[8] * b[1]
       + (uint64_t)a[9] * b[0]; */
 
 "MOVD  0(%0), %%MM0\n"  /*a0 */
 "MOVD 36(%1), %%MM2\n"  /*b9 */
 "MOVD  4(%0), %%MM1\n" /* a1 */
 "MOVD 32(%1), %%MM3\n" /* b8 */
 "PMULUDQ %%MM0, %%MM2\n" /*a0 * b9*/
 "PMULUDQ %%MM1, %%MM3\n" /*a1 * b8*/
 "MOVD  8(%0), %%MM4\n"  /*a2 */
 "MOVD 28(%1), %%MM6\n"  /*b7 */
 "MOVD 12(%0), %%MM5\n" /* a3 */
 "MOVD 24(%1), %%MM7\n" /* b6 */
 "PMULUDQ %%MM4, %%MM6\n" /*a2 * b7*/
 "PMULUDQ %%MM5, %%MM7\n" /*a3 * b6*/
 "MOVD 16(%0), %%MM0\n"  /*a4 */
 "MOVD 20(%0), %%MM1\n" /* a5 */
 "PADDQ %%MM2, %%MM3\n"
 "PADDQ %%MM6, %%MM7\n"
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 20(%1), %%MM2\n"  /*b5 */
 "MOVD 16(%1), %%MM3\n" /* b4 */
 "PMULUDQ %%MM0, %%MM2\n" /*a4 * b5*/
 "PMULUDQ %%MM1, %%MM3\n" /*a5 * b4*/
 "MOVD 24(%0), %%MM0\n"  /*a6 */
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 28(%0), %%MM1\n" /* a7 */
 "PADDQ %%MM3, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD 12(%1), %%MM2\n"  /*b3 */
 "MOVD  8(%1), %%MM3\n" /* b2 */
 "PMULUDQ %%MM0, %%MM2\n" /*a6 * b3*/
 "PMULUDQ %%MM1, %%MM3\n" /*a7 * b2*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 32(%0), %%MM0\n"  /*a8 */
 "MOVD 36(%0), %%MM1\n" /* a9 */
 "PADDQ %%MM3, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD  4(%1), %%MM2\n"  /*b1 */
 "MOVD  0(%1), %%MM3\n" /* b0 */
 "PMULUDQ %%MM0, %%MM2\n" /*a8 * b1*/
 "PMULUDQ %%MM1, %%MM3\n" /*a9 * b0*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD  4(%1), %%MM2\n"  /*b1 */
 "PADDQ %%MM3, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD  8(%1), %%MM3\n" /* b2 */
 "MOVQ %%MM7, 0(%2)\n"    /* extract result[0] */
  
  /* Part #2: The multiplications and additions of
  *
  *
  d += (uint64_t)a[1] * b[9]
       + (uint64_t)a[2] * b[8]
       + (uint64_t)a[3] * b[7]
       + (uint64_t)a[4] * b[6]
       + (uint64_t)a[5] * b[5]
       + (uint64_t)a[6] * b[4]
       + (uint64_t)a[7] * b[3]
       + (uint64_t)a[8] * b[2]
       + (uint64_t)a[9] * b[1]; */
 
 "PMULUDQ %%MM1, %%MM2\n" /*a9 * b1*/
 "PMULUDQ %%MM0, %%MM3\n" /*a8 * b2*/
 "MOVD 28(%1), %%MM6\n"  /*b7 */
 "MOVD 32(%1), %%MM7\n" /* b8 */
 "PMULUDQ %%MM5, %%MM6\n" /*a3 * b7*/
 "PMULUDQ %%MM4, %%MM7\n" /*a2 * b8*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD  4(%0), %%MM0\n"  /*a1 */
 "MOVD 36(%1), %%MM2\n"  /*b9 */
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 16(%0), %%MM1\n" /* a4 */
 "PADDQ %%MM6, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 24(%1), %%MM3\n" /* b6 */
 "PMULUDQ %%MM0, %%MM2\n" /*a1 * b9*/
 "PMULUDQ %%MM1, %%MM3\n" /*a4 * b6*/
 "MOVD 28(%0), %%MM4\n" /* a7 */
 "MOVD 12(%1), %%MM5\n" /* b3 */
 "PADDQ %%MM2, %%MM7\n"
 "MOVD 20(%0), %%MM0\n"  /*a5 */
 "MOVD 24(%0), %%MM1\n" /* a6 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 20(%1), %%MM2\n"  /*b5 */
 "MOVD 16(%1), %%MM3\n" /* b4 */
 "PMULUDQ %%MM0, %%MM2\n" /*a5 * b5*/
 "PMULUDQ %%MM1, %%MM3\n" /*a6 * b4*/
 "PMULUDQ %%MM4, %%MM5\n" /*a7 * b3*/
 "MOVD 20(%1), %%MM6\n" /* b5 */
 "PADDQ %%MM2, %%MM7\n"
 "MOVD 16(%0), %%MM2\n"  /*a4 */
 "PADDQ %%MM5, %%MM7\n"
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 24(%1), %%MM5\n"  /*b6 */
 "MOVQ %%MM7, 8(%2)\n"   /* extract result[1] */  
 
   /* Part #3: The multiplications and additions of
  *
  *     d += (uint64_t)a[2] * b[9]
       + (uint64_t)a[3] * b[8]
       + (uint64_t)a[4] * b[7]  
       + (uint64_t)a[5] * b[6]
       + (uint64_t)a[6] * b[5]
       + (uint64_t)a[7] * b[4]
       + (uint64_t)a[8] * b[3]
       + (uint64_t)a[9] * b[2];*/

 "PMULUDQ %%MM1, %%MM6\n" /*a6 * b5*/
 "MOVD 16(%1), %%MM7\n"  /*b4 */
 "PMULUDQ %%MM0, %%MM5\n" /*a5 * b6*/
 "PMULUDQ %%MM4, %%MM7\n" /*a7 * b4*/
 "MOVD 36(%1), %%MM3\n"  /*b9 */
 "MOVD 12(%0), %%MM1\n" /* a3 */
 "PADDQ %%MM6, %%MM5\n"
 "MOVD 32(%1), %%MM4\n" /* b8 */
 "PADDQ %%MM5, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD  8(%0), %%MM0\n"  /* a2 */
 "MOVD 28(%1), %%MM5\n"  /*b7 */
 "PMULUDQ %%MM1, %%MM4\n" /*a3 * b8*/
 "PMULUDQ %%MM2, %%MM5\n" /*a4 * b7*/
 "PMULUDQ %%MM0, %%MM3\n" /*a2 * b9*/
 "MOVD 12(%1), %%MM6\n"  /*b3 */
 "PADDQ %%MM4, %%MM7\n"
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 36(%0), %%MM4\n" /* a9 */
 "MOVD 32(%0), %%MM5\n"  /*a8 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD  8(%1), %%MM3\n"  /*b2 */
 "PMULUDQ %%MM5, %%MM6\n" /*a8 * b3*/
 "PMULUDQ %%MM4, %%MM3\n" /*a9 * b2  - (order is b2 * a9) */
 "MOVD 12(%1), %%MM0\n"  /*b3 */
 "PADDQ %%MM6, %%MM7\n"
 "MOVD 32(%1), %%MM6\n"  /*b8 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/  
 "MOVD 16(%1), %%MM3\n"  /*b4 */
 "MOVQ %%MM7, 16(%2)\n"   /* extract result[2] */  
 
  /* Part #4: The multiplications and additions of
  *
  *
  *    d += (uint64_t)a[3] * b[9]
       + (uint64_t)a[4] * b[8]
       + (uint64_t)a[5] * b[7]
       + (uint64_t)a[6] * b[6]
       + (uint64_t)a[7] * b[5]
       + (uint64_t)a[8] * b[4]
       + (uint64_t)a[9] * b[3]; */

 "PMULUDQ %%MM4, %%MM0\n" /*a9 * b3*/
 "MOVD 36(%1), %%MM7\n" /* b9 */
 "PMULUDQ %%MM5, %%MM3\n" /*a8 * b4*/
 "PMULUDQ %%MM2, %%MM6\n" /*a4 * b8*/
 "PMULUDQ %%MM1, %%MM7\n" /*a3 * b9*/  
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 24(%1), %%MM4\n"  /*b6 */
 "MOVD 20(%1), %%MM5\n" /* b5 */
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 20(%0), %%MM0\n"  /*a5 */
 "MOVD 28(%0), %%MM2\n"  /*a7 */
 "PADDQ %%MM6, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 24(%0), %%MM6\n"  /*a6 */
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "PMULUDQ %%MM2, %%MM5\n" /*a7 * b5*/
 "PMULUDQ %%MM6, %%MM4\n" /*a6 * b6 */
 "PMULUDQ %%MM0, %%MM3\n" /*a5 * b7*/
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 16(%0), %%MM1\n"  /*a4 */  
 "PADDQ %%MM4, %%MM7\n"
 "MOVD 32(%1), %%MM5\n"  /*b8 */  
 "PADDQ %%MM3, %%MM7\n"
 "MOVD 28(%1), %%MM4\n"  /*b7 */
 "MOVQ %%MM7, 24(%2)\n"   /* extract result[3] */  
 
  /* Part #5: The multiplications and additions of
  *  
        d += (uint64_t)a[4] * b[9]
       + (uint64_t)a[5] * b[8]
       + (uint64_t)a[6] * b[7]
       + (uint64_t)a[7] * b[6]
       + (uint64_t)a[8] * b[5]
       + (uint64_t)a[9] * b[4];  */
    
 "PMULUDQ %%MM6, %%MM4\n" /*a6 * b7 */
 "MOVD 24(%1), %%MM3\n" /* b6 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */
 "PMULUDQ %%MM0, %%MM5\n" /*a5 * b8*/
 "PMULUDQ %%MM2, %%MM3\n" /*a7 * b6*/
 "PMULUDQ %%MM1, %%MM7\n" /*a4 * b9*/
 "PADDQ %%MM4, %%MM5\n"
 "MOVD 36(%0), %%MM4\n"  /*a9 */
 "PADDQ %%MM5, %%MM3\n"
 "MOVD 20(%1), %%MM5\n"  /*b5 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 32(%0), %%MM3\n" /* a8 */
 "MOVD 16(%1), %%MM1\n"  /*b4 */
 "PMULUDQ %%MM3, %%MM5\n" /*a8 * b5 */
 "PMULUDQ %%MM4, %%MM1\n" /*a9 * b4 */
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "MOVD 32(%1), %%MM5\n"  /*b8 */  
 "PADDQ %%MM1, %%MM7\n"
 "MOVQ %%MM7, 32(%2)\n"   /* extract result[4] */  
 
  /* Part #6: The multiplications and additions of
  *
  *
  *   d += (uint64_t)a[5] * b[9]
       + (uint64_t)a[6] * b[8]
       + (uint64_t)a[7] * b[7]
       + (uint64_t)a[8] * b[6]
       + (uint64_t)a[9] * b[5];
       */
  
 "PMULUDQ %%MM2, %%MM3\n" /*a7 * b7*/
 "MOVD 20(%1), %%MM1\n" /* b5 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */
 "PMULUDQ %%MM6, %%MM5\n" /*a6 * b8*/
 "PMULUDQ %%MM4, %%MM1\n" /*a9 * b5 */
 "PMULUDQ %%MM0, %%MM7\n" /*a5 * b9*/
 "PADDQ %%MM3, %%MM5\n"
 "MOVD 24(%1), %%MM3\n"  /*b6 */
 "PADDQ %%MM1, %%MM5\n"
 "MOVD 32(%0), %%MM1\n" /* a8 */
 "PADDQ %%MM5, %%MM7\n" /*keeping result additions in mm7*/
 "PMULUDQ %%MM1, %%MM3\n" /*a8 * b6 */
 "MOVD 24(%1), %%MM0\n" /* b6 */
 "MOVD 32(%1), %%MM5\n"  /*b8 */  
 "PADDQ %%MM3, %%MM7\n"
 "MOVQ %%MM7, 40(%2)\n"   /* extract result[5] */  
 
   /* Part #7: The multiplications and additions of
  *
  *
  *    d += (uint64_t)a[6] * b[9]
       + (uint64_t)a[7] * b[8]
       + (uint64_t)a[8] * b[7]
       + (uint64_t)a[9] * b[6]; */
 
 "PMULUDQ %%MM4, %%MM0\n" /*a9 * b6 */
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */  
 "PMULUDQ %%MM2, %%MM5\n" /*a7 * b8*/
 "PMULUDQ %%MM1, %%MM3\n" /*a8 * b7*/
 "PMULUDQ %%MM6, %%MM7\n" /*a6 * b9*/
 "PADDQ %%MM0, %%MM5\n"
 "PADDQ %%MM3, %%MM7\n"
 "MOVD 24(%1), %%MM0\n" /* b6 */
 "PADDQ %%MM5, %%MM7\n" /*adding results to mm7 */
 "MOVQ %%MM7, 48(%2)\n"   /* extract result[6] */
 
 
 /* Part #8: The multiplications and additions of 3 separate results
  *
  *  
       d += (uint64_t)a[7] * b[9]
       + (uint64_t)a[8] * b[8]
       + (uint64_t)a[9] * b[7];        result 7

       d += (uint64_t)a[8] * b[9]
       + (uint64_t)a[9] * b[8];        result 8

      d += (uint64_t)a[9] * b[9];    result 9 */
    
  
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "MOVD 32(%1), %%MM5\n"  /*b8 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */  
 "PMULUDQ %%MM4, %%MM3\n" /*a9 * b7 */
 "PMULUDQ %%MM1, %%MM5\n" /*a8 * b8*/
 "MOVQ %%MM7, %%MM6\n"  /*b9 */  
 "PMULUDQ %%MM2, %%MM7\n" /*a7 * b9*/
 "PADDQ %%MM3, %%MM5\n"
 "PADDQ %%MM5, %%MM7\n"
 "MOVQ %%MM6, %%MM3\n"  /*b9 */  
 "MOVD 32(%1), %%MM5\n"  /*b8 */
 "MOVQ %%MM7, 56(%2)\n"   /* extract result[7] */
 "PMULUDQ %%MM1, %%MM6\n" /*a8 * b9*/  
 "PMULUDQ %%MM4, %%MM5\n" /*a9 * b8*/
 "PMULUDQ %%MM4, %%MM3\n" /*a9 * b9*/
 "MOVD 8(%0), %%MM7\n" /*a2*/
 "PADDQ %%MM5, %%MM6\n"
 "MOVQ %%MM3,  72(%2)\n"   /* extract result[9] */
 "MOVQ %%MM6,  64(%2)\n"   /* extract result[8] */
 
  /* Part #9: The multiplications and additions of
  *
  *       c += (uint64_t)a[0] * b[8]
       + (uint64_t)a[1] * b[7]  
       + (uint64_t)a[2] * b[6]  
       + (uint64_t)a[3] * b[5]  
       + (uint64_t)a[4] * b[4]
       + (uint64_t)a[5] * b[3]
       + (uint64_t)a[6] * b[2]
       + (uint64_t)a[7] * b[1]
       + (uint64_t)a[8] * b[0];   */
 
  
 "PMULUDQ %%MM7, %%MM0\n" /*a2 * b6 */
 "MOVD  0(%1), %%MM3\n" /* b0 */
 "PMULUDQ %%MM1, %%MM3\n" /*a8 * b0*/
 "MOVD  4(%1), %%MM7\n" /* b1 */
 "PMULUDQ %%MM2, %%MM7\n" /*a7 * b1*/
 "MOVD 4(%0), %%MM1\n" /*a1*/
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 20(%1), %%MM4\n"  /*b5*/  
 "PADDQ %%MM3, %%MM7\n"  
 "MOVD 12(%0), %%MM2\n"  /*a3*/
 "MOVD 28(%1), %%MM5\n"  /*b7*/
 "PMULUDQ %%MM2, %%MM4\n" /*a3 * b5*/
 "MOVD 16(%0), %%MM3\n"  /*a4*/
 "MOVD 16(%1), %%MM6\n" /*b4*/
 "PMULUDQ %%MM1, %%MM5\n" /*a1 * b7*/
 "MOVD 0(%0), %%MM0\n" /*a0*/
 "PMULUDQ %%MM6, %%MM3\n" /*b4 * a4*/
 "MOVD 32(%1), %%MM6\n"  /*b8*/
 "PMULUDQ %%MM0, %%MM6\n" /*a0 * b8*/
 "PADDQ %%MM4, %%MM5\n"
 "MOVD 24(%0), %%MM4\n" /*a6*/
 "PADDQ %%MM6, %%MM3\n"
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 8(%1), %%MM6\n"  /*b2*/
 "PADDQ %%MM3, %%MM7\n"
 "MOVD 12(%1), %%MM5\n"  /*b3*/
 "MOVD 20(%0), %%MM3\n" /*a5*/
 "PMULUDQ %%MM4, %%MM6\n" /*a6 * b2*/
 "PMULUDQ %%MM3, %%MM5\n" /*a5 * b3*/
 "PADDQ %%MM6, %%MM7\n"
 "MOVD 0(%1), %%MM4\n" /*b0*/
 "PADDQ %%MM5, %%MM7\n" /*addition results on mm7*/
 "MOVD 8(%0), %%MM3\n"  /*a2*/  
 "MOVD 4(%1), %%MM5\n"  /*b1*/
 "MOVQ %%MM7, 80(%2)\n"  /* extract result[10] */
 
 
   /* Part #10: The multiplications and additions of
  *
  *         c += (uint64_t)a[0] * b[3]
       + (uint64_t)a[1] * b[2]
       + (uint64_t)a[2] * b[1]
       + (uint64_t)a[3] * b[0];       result11  
  
       c += (uint64_t)a[0] * b[1]
       + (uint64_t)a[1] * b[0];       result12

        c  = (uint64_t)a[0] * b[0];  result13      
      
           c += (uint64_t)a[0] * b[2]
       + (uint64_t)a[1] * b[1]
       + (uint64_t)a[2] * b[0];       result14   */
        

 "PMULUDQ  %%MM4, %%MM2\n" /*b0 * a3*/
 "PMULUDQ  %%MM5, %%MM3\n" /*b1 * a2*/
 "MOVD 8(%1), %%MM7\n"  /*b2*/
 "MOVD 12(%1), %%MM6\n"  /*b3*/
 "PMULUDQ  %%MM7, %%MM1\n" /*b2 * a1*/
 "PMULUDQ  %%MM6, %%MM0\n" /*b3 * a0*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 8(%1), %%MM2\n"  /*b2*/  
 "PADDQ %%MM1, %%MM0\n"  
 "MOVD 4(%1), %%MM1\n"  /*b1*/  
 "PADDQ %%MM0, %%MM3\n"  
 "MOVD 0(%1), %%MM0\n"  /*b0*/  
 "MOVQ %%MM3, 88(%2)\n"  /* extract result[11] */  
 
 "MOVD 0(%0), %%MM4\n" /*a0*/
 "MOVQ %%MM0, %%MM3\n" /*b0*/
 "MOVD 4(%0), %%MM5\n"  /*a1*/
 "MOVD 8(%0), %%MM7\n"  /*a2*/
 "MOVQ %%MM1, %%MM6\n" /*b1*/
 "PMULUDQ %%MM5, %%MM3\n" /*a1 * b0*/
 "PMULUDQ %%MM4, %%MM6\n" /*a0 * b1*/
 "PADDQ %%MM3, %%MM6\n"
 "MOVQ %%MM0, %%MM3\n" /*b0*/
 "MOVQ %%MM6, 96(%2)\n"  /* extract result[12] */  
 "PMULUDQ %%MM4, %%MM3\n" /*a0 * b0*/
 "PMULUDQ %%MM4, %%MM2\n" /*a0 * b2*/
 "MOVQ %%MM3, 104(%2)\n"  /* extract result[13] */    
 "MOVQ %%MM1, %%MM6\n" /*b1*/
 "PMULUDQ %%MM5, %%MM6\n" /*a1 * b1*/
 "MOVQ %%MM0, %%MM3\n" /*b0*/
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b0*/  
 "PADDQ %%MM2, %%MM6\n"
 "PADDQ %%MM6, %%MM3\n"
 "MOVD 16(%1), %%MM2\n"  /*b4*/  
 "MOVQ %%MM3, 112(%2)\n"  /* extract result[14] */  
 
  /* Part #11: The multiplications and additions of
  *
  *  
     c += (uint64_t)a[0] * b[4]  
       + (uint64_t)a[1] * b[3]  
       + (uint64_t)a[2] * b[2]
       + (uint64_t)a[3] * b[1]
       + (uint64_t)a[4] * b[0]  */

 "PMULUDQ %%MM4, %%MM2\n" /*a0 * b4 */
 "MOVD 16(%0), %%MM3\n"  /*a4*/    
 "MOVD 12(%0), %%MM6\n" /* a3*/
 "PMULUDQ %%MM0, %%MM3\n" /*b0 * a4 */
 "PMULUDQ %%MM1, %%MM6\n" /*b1 * a3 */  
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 12(%1), %%MM2\n"  /*b3*/    
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 8(%1), %%MM3\n"  /*b2*/  
 "PMULUDQ %%MM5, %%MM2\n" /*a1 * b3 */
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b2 */
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 20(%1), %%MM2\n"  /*b5*/    
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 20(%0), %%MM3\n"  /*a5*/    
 "MOVQ %%MM6, 120(%2)\n"  /* extract result[15] */  
 
   /* Part #12: The multiplications and additions of
  *
  *       c += (uint64_t)a[0] * b[5]
       + (uint64_t)a[1] * b[4]
       + (uint64_t)a[2] * b[3]
       + (uint64_t)a[3] * b[2]
       + (uint64_t)a[4] * b[1]
       + (uint64_t)a[5] * b[0] */  
  
 "PMULUDQ %%MM4, %%MM2\n" /*a0 * b5 */
 "MOVD 16(%0), %%MM6\n" /* a4*/
 "PMULUDQ %%MM0, %%MM3\n" /*b0 * a5 */
 "PMULUDQ %%MM1, %%MM6\n" /*b1 * a4*/  
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 16(%1), %%MM2\n"  /*b4*/  
 "PADDQ %%MM3, %%MM6\n" /*adding results to mm6*/
 "MOVD 12(%1), %%MM3\n"  /*b3*/    
 "PMULUDQ %%MM5, %%MM2\n" /*a1 * b4 */
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b3 */
 "MOVD 8(%1), %%MM0\n"  /*b2*/  
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 12(%0), %%MM2\n" /* a3*/
 "PADDQ %%MM3, %%MM6\n"
 "PMULUDQ %%MM0, %%MM2\n" /*b2 * a3 */    
 "MOVD 24(%0), %%MM3\n"  /*a6*/  
 "MOVD 0(%1), %%MM0\n"  /*b0*/  
 "PADDQ %%MM2, %%MM6\n" /* all additions end up in mm6*/
 "MOVD 24(%1), %%MM2\n"  /*b6*/  
 "MOVQ %%MM6, 128(%2)\n"  /* extract result[16] */  
  
    /* Part #13: The multiplications and additions of
  *  
  *     c += (uint64_t)a[0] * b[6]
       + (uint64_t)a[1] * b[5]
       + (uint64_t)a[2] * b[4]
       + (uint64_t)a[3] * b[3]
       + (uint64_t)a[4] * b[2]
       + (uint64_t)a[5] * b[1]
       + (uint64_t)a[6] * b[0];   */
  
 "PMULUDQ %%MM0, %%MM3\n" /*a6 * b0 */
 "MOVD 20(%0), %%MM6\n" /* a5*/  
 "PMULUDQ %%MM4, %%MM2\n" /*b6 * a0 */
 "PMULUDQ %%MM1, %%MM6\n" /*a5 * b1*/    
 "PADDQ %%MM2, %%MM3\n"
 "PADDQ %%MM3, %%MM6\n" /*adding all results on mm6*/
 "MOVD 20(%1), %%MM2\n"  /*b5*/  
 "MOVD 16(%1), %%MM3\n"  /*b4*/    
 "PMULUDQ %%MM5, %%MM2\n" /*a1 * b5*/
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b4*/
 "MOVD 8(%1), %%MM4\n"  /*b2*/  
 "MOVD 12(%1), %%MM1\n"  /*b3*/  
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 12(%0), %%MM2\n"  /*a3*/  
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 16(%0), %%MM3\n"  /*a4*/    
 "PMULUDQ %%MM1, %%MM2\n" /*b3 * a3 */
 "PMULUDQ %%MM4, %%MM3\n" /*b2 * a4 */
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 4(%1), %%MM1\n"  /*b1*/
 "MOVD 24(%0), %%MM2\n" /*a6*/
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 0(%0), %%MM4\n"  /*a0*/  
 "MOVQ %%MM6, 136(%2)\n"  /* extract result[17] */  
  
 /* Part #14: The multiplications and additions of
  *  
  *       c += (uint64_t)a[0] * b[7]
       + (uint64_t)a[1] * b[6]  
       + (uint64_t)a[2] * b[5]
       + (uint64_t)a[3] * b[4]
       + (uint64_t)a[4] * b[3]
       + (uint64_t)a[5] * b[2]
       + (uint64_t)a[6] * b[1]  
       + (uint64_t)a[7] * b[0];    */
      
 "PMULUDQ %%MM2, %%MM1\n" /*a6 * b1 */
 "MOVD 28(%0), %%MM6\n" /*a7*/
 "MOVD 28(%1), %%MM3\n" /*b7*/  
 "PMULUDQ %%MM6, %%MM0\n" /*a7 * b0 */
 "PMULUDQ %%MM3, %%MM4\n" /*b7 * a0*/    
 "MOVD 24(%1), %%MM6\n" /*b6*/
 "MOVD 20(%1), %%MM2\n" /*b5*/
 "PMULUDQ %%MM6, %%MM5\n" /*b6 * a1 */
 "PMULUDQ %%MM2, %%MM7\n" /*b5 * a2 */
 "PADDQ %%MM0, %%MM1\n"
 "MOVQ 8(%2), %%MM3\n"/*prefetch result1*/
 "PADDQ %%MM4, %%MM5\n"
 "MOVD 12(%0), %%MM0\n" /*a3*/
 "MOVD  8(%1), %%MM6\n" /*b2*/
 "PADDQ %%MM1, %%MM5\n"
 "MOVD 20(%0), %%MM2\n" /*a5*/
 "MOVD 12(%1), %%MM4\n" /*b3*/
 "PADDQ %%MM5, %%MM7\n"
 "PMULUDQ %%MM2, %%MM6\n" /*a5 * b2 */
 "MOVD 16(%0), %%MM1\n" /*a4*/
 "MOVD 16(%1), %%MM5\n" /*b4*/
 "PMULUDQ %%MM1, %%MM4\n" /*a4 * b3 */
 "PMULUDQ %%MM0, %%MM5\n" /*a3 * b4*/    
 "PADDQ %%MM6, %%MM4\n"
 "MOVD 0(%4), %%MM2\n"  /*prefetch M to MM2 */
 "PADDQ %%MM4, %%MM5\n"
 "PADDQ %%MM7, %%MM5\n"
 "MOVQ 0(%2), %%MM6\n" /* prefetch d in from result[0]*/
 "MOVQ %%MM2, %%MM0\n" /*M secondary storage */
 "MOVQ %%MM5, 144(%2)\n"  /* extract result[18] */  
  

  "MOVQ 104(%2), %%MM7\n" /* c in from result[13] */
  "PAND %%MM6, %%MM2\n"   /* r[9] = d & M;  */
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */    
  "MOVD 4(%4), %%MM4\n"  /*R0 to MM4 */  
  "MOVD %%MM2, 36(%3)\n" /* extract r[9] = d & M;  */
  "PADDQ %%MM3, %%MM6\n"  /*  d += (uint64_t)result[1]  */
  "MOVQ %%MM0, %%MM2\n"  /* M back to mm2 */
  
  "MOVQ 16(%2), %%MM3\n"  /*prefetch result2*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[2]  */
  "MOVQ 96(%2), %%MM5\n"  /* prefetch result12 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 0(%3)\n" /* exporting t0/r[0] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[12] */    
  
  "MOVQ 24(%2), %%MM3\n"/*prefetch result3*/  
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[3]  */
  "MOVQ 112(%2), %%MM5\n"  /* prefetch result14 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 4(%3)\n" /* exporting t1/r[1] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[14] */    
  
  "MOVQ 32(%2), %%MM3\n"/*prefetch result4*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[4]  */
  "MOVQ 88(%2), %%MM5\n"  /* prefetch result11 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 8(%3)\n" /* exporting t2/r[2] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[11] */    

  "MOVQ 40(%2), %%MM3\n"/*prefetch result5*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[5]  */
  "MOVQ 120(%2), %%MM5\n"  /* prefetch result15 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 12(%3)\n" /* exporting t3/r[3] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */    
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[15] */    
  
  "MOVQ 48(%2), %%MM3\n"/*prefetch result6*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[6]  */
  "MOVQ 128(%2), %%MM5\n"  /* prefetch result16 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 16(%3)\n" /* exporting t4/r[4] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[16] */    

  "MOVQ 56(%2), %%MM3\n"/*prefetch result7*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[7]  */
   "MOVQ 136(%2), %%MM5\n"  /* prefetch result17 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 20(%3)\n" /* exporting t5/r[5] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */    
   "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[17] */      
  
  "MOVQ 64(%2), %%MM3\n"/*prefetch result8*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[8]  */
  "MOVQ 144(%2), %%MM5\n"  /* prefetch result18 */        
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 24(%3)\n" /* exporting t6/r[6] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[18] */      
  
  "MOVQ 72(%2), %%MM3\n"/*prefetch result9*/  
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[9]  */
  "MOVQ 80(%2), %%MM5\n"  /* prefetch result10 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 28(%3)\n" /* exporting t7/r[7] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */    
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[10] */      
  
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "MOVQ %%MM0, %%MM2\n" /*cloning M to mm2*/
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 32(%3)\n" /* exporting t8/r[8] = c & M */

  "PMULUDQ %%MM6, %%MM4\n" /* d * R0 */
  "PSRLQ $4, %%MM0\n"    /*  M >>= 4 */  
  "MOVD 36(%3), %%MM1\n" /* R[9] in */    
  "PADDQ %%MM4, %%MM7\n" /* c+=  d * R0 */
  "PADDQ %%MM1, %%MM7\n" /* c+=  r[9]     ===> c+= d * R0 + r[9]*/  

  "PAND %%MM7, %%MM0\n"  /* c & (M >> 4) */
  "MOVD %%MM0, 36(%3)\n"  /*  r[9] = c & (M >> 4) */

  "PSRLQ $22, %%MM7\n"    /*  c >>= 22 */  
  "PSLLQ $14, %%MM6\n" /* d * (R1 << 4). Since (R1 << 4) equals 16384, it's essentially a left shift by 14 */
  "PADDQ %%MM6, %%MM7\n" /* c += d * (R1 << 4) */
 
  
  "MOVQ %%MM7, %%MM3\n" /*cloning c*/
  "PSLLQ $6, %%MM7\n" /*  result of c * (R1 >> 4) which equals c shifted left by 6, since (R1 >> 4) = 64 */    

  "MOVQ %%MM3, %%MM0\n"     /*this is a manual attempt at multiplying c with x3D1 or 977 decimal, by shifting and adding copies of c ...*/
  "MOVQ %%MM3, %%MM1\n"     /*all this segment, is, in reality, just a (c*977) single line multiplication */
  "MOVQ %%MM3, %%MM6\n"     /* which for some reason doesn't want to work otherwise with a plain PMULUDQ c * 977 constant  */
  "MOVQ %%MM3, %%MM4\n"
  "MOVQ %%MM3, %%MM5\n"  
  "PSLLQ $9, %%MM0\n" /* x512 */    
  "PSLLQ $8, %%MM1\n" /* x256 */    
  "PSLLQ $7, %%MM6\n" /* x128 */      
  "PSLLQ $6, %%MM4\n" /* x64 */        
  "PSLLQ $4, %%MM5\n" /* x16 */        /*512+256+128+64+16 = 976x, so +1 add on top = 977 or 0x3D1 */
  "PADDQ %%MM3, %%MM0\n"
  "PADDQ %%MM1, %%MM6\n"
  "MOVD 0(%3), %%MM3\n"  /*prefetch r[0] to MM3 */    
  "PADDQ %%MM4, %%MM0\n"
  "PADDQ %%MM6, %%MM0\n"  
  "PADDQ %%MM0, %%MM5\n"     /*  result of c * (R0 >> 4) */    
  
  "PADDQ %%MM3, %%MM5\n"  /* d = r[0] + c (R0 >> 4) */
  "MOVD 4(%3), %%MM4\n"  /*r[1] to MM4 */  
   "MOVD 8(%3), %%MM0\n"  /*r[2] to MM0 */    
  "MOVQ %%MM5, %%MM3\n" /*cloning d */

  "PAND %%MM2, %%MM5\n" /*d&M*/
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */    
  "PADDQ %%MM7, %%MM4\n" /* c * (R1 >> 4) + r[1] */
  "PADDQ %%MM4, %%MM3\n" /*d   += c * (R1 >> 4) + r[1];  */
  "MOVD %%MM5, 0(%3)\n" /* export d to r[0] */
  "MOVQ %%MM3, %%MM7\n" /*cloning d */
  
  "PAND %%MM2, %%MM7\n" /*d&M*/
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */      
  "PADDQ %%MM0, %%MM3\n"  /*d   += r[2];*/
  "MOVD %%MM7, 4(%3)\n"  /*r[1] = d & M; */
  "MOVD %%MM3, 8(%3)\n" /*r[2]=d;*/
  "EMMS\n"

:
: "q"(a), "q"(b), "q"(result), "q"(r), "S"(tempstor)
: "memory", "%mm0", "%mm1", "%mm2", "%mm3", "%mm4", "%mm5", "%mm6", "%mm7"
);
}

staff
Activity: 4284
Merit: 8808
April 22, 2017, 10:50:09 PM
#4
You've missed part of my post:

" just not gratuitously slower. It is also optimized for the specific kinds of operations which are performed in elliptic curve crypto-- which seldom consists of repeatedly multiplying values in a loop, but instead has sequences of multiplications and additions and the FE representation is built to make these additions very fast. "

The library was originally written using GMP for field arithmetic, then reduced to using it for verification only, then reduced to not using it at all when the new code was faster for that actual usage on most hosts.

The verification angle also has two other issues you aren't considering. One is consensus consistency: use of GMP in validation would make the exact behavior of GMP consensus-critical, which is a problem because different systems run different code. The other is that GMP has a license which is more restrictive than the Bitcoin software's.

If you don't mean using GMP but just using different operations for non-sidechannel sensitive paths-- the library already does that in many places-- though not for FE mul/add. FE normalizes do take variable time in verification. If you have a speedup based on that feel free to submit it!
legendary
Activity: 1708
Merit: 1049
April 22, 2017, 10:59:25 AM
#3
libsecp256k1 is a crypto library-- not a bignum library.

Its primitive operations are constant time for sidechannel resistance, they're not expected to be faster than GMP for most things-- just not gratuitously slower.

From what I understand, sidechannel resistance is required for signing so that the private key is not leaked, right? Verification involves only public data, so it wouldn't matter (in the case of BTC).

If that is the case, and the rationale is correct, then for an app like bitcoin with asymmetrical verification/signing loads (you need to verify thousands upon thousands of signatures to sync the blockchain for the last few days, yet you only send money a couple of times), it would probably make sense to have duplicate operations:

-One which isn't really side-channel resistant and is very fast and
-one which isn't very fast (or is even very slow) but is extremely hardened against side-channel attacks, so that you won't lose money.

Depending what you want to do (syncing the blockchain or signing a new tx), you call different functions and enjoy the best of both worlds: Maximum speed when you need to verify txs in bulk, and maximum security when you sign a tx.

Am I wrong somewhere?
staff
Activity: 4284
Merit: 8808
April 19, 2017, 12:20:37 AM
#2
libsecp256k1 is a crypto library-- not a bignum library.

Its primitive operations are constant time for sidechannel resistance, they're not expected to be faster than GMP for most things-- just not gratuitously slower. It is also optimized for the specific kinds of operations which are performed in elliptic curve crypto-- which seldom consists of repeatedly multiplying values in a loop, but instead has sequences of multiplications and additions and the FE representation is built to make these additions very fast.

Your benchmarking technique which makes no use of most of the output is also likely to get poor results due to parts of the benchmark being optimized out.

The functions you are calling are not exposed parts of the library, are not stable interfaces, and shouldn't be used by external software-- as I mentioned, it's not a bignum library. The functions that it exports are defined in include/secp256k1.h ... you are bypassing the public interface by including the .c file.

As an aside, why would you ever want to perform arithmetic over 2^256 - 4294968273? The group order would be unsurprising...
legendary
Activity: 1932
Merit: 2077
April 18, 2017, 09:10:40 AM
#1
Hi,
I want to exploit the speed of the field operations in libsecp256k1 to make faster operations modulo p (p=0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F).

I wrote a simple program to compute a multiplication of two 256-bit numbers; it performs the multiplication Gx*Gy mod p 100M times.

There are 2 versions: A (gmp library), B (secp256k1 library)

The problem is that at the moment I can get better results with gmp library than libsecp256k1 (at least +10%):

Program A (gmp library):  

Code:
---@ubuntu:~/src/var/mul$ time ./a.out

Result: fd3dc529c6eb60fb 9d166034cf3c1a5a 72324aa9dfd3428a 56d7e1ce0179fd9b

real 0m5.399s
user 0m5.324s
sys 0m0.012s


Program B (secp256k1 library):
Code:
---@ubuntu:~/src/var/mul$ time ./a.out

Result: fd 3d c5 29 c6 eb 60 fb 9d 16 60 34 cf 3c 1a 5a 72 32 4a a9 df d3 42 8a 56 d7 e1 ce 1 79 fd 9b

real 0m6.076s
user 0m5.988s
sys 0m0.004s


Shouldn't it be the other way around?  Is there something wrong in my code?




Program A (gmp library) :
Code:
#include <gmp.h>
#include <stdio.h>

static inline void red(unsigned long int* a) {

        /* ... performs "a mod p" */

}

int main(){

        unsigned long int i;
        unsigned long int a[8];

        unsigned long int Gx[4] = {0x59f2815b16f81798, 0x029bfcdb2dce28d9, 0x55a06295ce870b07, 0x79be667ef9dcbbac};
        unsigned long int Gy[4] = {0x9c47d08ffb10d4b8, 0xfd17b448a6855419, 0x5da4fbfc0e1108a8, 0x483ada7726a3c465};

        for(i=0;i<100000000;i++) {
                mpn_mul_n(a, Gx, Gy, 4); // #1M
                red(a);
        }

        printf("Result: %lx %lx %lx %lx\n", a[3], a[2], a[1], a[0]);

        return 0;
}

Program B (libsecp256k1 library) :
Code:
//Compile with:
//gcc -O2 -I secp256k1/src/ -I secp256k1/  mul.c   -lgmp

#include <stdio.h>
#include "libsecp256k1-config.h"
#include "include/secp256k1.h"
#include "secp256k1.c"

int main(){

        const unsigned char Gx[32]={0x79,0xBE,0x66,0x7E,0xF9,0xDC,0xBB,0xAC,
                                    0x55,0xA0,0x62,0x95,0xCE,0x87,0x0B,0x07,
                                    0x02,0x9B,0xFC,0xDB,0x2D,0xCE,0x28,0xD9,
                                    0x59,0xF2,0x81,0x5B,0x16,0xF8,0x17,0x98};

        const unsigned char Gy[32]={0x48,0x3A,0xDA,0x77,0x26,0xA3,0xC4,0x65,
                                    0x5D,0xA4,0xFB,0xFC,0x0E,0x11,0x08,0xA8,
                                    0xFD,0x17,0xB4,0x48,0xA6,0x85,0x54,0x19,
                                    0x9C,0x47,0xD0,0x8F,0xFB,0x10,0xD4,0xB8};

        secp256k1_fe a,b,r;
        unsigned char c[32];
        unsigned long i;

        secp256k1_fe_set_b32(&a, Gx);
        secp256k1_fe_set_b32(&b, Gy);

        for(i=0;i<100000000;i++) { //100 000 000
                secp256k1_fe_mul(&r, &a, &b);
                //secp256k1_fe_normalize_var(&r);
        }

        secp256k1_fe_get_b32(c, &r);

        printf("Result: %x %x %x %x %x %x %x %x %x %x %x %x %x %x %x %x "
               "%x %x %x %x %x %x %x %x %x %x %x %x %x %x %x %x\n",
               c[0],c[1],c[2],c[3],c[4],c[5],c[6],c[7],
               c[8],c[9],c[10],c[11],c[12],c[13],c[14],c[15],
               c[16],c[17],c[18],c[19],c[20],c[21],c[22],c[23],
               c[24],c[25],c[26],c[27],c[28],c[29],c[30],c[31]);

        return 0;
}

Jump to: