Neat. You shouldn't benchmark using the tests: they're full of debugging instrumentation that distorts the performance, and they spend a lot of their time on random things. Compile with --enable-benchmark and use the benchmarks.
A quick check on an i7-4600U doesn't give a really clear result:
Before:
field_sqr: min 0.0915us / avg 0.0917us / max 0.0928us
field_mul: min 0.116us / avg 0.116us / max 0.117us
field_inverse: min 25.2us / avg 25.7us / max 28.5us
field_inverse_var: min 13.8us / avg 13.9us / max 14.0us
field_sqrt: min 24.9us / avg 25.0us / max 25.2us
ecdsa_verify: min 238us / avg 238us / max 239us
After (v1):
field_sqr: min 0.0924us / avg 0.0924us / max 0.0928us
field_mul: min 0.117us / avg 0.117us / max 0.117us
field_inverse: min 25.4us / avg 25.5us / max 25.9us
field_inverse_var: min 13.7us / avg 13.7us / max 14.0us
field_sqrt: min 25.1us / avg 25.3us / max 26.1us
ecdsa_verify: min 237us / avg 237us / max 237us
After (v2):
field_sqr: min 0.0942us / avg 0.0942us / max 0.0944us
field_mul: min 0.118us / avg 0.118us / max 0.119us
field_inverse: min 25.9us / avg 26.0us / max 26.4us
field_inverse_var: min 13.6us / avg 13.7us / max 13.8us
field_sqrt: min 25.6us / avg 25.9us / max 27.8us
ecdsa_verify: min 243us / avg 244us / max 246us
Hmm... interesting how differently the two architectures are affected. Unless you are underclocked, I think the times for that particular CPU are pretty slow - is there any debugging or performance-logging framework running on top of this that adds overhead and distorts the results? (Although I do expect newer chips to have better schedulers.) Realistically, you should be quite a bit faster than me. (My library is built with ./configure --enable-benchmark and GCC's default flags, no endomorphism.)
For comparison (Q8200 @ 1.86 GHz):
Before:
field_sqr: min 0.0680us / avg 0.0681us / max 0.0683us
field_mul: min 0.0833us / avg 0.0835us / max 0.0841us
field_inverse: min 18.5us / avg 18.6us / max 18.8us
field_inverse_var: min 6.32us / avg 6.32us / max 6.33us
field_sqrt: min 18.4us / avg 18.6us / max 18.9us
ecdsa_verify: min 243us / avg 243us / max 245us
(v1)
field_sqr: min 0.0654us / avg 0.0660us / max 0.0667us
field_mul: min 0.0819us / avg 0.0822us / max 0.0825us
field_inverse: min 18.4us / avg 18.4us / max 18.5us
field_inverse_var: min 6.35us / avg 6.36us / max 6.37us
field_sqrt: min 18.4us / avg 18.4us / max 18.5us
ecdsa_verify: min 235us / avg 236us / max 237us
(v2)
field_sqr: min 0.0660us / avg 0.0675us / max 0.0679us
field_mul: min 0.0858us / avg 0.0861us / max 0.0862us
field_inverse: min 18.8us / avg 18.8us / max 18.8us
field_inverse_var: min 6.31us / avg 6.31us / max 6.31us
field_sqrt: min 18.5us / avg 18.6us / max 18.7us
ecdsa_verify: min 243us / avg 243us / max 244us
I've always used the benchmarks provided, but I think they may lack real-world correlation. My wake-up call came a few months ago when I disassembled bench_internal to see what clang and gcc were doing differently in terms of asm... one of them was inlining/merging the benchmark loop and the function being benchmarked, saving the call overhead and distorting the result. I think it was clang doing the merging - and that particular benchmark was faster for it. Because of that kind of distortion I couldn't tell which implementation was actually faster. I think it would be a nice addition if we had something like validating a given number of bitcoin blocks (say 10-20 MB of data loaded in RAM) as a more real-world benchmark.
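Something along these lines is what I have in mind - a rough sketch only, not a patch: the verify_job layout and the load_jobs() loader are made up, and only the secp256k1_* calls are the real public API:

/* Sketch: verify a batch of (sighash, signature, pubkey) triples pre-extracted
 * from blocks and held in RAM, as a benchmark closer to real validation. */
#include <stdio.h>
#include <time.h>
#include <secp256k1.h>

typedef struct {
    unsigned char msg32[32]; /* transaction sighash */
    unsigned char sig64[64]; /* signature in compact form */
    unsigned char pub33[33]; /* compressed public key */
} verify_job;

size_t load_jobs(verify_job *jobs, size_t max); /* hypothetical loader for the pre-parsed block data */

int main(void) {
    static verify_job jobs[100000];
    size_t i, ok = 0, n = load_jobs(jobs, sizeof(jobs) / sizeof(jobs[0]));
    secp256k1_context *ctx = secp256k1_context_create(SECP256K1_CONTEXT_VERIFY);
    clock_t t0 = clock();
    for (i = 0; i < n; i++) {
        secp256k1_ecdsa_signature sig;
        secp256k1_pubkey pub;
        if (!secp256k1_ecdsa_signature_parse_compact(ctx, &sig, jobs[i].sig64)) continue;
        if (!secp256k1_ec_pubkey_parse(ctx, &pub, jobs[i].pub33, sizeof(jobs[i].pub33))) continue;
        ok += (size_t)secp256k1_ecdsa_verify(ctx, &sig, jobs[i].msg32, &pub);
    }
    printf("%lu/%lu verified in %.2f s\n", (unsigned long)ok, (unsigned long)n,
           (double)(clock() - t0) / CLOCKS_PER_SEC);
    secp256k1_context_destroy(ctx);
    return 0;
}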
Btw, I remember seeing a video of a lecture you gave about the library at a university (?), where you commented on the library's tests and said something to the effect that perhaps in the future a bounty could be offered for bugs that exist but can't be detected by the tests.
Asm tampering (especially if you try to repurpose the rdi/rsi registers) is definitely one of the areas where the tests can run fine and then bench_internal or bench_verify aborts with an error. Or the opposite (the benchmarks run fine, the tests crash). Or everything is fine under one compiler (tests and benchmarks) and crashes under another. This happens because the compiler uses the registers differently before and after the functions, in conjunction with other functions, so the same code is sometimes OK in certain use cases (executables) and crashes in others (different executables). It's much trickier than C, and I have no idea how a test could catch these - after all, a test can only exercise its own execution.
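A toy illustration of the failure mode (simplified, not taken from my code): the asm below quietly trashes a callee-saved register, and whether anything breaks depends entirely on whether the surrounding code happened to keep live data there - so the same snippet can pass ./tests and still blow up bench_verify, or work with one compiler and not another. Declaring the register in the clobber list is what makes the second version safe:

static void clobber_broken(void) {
    /* r12 is modified but never declared as clobbered - the compiler still thinks it owns it */
    __asm__ __volatile__("xorq %%r12, %%r12" : : : "cc");
}

static void clobber_fixed(void) {
    /* telling the compiler r12 is trashed lets it save/restore or simply avoid the register */
    __asm__ __volatile__("xorq %%r12, %%r12" : : : "r12", "cc");
}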
My "manual" testing routine to see if everything is ok, is by going ./tests, ./bench_internal, ./bench_verify. If everything passes, it's probably good. This is not for the 5x52 (which doesn't have unstable code in it) but for my custom made 4x64 impl.h with different secp256k1_scalar_mul and secp256k1_scalar_sqr).
I wanted to post the whole file so that no cutting and splicing would be needed for the two functions, but the forum notification is bugged (saying my post is >64 kbytes when it isn't), so I had to cut down on the text. Anyway...
static void secp256k1_scalar_mul(secp256k1_scalar *r, const secp256k1_scalar *a, const secp256k1_scalar *b) {
#ifdef USE_ASM_X86_64
uint64_t l[8];
const uint64_t *pb = b->d;
__asm__ __volatile__(
/* Preload */
"movq 0(%%rdi), %%r15\n"
"movq 8(%%rdi), %%rbx\n"
"movq 16(%%rdi), %%rcx\n"
"movq 0(%%rdx), %%r11\n"
"movq 8(%%rdx), %%r9\n"
"movq 16(%%rdx), %%r10\n"
"movq 24(%%rdx), %%r8\n"
/* (rax,rdx) = a0 * b0 */
"movq %%r15, %%rax\n"
"mulq %%r11\n"
/* Extract l0 */
"movq %%rax, 0(%%rsi)\n"
/* (r14,r12,r13) = (rdx) */
"movq %%rdx, %%r14\n"
"xorq %%r12, %%r12\n"
"xorq %%r13, %%r13\n"
/* (r14,r12,r13) += a0 * b1 */
"movq %%r15, %%rax\n"
"mulq %%r9\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r12\n"
"movq %%rbx, %%rax\n"
"adcq $0, %%r13\n"
/* (r14,r12,r13) += a1 * b0 */
"mulq %%r11\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r12\n"
/* Extract l1 */
"movq %%r14, 8(%%rsi)\n"
"movq $0, %%r14\n"
/* (r12,r13,r14) += a0 * b2 */
"movq %%r15, %%rax\n"
"adcq $0, %%r13\n"
"mulq %%r10\n"
"addq %%rax, %%r12\n"
"adcq %%rdx, %%r13\n"
"movq %%rbx, %%rax\n"
"adcq $0, %%r14\n"
/* (r12,r13,r14) += a1 * b1 */
"mulq %%r9\n"
"addq %%rax, %%r12\n"
"adcq %%rdx, %%r13\n"
"movq %%rcx, %%rax\n"
"adcq $0, %%r14\n"
/* (r12,r13,r14) += a2 * b0 */
"mulq %%r11\n"
"addq %%rax, %%r12\n"
"adcq %%rdx, %%r13\n"
/* Extract l2 */
"movq %%r12, 16(%%rsi)\n"
"movq $0, %%r12\n"
/* (r13,r14,r12) += a0 * b3 */
"movq %%r15, %%rax\n"
"adcq $0, %%r14\n"
"mulq %%r8\n"
"addq %%rax, %%r13\n"
"adcq %%rdx, %%r14\n"
/* Preload a3 */
"movq 24(%%rdi), %%r15\n"
/* (r13,r14,r12) += a1 * b2 */
"movq %%rbx, %%rax\n"
"adcq $0, %%r12\n"
"mulq %%r10\n"
"addq %%rax, %%r13\n"
"adcq %%rdx, %%r14\n"
"movq %%rcx, %%rax\n"
"adcq $0, %%r12\n"
/* (r13,r14,r12) += a2 * b1 */
"mulq %%r9\n"
"addq %%rax, %%r13\n"
"adcq %%rdx, %%r14\n"
"movq %%r15, %%rax\n"
"adcq $0, %%r12\n"
/* (r13,r14,r12) += a3 * b0 */
"mulq %%r11\n"
"addq %%rax, %%r13\n"
"adcq %%rdx, %%r14\n"
/* Extract l3 */
"movq %%r13, 24(%%rsi)\n"
"movq $0, %%r13\n"
/* (r14,r12,r13) += a1 * b3 */
"movq %%rbx, %%rax\n"
"adcq $0, %%r12\n"
"mulq %%r8\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r12\n"
"movq %%rcx, %%rax\n"
"adcq $0, %%r13\n"
/* (r14,r12,r13) += a2 * b2 */
"mulq %%r10\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r12\n"
"movq %%r15, %%rax\n"
"adcq $0, %%r13\n"
/* (r14,r12,r13) += a3 * b1 */
"mulq %%r9\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r12\n"
"movq %%rcx, %%rax\n"
"adcq $0, %%r13\n"
/* Extract l4 */
/* "movq %%r14, 32(%%rsi)\n"*/
/* (r12,r13,r14) += a2 * b3 */
"mulq %%r8\n"
"movq %%r14, %%r11\n"
"xorq %%r14, %%r14\n"
"addq %%rax, %%r12\n"
"movq %%r15, %%rax\n"
"adcq %%rdx, %%r13\n"
"adcq $0, %%r14\n"
/* (r12,r13,r14) += a3 * b2 */
"mulq %%r10\n"
"addq %%rax, %%r12\n"
"adcq %%rdx, %%r13\n"
"movq %%r15, %%rax\n"
"adcq $0, %%r14\n"
/* Extract l5 */
/*"movq %%r12, 40(%%rsi)\n"*/
/* (r13,r14) += a3 * b3 */
"mulq %%r8\n"
"addq %%rax, %%r13\n"
"adcq %%rdx, %%r14\n"
/* Extract l6 */
/*"movq %%r13, 48(%%rsi)\n"*/
/* Extract l7 */
/*"movq %%r14, 56(%%rsi)\n"*/
: "+d"(pb)
: "S"(l), "D"(a->d)
: "rax", "rbx", "rcx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
__asm__ __volatile__(
/* Preload. */
/* "movq 32(%%rsi), %%r11\n" */
/* "movq 40(%%rsi), %%r12\n" */
/*"movq 48(%%rsi), %%r13\n" */
/* "movq 56(%%rsi), %%r14\n" */
"movq 0(%%rsi), %%rbx\n"
"movq %3, %%rax\n"
"movq %%rax, %%r10\n"
"xor %%ecx, %%ecx\n"
"xorq %%r15, %%r15\n"
"xorq %%r9, %%r9\n"
"xorq %%r8, %%r8\n"
"mulq %%r11\n"
"addq %%rax, %%rbx\n" /*q0 into rbx*/
"adcq %%rdx, %%rcx\n"
"addq 8(%%rsi), %%rcx\n"
"movq %%r10, %%rax\n"
"adcq %%r9, %%r15\n"
"mulq %%r12\n"
"addq %%rax, %%rcx\n" /*q1 stored to rcx*/
"adcq %%rdx, %%r15\n"
"movq %4, %%rax\n"
"adcq %%r9, %%r8\n"
"mulq %%r11\n"
"addq %%rax, %%rcx\n"
"adcq %%rdx, %%r15\n"
"adcq %%r9, %%r8\n"
"addq 16(%%rsi), %%r15\n"
"adcq %%r9, %%r8\n"
"movq %%r10, %%rax\n"
"adcq %%r9, %%r9\n"
"mulq %%r13\n"
"addq %%rax, %%r15\n"
"adcq %%rdx, %%r8\n"
"movq %4, %%rax\n"
"adcq $0, %%r9\n"
"mulq %%r12\n"
"addq %%rax, %%r15\n"
"adcq %%rdx, %%r8\n"
"adcq $0, %%r9\n"
"movq %%r10, %%rax\n"
"movq $0, %%r10\n"
"addq %%r11, %%r15\n" /*q2 into r15*/
"adcq $0, %%r8\n"
"adcq $0, %%r9\n"
"addq 24(%%rsi), %%r8\n"
"adcq $0, %%r9\n"
"adcq %%r10, %%r10\n"
"mulq %%r14\n"
"addq %%rax, %%r8\n"
"adcq %%rdx, %%r9\n"
"movq %4, %%rax\n"
"movq %%rax, %%rsi\n"
"adcq $0, %%r10\n"
"mulq %%r13\n"
"addq %%rax, %%r8\n"
"adcq %%rdx, %%r9\n"
"adcq $0, %%r10\n"
"addq %%r8, %%r12\n" /* q3 into r12*/
"adcq $0, %%r9\n"
"movq $0, %%r8\n"
"movq %%rsi, %%rax\n"
"adcq $0, %%r10\n"
"mulq %%r14\n"
"addq %%rax, %%r9\n"
"adcq %%rdx, %%r10\n"
"adcq %%r8, %%r8\n"
"addq %%r9, %%r13\n" /*q4 into r13*/
"adcq $0, %%r10\n"
"adcq $0, %%r8\n"
"addq %%r14, %%r10\n" /* q5 into r10 */
"movq %3, %%rax\n"
"movq %%rax, %%r9\n"
"adcq $0, %%r8\n" /*q6 into r8*/
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
%q6 input for second operation is %q1 output from first / RCX as the connecting link
%q7 input for second operation is %q2 output from first / R15 as the connecting link
%q8 input for second operation is %q3 output from first / R12 as the connecting link
%q9 input for second operation is %q4 output from first / R13 as the connecting link*
%q10 input for second operation is %q5 output from first / R10 as the connecting link*
%q11 input for second operation is %q6 output from first / R8 as the connecting link */
/* Reduce 385 bits into 258. */
"mulq %%r13\n"
"xorq %%r14, %%r14\n"
"xorq %%r11, %%r11\n"
"addq %%rax, %%rbx\n" /* q0 output*/
"adcq %%rdx, %%r14\n"
"addq %%rcx, %%r14\n"
"mov $0, %%ecx\n"
"movq %%r9, %%rax\n"
"adcq %%r11, %%r11\n"
"mulq %%r10\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r11\n"
"movq %%rsi, %%rax\n"
"adcq %%rcx, %%rcx\n"
"mulq %%r13\n"
"addq %%rax, %%r14\n" /* q1 output */
"movq %%r9, %%rax\n"
"adcq %%rdx, %%r11\n"
"adcq $0, %%rcx\n"
"xorq %%r9, %%r9\n"
"addq %%r15, %%r11\n"
"adcq %%r9, %%rcx\n"
"movq %%rax, %%r15\n"
"adcq %%r9, %%r9\n"
"mulq %%r8\n"
"addq %%rax, %%r11\n"
"adcq %%rdx, %%rcx\n"
"movq %%rsi, %%rax\n"
"adcq $0, %%r9\n"
"mulq %%r10\n"
"addq %%rax, %%r11\n"
"adcq %%rdx, %%rcx\n"
"adcq $0, %%r9\n"
"addq %%r13, %%r11\n" /* q2 output */
"adcq $0, %%rcx\n"
"adcq $0, %%r9\n"
"addq %%r12, %%rcx\n"
"movq %%rsi, %%rax\n"
"adcq $0, %%r9\n"
"mulq %%r8\n"
"addq %%rax, %%rcx\n"
"adcq %%rdx, %%r9\n"
"addq %%r10, %%rcx\n" /* q3 output */
"adcq $0, %%r9\n"
"movq %%r15, %%rax\n"
"addq %%r8, %%r9\n" /* q4 output */
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
%q2 input for next operation is %q1 output from prior / R14 as the connecting link
%q3 input for next operation is %q2 output from prior / R11 as the connecting link
%q4 input for next operation is %q3 output from prior / RCX as the connecting link
%q5 input for next operation is %q4 output from prior / R9 as the connecting link */
/* Reduce 258 bits into 256. */
"mulq %%r9\n"
"addq %%rbx, %%rax\n"
"adcq $0, %%rdx\n"
"movq %%rax, %%r8\n" /* 0(q2) output */
"movq %%rdx, %%r12\n"
"xorq %%r13, %%r13\n"
"addq %%r14, %%r12\n"
"movq %%rsi, %%rax\n"
"adcq %%r13, %%r13\n"
"mulq %%r9\n"
"addq %%rax, %%r12\n" /* 8(q2) output */
"adcq %%rdx, %%r13\n"
"xor %%ebx, %%ebx\n"
"addq %%r9, %%r13\n"
"adcq %%rbx, %%rbx\n"
"movq $0xffffffffffffffff, %%r14\n"
"addq %%r11, %%r13\n" /* 16(q2) output */
"movq $0, %%r11\n"
"adcq $0, %%rbx\n"
"addq %%rcx, %%rbx\n" /* 24(q2) output */
"adcq $0, %%r11\n" /* c output */
/*FINAL REDUCTION */
/* r8 carries ex 0(%%rdi),
r12 carries ex 8(%%rdi),
r13 carries ex 16(%%rdi),
rbx carries ex 24(%%rdi)
r11 carries c */
"movq $0xbaaedce6af48a03b,%%r9\n"
"movq $0xbaaedce6af48a03a,%%rcx\n"
"movq $0xbfd25e8cd0364140,%%r10\n"
"cmp %%r14 ,%%rbx\n"
"setne %%dl\n"
"cmp $0xfffffffffffffffd,%%r13\n"
"setbe %%al\n"
"or %%eax,%%edx\n"
"cmp %%rcx,%%r12\n"
"setbe %%cl\n"
"or %%edx,%%ecx\n"
"cmp %%r9,%%r12\n"
"movzbl %%dl,%%edx\n"
"seta %%r9b\n"
"cmp %%r10,%%r8\n"
"movzbl %%cl,%%ecx\n"
"seta %%r10b\n"
"not %%ecx\n"
"not %%edx\n"
"or %%r10d,%%r9d\n"
"movzbl %%r9b,%%r9d\n"
"and %%r9d,%%ecx\n"
"xor %%r9d,%%r9d\n"
"cmp %%r14,%%r13\n"
"sete %%r9b\n"
"xor %%r10d,%%r10d\n"
"and %%r9d,%%edx\n"
"or %%edx,%%ecx\n"
"xor %%edx,%%edx\n"
"add %%ecx,%%r11d\n"
"imulq %%r11,%%r15\n"
"addq %%r15,%%r8\n"
"adcq %%rdx,%%r10\n"
"imulq %%r11,%%rsi\n"
"xorq %%r15,%%r15\n"
"xor %%eax,%%eax\n"
"movq %%r8,0(%q2)\n"
"xor %%edx,%%edx\n"
"addq %%r12,%%rsi\n"
"adcq %%rdx,%%rdx\n"
"addq %%rsi,%%r10\n"
"movq %%r10,8(%q2)\n"
"adcq %%rdx,%%r15\n"
"addq %%r11,%%r13\n"
"adcq %%rax,%%rax\n"
"addq %%r15,%%r13\n"
"movq %%r13,16(%q2)\n"
"adcq $0,%%rax\n"
"addq %%rbx,%%rax\n"
"movq %%rax,24(%q2)\n"
: "=D"(r)
: "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
: "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
#else
uint64_t l[8];
secp256k1_scalar_mul_512(l, a, b);
secp256k1_scalar_reduce_512(r, l);
#endif
}
static void secp256k1_scalar_sqr(secp256k1_scalar *r, const secp256k1_scalar *a) {
#ifdef USE_ASM_X86_64
uint64_t l[8];
__asm__ __volatile__(
/* Preload */
"movq 0(%%rdi), %%r11\n"
"movq 8(%%rdi), %%r12\n"
"movq 16(%%rdi), %%rcx\n"
"movq 24(%%rdi), %%r14\n"
/* (rax,rdx) = a0 * a0 */
"movq %%r11, %%rax\n"
"mulq %%r11\n"
/* Extract l0 */
"movq %%rax, %%rbx\n" /*0(%%rsi)\n"*/
/* (r8,r9,r10) = (rdx,0) */
"movq %%rdx, %%r15\n"
"xorq %%r9, %%r9\n"
"xorq %%r10, %%r10\n"
"xorq %%r8, %%r8\n"
/* (r8,r9,r10) += 2 * a0 * a1 */
"movq %%r11, %%rax\n"
"mulq %%r12\n"
"addq %%rax, %%r15\n"
"adcq %%rdx, %%r9\n"
"adcq $0, %%r10\n"
"addq %%rax, %%r15\n" /*8 rsi in r15*/
"adcq %%rdx, %%r9\n"
"movq %%r11, %%rax\n"
"adcq $0, %%r10\n"
/* Extract l1 */
/* 8(rsi) in r15*/
/* (r9,r10,r8) += 2 * a0 * a2 */
"mulq %%rcx\n"
"addq %%rax, %%r9\n"
"adcq %%rdx, %%r10\n"
"adcq $0, %%r8\n"
"addq %%rax, %%r9\n"
"adcq %%rdx, %%r10\n"
"movq %%r12, %%rax\n"
"adcq $0, %%r8\n"
/* (r9,r10,r8) += a1 * a1 */
"mulq %%r12\n"
"addq %%rax, %%r9\n"
"adcq %%rdx, %%r10\n"
/* Extract l2 */
"movq %%r9, 16(%%rsi)\n"
"movq %%r11, %%rax\n"
"movq $0, %%r9\n"
/* (r10,r8,r9) += 2 * a0 * a3 */
"adcq $0, %%r8\n"
"mulq %%r14\n"
"addq %%rax, %%r10\n"
"adcq %%rdx, %%r8\n"
"adcq $0, %%r9\n"
"addq %%rax, %%r10\n"
"adcq %%rdx, %%r8\n"
"movq %%r12, %%rax\n"
"adcq $0, %%r9\n"
/* (r10,r8,r9) += 2 * a1 * a2 */
"mulq %%rcx\n"
"addq %%rax, %%r10\n"
"adcq %%rdx, %%r8\n"
"adcq $0, %%r9\n"
"addq %%rax, %%r10\n"
"adcq %%rdx, %%r8\n"
"movq %%r10, %%r13\n"
"movq %%r12, %%rax\n"
"adcq $0, %%r9\n"
/* Extract l3 */
/*"movq %%r10, 24(%%rsi)\n"*/
/* (r8,r9,r10) += 2 * a1 * a3 */
"mulq %%r14\n"
"xorq %%r10, %%r10\n"
"addq %%rax, %%r8\n"
"adcq %%rdx, %%r9\n"
"adcq $0, %%r10\n"
"addq %%rax, %%r8\n"
"adcq %%rdx, %%r9\n"
"movq %%rcx, %%rax\n"
"adcq $0, %%r10\n"
/* (r8,r9,r10) += a2 * a2 */
"mulq %%rcx\n"
"addq %%rax, %%r8\n"
"adcq %%rdx, %%r9\n"
/* Extract l4 */
/*"movq %%r8, 32(%%rsi)\n"*/
"movq %%r8, %%r11\n"
"movq %%rcx, %%rax\n"
"movq $0, %%r8\n"
/* (r9,r10,r8) += 2 * a2 * a3 */
"adcq $0, %%r10\n"
"mulq %%r14\n"
"addq %%rax, %%r9\n"
"adcq %%rdx, %%r10\n"
"adcq $0, %%r8\n"
"addq %%rax, %%r9\n"
"adcq %%rdx, %%r10\n"
"movq %%r14, %%rax\n"
"adcq $0, %%r8\n"
/* Extract l5 */
/*"movq %%r9, 40(%%rsi)\n"*/
/* "movq %%r9, %%r12\n"*/
/* (r10,r8) += a3 * a3 */
"mulq %%r14\n"
"addq %%rax, %%r10\n"
/* Extract l6 */
/*"movq %%r10, 48(%%rsi)\n"*/
/*"movq %%r10, %%rcx\n"*/
/* Extract l7 */
/*"movq %%r8, 56(%%rsi)\n"*/
/*"movq %%r8, %%r14\n"*/
:
: "S"(l), "D"(a->d)
: "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
__asm__ __volatile__(
/* Preload. */
/* "movq 32(%%rsi), %%r11\n" */
/* "movq 40(%%rsi), %%r9\n" */
/* "movq 48(%%rsi), %%r10\n" */
/* "movq 56(%%rsi), %%r8\n" */
/* "movq 0(%%rsi), %%rbx\n" */
/* "movq %%rcx, %%r13\n"*/
"movq %3, %%rax\n"
"adcq %%rdx, %%r8\n"
"mulq %%r11\n"
"xor %%ecx, %%ecx\n"
"xorq %%r12, %%r12\n"
"xorq %%r14, %%r14\n"
"addq %%rax, %%rbx\n" /*q0 into rbx*/
"adcq %%rdx, %%rcx\n"
/* "addq 8(%%rsi), %%rcx\n" */
"addq %%r15, %%rcx\n"
"mov $0, %%r15d\n"
"movq %3, %%rax\n"
"adcq %%r12, %%r15\n"
"mulq %%r9\n"
"addq %%rax, %%rcx\n" /*q1 stored to rcx*/
"adcq %%rdx, %%r15\n"
"movq %4, %%rax\n"
"adcq %%r12, %%r14\n"
"mulq %%r11\n"
"addq %%rax, %%rcx\n"
"adcq %%rdx, %%r15\n"
"adcq %%r12, %%r14\n"
"addq 16(%%rsi), %%r15\n"
"adcq %%r12, %%r14\n"
"movq %3, %%rax\n"
"adcq %%r12, %%r12\n"
"mulq %%r10\n"
"movq %4, %%rsi\n"
"addq %%rax, %%r15\n"
"adcq %%rdx, %%r14\n"
"movq %%rsi, %%rax\n"
"adcq $0, %%r12\n"
"mulq %%r9\n"
"addq %%rax, %%r15\n"
"adcq %%rdx, %%r14\n"
"adcq $0, %%r12\n"
"movq %3, %%rax\n"
"addq %%r11, %%r15\n" /*q2 into r15*/
"adcq $0, %%r14\n"
"adcq $0, %%r12\n"
"addq %%r13, %%r14\n"
"movq $0, %%r13\n"
"adcq $0, %%r12\n"
"adcq $0, %%r13\n"
"mulq %%r8\n"
"addq %%rax, %%r14\n"
"movq %%rsi, %%rax\n"
"adcq %%rdx, %%r12\n"
"adcq $0, %%r13\n"
"mulq %%r10\n"
"addq %%rax, %%r14\n"
"adcq %%rdx, %%r12\n"
"adcq $0, %%r13\n"
"addq %%r14, %%r9\n" /* q3 into r9*/
"adcq $0, %%r12\n"
"movq %%rsi, %%rax\n"
"movq $0, %%r14\n"
"adcq $0, %%r13\n"
"mulq %%r8\n"
"addq %%rax, %%r12\n"
"adcq %%rdx, %%r13\n"
"adcq %%r14, %%r14\n"
"addq %%r12, %%r10\n" /*q4 into r10*/
"adcq $0, %%r13\n"
"adcq $0, %%r14\n"
"addq %%r8, %%r13\n" /* q5 into r13 */
"movq %3, %%rax\n"
"movq %%rax, %%r12\n"
"adcq $0, %%r14\n" /*q6 into r14*/
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
%q6 input for second operation is %q1 output from first / RCX as the connecting link
%q7 input for second operation is %q2 output from first / R15 as the connecting link
%q8 input for second operation is %q3 output from first / r9 as the connecting link
%q9 input for second operation is %q4 output from first / r10 as the connecting link*
%q10 input for second operation is %q5 output from first / r13 as the connecting link*
%q11 input for second operation is %q6 output from first / r14 as the connecting link */
/* Reduce 385 bits into 258. */
"mulq %%r10\n"
"xorq %%r8, %%r8\n"
"xorq %%r11, %%r11\n"
"addq %%rax, %%rbx\n" /* q0 output*/
"adcq %%rdx, %%r8\n"
"addq %%rcx, %%r8\n"
"movq %%r12, %%rax\n"
"mov $0, %%ecx\n"
"adcq %%r11, %%r11\n"
"mulq %%r13\n"
"addq %%rax, %%r8\n"
"adcq %%rdx, %%r11\n"
"movq %%rsi, %%rax\n"
"adcq %%rcx, %%rcx\n"
"mulq %%r10\n"
"addq %%rax, %%r8\n" /* q1 output */
"movq %%r12, %%rax\n"
"adcq %%rdx, %%r11\n"
"adcq $0, %%rcx\n"
"xorq %%r12, %%r12\n"
"addq %%r15, %%r11\n"
"adcq %%r12, %%rcx\n"
"movq %%rax, %%r15\n"
"adcq %%r12, %%r12\n"
"mulq %%r14\n"
"addq %%rax, %%r11\n"
"adcq %%rdx, %%rcx\n"
"movq %%rsi, %%rax\n"
"adcq $0, %%r12\n"
"mulq %%r13\n"
"addq %%rax, %%r11\n"
"adcq %%rdx, %%rcx\n"
"adcq $0, %%r12\n"
"addq %%r10, %%r11\n" /* q2 output */
"adcq $0, %%rcx\n"
"adcq $0, %%r12\n"
"addq %%r9, %%rcx\n"
"movq %%rsi, %%rax\n"
"adcq $0, %%r12\n"
"mulq %%r14\n"
"addq %%rax, %%rcx\n"
"adcq %%rdx, %%r12\n"
"addq %%r13, %%rcx\n" /* q3 output */
"adcq $0, %%r12\n"
"movq %%r15, %%rax\n"
"addq %%r14, %%r12\n" /* q4 output */
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
%q2 input for next operation is %q1 output from prior / r8 as the connecting link
%q3 input for next operation is %q2 output from prior / R11 as the connecting link
%q4 input for next operation is %q3 output from prior / RCX as the connecting link
%q5 input for next operation is %q4 output from prior / r12 as the connecting link */
/* Reduce 258 bits into 256. */
"mulq %%r12\n"
"addq %%rbx, %%rax\n"
"adcq $0, %%rdx\n"
"movq %%rax, %%r14\n" /* 0(q2) output */
"movq %%rdx, %%r9\n"
"xorq %%r10, %%r10\n"
"addq %%r8, %%r9\n"
"movq %%rsi, %%rax\n"
"adcq %%r10, %%r10\n"
"mulq %%r12\n"
"addq %%rax, %%r9\n" /* 8(q2) output */
"adcq %%rdx, %%r10\n"
"xor %%ebx, %%ebx\n"
"addq %%r12, %%r10\n"
"adcq %%rbx, %%rbx\n"
"movq $0xffffffffffffffff, %%r8\n"
"addq %%r11, %%r10\n" /* 16(q2) output */
"movq $0, %%r11\n"
"adcq $0, %%rbx\n"
"addq %%rcx, %%rbx\n" /* 24(q2) output */
"adcq $0, %%r11\n" /* c output */
/*FINAL REDUCTION */
/* r14 carries ex 0(%%rdi),
r9 carries ex 8(%%rdi),
r10 carries ex 16(%%rdi),
rbx carries ex 24(%%rdi)
r11 carries c */
"movq $0xbaaedce6af48a03b,%%r12\n"
"movq $0xbaaedce6af48a03a,%%rcx\n"
"movq $0xbfd25e8cd0364140,%%r13\n"
"cmp %%r8 ,%%rbx\n"
"setne %%dl\n"
"cmp $0xfffffffffffffffd,%%r10\n"
"setbe %%al\n"
"or %%eax,%%edx\n"
"cmp %%rcx,%%r9\n"
"setbe %%cl\n"
"or %%edx,%%ecx\n"
"cmp %%r12,%%r9\n"
"movzbl %%dl,%%edx\n"
"seta %%r12b\n"
"cmp %%r13,%%r14\n"
"movzbl %%cl,%%ecx\n"
"seta %%r13b\n"
"not %%ecx\n"
"not %%edx\n"
"or %%r13d,%%r12d\n"
"movzbl %%r12b,%%r12d\n"
"and %%r12d,%%ecx\n"
"xor %%r12d,%%r12d\n"
"cmp %%r8,%%r10\n"
"sete %%r12b\n"
"xor %%r13d,%%r13d\n"
"and %%r12d,%%edx\n"
"or %%edx,%%ecx\n"
"xor %%edx,%%edx\n"
"add %%ecx,%%r11d\n"
"imulq %%r11,%%r15\n"
"addq %%r15,%%r14\n"
"adcq %%rdx,%%r13\n"
"imulq %%r11,%%rsi\n"
"xorq %%r15,%%r15\n"
"xor %%eax,%%eax\n"
"movq %%r14,0(%q2)\n"
"xor %%edx,%%edx\n"
"addq %%r9,%%rsi\n"
"adcq %%rdx,%%rdx\n"
"addq %%rsi,%%r13\n"
"movq %%r13,8(%q2)\n"
"adcq %%rdx,%%r15\n"
"addq %%r11,%%r10\n"
"adcq %%rax,%%rax\n"
"addq %%r15,%%r10\n"
"movq %%r10,16(%q2)\n"
"adcq $0,%%rax\n"
"addq %%rbx,%%rax\n"
"movq %%rax,24(%q2)\n"
: "=D"(r)
: "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
: "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
#else
uint64_t l[8];
secp256k1_scalar_sqr_512(l, a);
secp256k1_scalar_reduce_512(r, l);
#endif
}
This, measured right now, gives
(original)
scalar_sqr: min 0.134us / avg 0.135us / max 0.136us
scalar_mul: min 0.141us / avg 0.143us / max 0.144us
scalar_inverse: min 40.5us / avg 40.6us / max 40.9us
(my hacked version - only gcc)
scalar_sqr: min 0.122us / avg 0.122us / max 0.122us
scalar_mul: min 0.126us / avg 0.127us / max 0.127us
scalar_inverse: min 36.7us / avg 36.9us / max 37.1us
The original code (very readable for maintenance, though - unlike my crap), if one disassembles it, shows something like this:
1) mul512 or sqr512 starts and then writes its output to variables
2) Then we have pops and pushes for the next function, which is reduce512
3) The reduce512 function imports the data from the outputs of (1)
4) Reduce512 goes in 3 stages, with each stage writing its own distinct output to variables which the next stage then imports as its input. (The three stages can be streamlined by merging them and staying in registers throughout; the distinct output and input points then become redundant - fewer moves and no need for intermediate variables.)
5) As reduce512 ends, it writes its own output to variables
6) The final reduction imports the output of (5) and processes it.
My rationale was that if mul512 (or sqr512), reduce512 and the final reduction go together in one asm block, you save a lot of inputs/outputs and pops/pushes. Code size also goes down significantly (1300 bytes => 1000 bytes), which leaves some extra L1 cache for other stuff. reduce512/mul512/sqr512 still exist as (altered) code, but they aren't really called; what gets called is the unified secp256k1_scalar_mul and secp256k1_scalar_sqr, which have everything inside them. This was a proof of concept, so to speak: I was looking at the disassembled output and thinking "AARRRGGHHH, why can't one stage or function simply forward its results in registers, instead of all this pushing and popping and RAM and variables?".
For example, this is the behavior between (5) and (6) in the disassembled output of the original reduce512:
406246: 4c 89 4f 10 mov %r9,0x10(%rdi)
40624a: 4d 31 c9 xor %r9,%r9
40624d: 49 01 f0 add %rsi,%r8
406250: 49 83 d1 00 adc $0x0,%r9
406254: 4c 89 47 18 mov %r8,0x18(%rdi)
406258: 4c 89 cb mov %r9,%rbx
40625b: 4c 8b 5f 18 mov 0x18(%rdi),%r11
40625f: 48 8b 77 10 mov 0x10(%rdi),%rsi
My thoughts were "OK, these RAM moves are redundant and HAVE TO GO. Why should r8 be written to RAM and then re-imported from RAM into r11? Why should r9 go to RAM and get re-imported instead of going straight into rsi? Waste of time." I had the same reaction every time I spotted data going out to memory and then being moved back in as input, instead of being used as-is.
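If the stages were fused, those store/reload pairs would collapse into plain register moves - illustrative only, not lifted from an actual build:

mov %r9,%rsi    (instead of mov %r9,0x10(%rdi) followed later by mov 0x10(%rdi),%rsi)
mov %r8,%r11    (instead of mov %r8,0x18(%rdi) followed later by mov 0x18(%rdi),%r11)

or the next stage simply keeps using r9/r8 directly and the moves disappear entirely.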
Still, the source is very readable the way it is right now, and the performance trade-off is not that large compared to the value of understanding what each thing does.