I currently have an i5-8250U @ 3.4GHz laptop and I tried using MULX/ADCX/ADOX in the field 5x52 asm. I found ADCX/ADOX to be useless for speeding up the short parallel carry chains here; performance was the same as or worse than plain ADD/ADC.
MULX is, on paper, 4 cycles vs 3 for classic MUL. The benefit, though, is that you can keep issuing further MULXs into other registers instead of waiting for the rax/rdx pair to be added before going on.
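To make the MULX point concrete, here is a minimal sketch - not the actual field_5x52_asm_impl.h code, and mul2_mulx is just a name I'm using for illustration - of two products of the same limb started back to back, each with its own destination registers (needs a BMI2 CPU):

#include <stdint.h>

/* Sketch only: multiply limb 'a' by two limbs b0 and b1. Classic MUL forces
 * every product through the fixed rdx:rax pair, so the second multiply has to
 * wait until that pair has been consumed. MULX takes its first factor
 * implicitly from rdx and writes lo/hi to registers of your choosing, so the
 * two multiplies below can be in flight at the same time. MULX also leaves
 * the flags untouched, which is what ADCX/ADOX (two separate carry chains on
 * CF and OF) are supposed to exploit - though as said above, that part didn't
 * buy anything here. */
static inline void mul2_mulx(uint64_t a, uint64_t b0, uint64_t b1,
                             uint64_t *lo0, uint64_t *hi0,
                             uint64_t *lo1, uint64_t *hi1) {
    uint64_t l0, h0, l1, h1;
    __asm__ (
        "mulxq %[b0], %[l0], %[h0]\n\t"  /* h0:l0 = rdx(=a) * b0 */
        "mulxq %[b1], %[l1], %[h1]\n\t"  /* h1:l1 = rdx(=a) * b1, independent of the first */
        : [l0] "=&r" (l0), [h0] "=&r" (h0), [l1] "=&r" (l1), [h1] "=&r" (h1)
        : [b0] "r" (b0), [b1] "r" (b1), "d" (a)  /* "d" pins 'a' into rdx, the implicit MULX source */
    );
    *lo0 = l0; *hi0 = h0;
    *lo1 = l1; *hi1 = h1;
}

In the 5x52 mul/sqr inner loops this is what lets the partial products for the next column start while the previous ones are still being added up.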
So, the results are in. Tested with gcc8 (I also tried gcc7 and clang, with similar performance improvements) and parameters as follows (default CFLAGS):
./configure --enable-endomorphism --enable-openssl-tests=no
Default run (normal asm / unmodified secp):
perf stat -d ./bench_verify
ecdsa_verify: min 46.8us / avg 46.8us / max 47.1us
9377.185057 task-clock (msec) # 1.000 CPUs utilized
27 context-switches # 0.003 K/sec
2 cpu-migrations # 0.000 K/sec
3,284 page-faults # 0.350 K/sec
31,765,926,255 cycles # 3.388 GHz (62.49%)
85,725,991,961 instructions # 2.70 insn per cycle (75.00%)
1,822,885,107 branches # 194.396 M/sec (75.00%)
64,368,756 branch-misses # 3.53% of all branches (75.00%)
13,413,927,724 L1-dcache-loads # 1430.486 M/sec (75.03%)
7,487,706 L1-dcache-load-misses # 0.06% of all L1-dcache hits (75.08%)
4,131,670 LLC-loads # 0.441 M/sec (49.96%)
93,400 LLC-load-misses # 2.26% of all LL-cache hits (49.92%)
9.376614679 seconds time elapsed

./bench_internal
scalar_add: min 0.00982us / avg 0.00996us / max 0.0103us
scalar_negate: min 0.00376us / avg 0.00381us / max 0.00411us
scalar_sqr: min 0.0389us / avg 0.0393us / max 0.0405us
scalar_mul: min 0.0395us / avg 0.0398us / max 0.0413us
scalar_split: min 0.175us / avg 0.178us / max 0.186us
scalar_inverse: min 11.3us / avg 11.5us / max 11.7us
scalar_inverse_var: min 2.65us / avg 2.70us / max 2.85us
field_normalize: min 0.00988us / avg 0.00995us / max 0.0102us
field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00411us
field_sqr: min 0.0187us / avg 0.0189us / max 0.0194us
field_mul: min 0.0233us / avg 0.0236us / max 0.0254us
field_inverse: min 5.10us / avg 5.11us / max 5.14us
field_inverse_var: min 2.61us / avg 2.62us / max 2.69us
field_sqrt: min 5.07us / avg 5.08us / max 5.13us
group_double_var: min 0.149us / avg 0.150us / max 0.153us
group_add_var: min 0.337us / avg 0.338us / max 0.341us
group_add_affine: min 0.288us / avg 0.289us / max 0.292us
group_add_affine_var: min 0.243us / avg 0.244us / max 0.246us
group_jacobi_var: min 0.212us / avg 0.219us / max 0.251us
wnaf_const: min 0.0799us / avg 0.0830us / max 0.104us
ecmult_wnaf: min 0.528us / avg 0.532us / max 0.552us
hash_sha256: min 0.324us / avg 0.328us / max 0.345us
hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.30us
hash_rfc6979_hmac_sha256: min 7.00us / avg 7.00us / max 7.03us
context_verify: min 7007us / avg 7038us / max 7186us
context_sign: min 33.4us / avg 34.1us / max 36.7us
num_jacobi: min 0.109us / avg 0.111us / max 0.126us
After MULXing:
Custom field 5x52 asm with MULXs:
perf stat -d ./bench_verify
ecdsa_verify: min 39.9us / avg 39.9us / max 40.2us
8003.494101 task-clock (msec) # 1.000 CPUs utilized
28 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
3,278 page-faults # 0.410 K/sec
27,113,041,440 cycles # 3.388 GHz (62.46%)
70,772,097,848 instructions # 2.61 insn per cycle (75.01%)
1,872,709,155 branches # 233.986 M/sec (75.01%)
63,567,635 branch-misses # 3.39% of all branches (75.01%)
20,812,623,788 L1-dcache-loads # 2600.442 M/sec (75.01%)
7,187,062 L1-dcache-load-misses # 0.03% of all L1-dcache hits (75.01%)
4,098,304 LLC-loads # 0.512 M/sec (49.98%)
97,108 LLC-load-misses # 2.37% of all LL-cache hits (49.98%)
8.003690210 seconds time elapsed

./bench_internal
scalar_add: min 0.00982us / avg 0.00988us / max 0.0100us
scalar_negate: min 0.00376us / avg 0.00377us / max 0.00384us
scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0402us
scalar_mul: min 0.0395us / avg 0.0397us / max 0.0412us
scalar_split: min 0.176us / avg 0.178us / max 0.184us
scalar_inverse: min 11.3us / avg 11.4us / max 11.6us
scalar_inverse_var: min 2.66us / avg 2.71us / max 3.01us
field_normalize: min 0.00988us / avg 0.00998us / max 0.0104us
field_normalize_weak: min 0.00404us / avg 0.00406us / max 0.00415us
field_sqr: min 0.0153us / avg 0.0155us / max 0.0164us
field_mul: min 0.0172us / avg 0.0175us / max 0.0182us
field_inverse: min 4.17us / avg 4.18us / max 4.22us
field_inverse_var: min 2.62us / avg 2.63us / max 2.65us
field_sqrt: min 4.12us / avg 4.12us / max 4.13us
group_double_var: min 0.127us / avg 0.128us / max 0.131us
group_add_var: min 0.270us / avg 0.270us / max 0.272us
group_add_affine: min 0.250us / avg 0.250us / max 0.252us
group_add_affine_var: min 0.200us / avg 0.200us / max 0.201us
group_jacobi_var: min 0.212us / avg 0.214us / max 0.219us
wnaf_const: min 0.0799us / avg 0.0802us / max 0.0817us
ecmult_wnaf: min 0.528us / avg 0.535us / max 0.569us
hash_sha256: min 0.324us / avg 0.334us / max 0.355us
hash_hmac_sha256: min 1.27us / avg 1.27us / max 1.31us
hash_rfc6979_hmac_sha256: min 6.99us / avg 6.99us / max 7.03us
context_verify: min 5934us / avg 5966us / max 6127us
context_sign: min 30.9us / avg 31.3us / max 33.0us
num_jacobi: min 0.111us / avg 0.113us / max 0.120us
From 9.37 secs down to 8.00 secs (0.85x). field_mul is at 0.73x, field_sqr at 0.81x.
After doing some rearranging of group_impl.h (C file):

perf stat -d ./bench_verify
ecdsa_verify: min 38.2us / avg 38.3us / max 38.7us
7675.837387 task-clock (msec) # 1.000 CPUs utilized
37 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
3,268 page-faults # 0.426 K/sec
25,993,738,895 cycles # 3.386 GHz (62.48%)
70,649,153,999 instructions # 2.72 insn per cycle (74.99%)
1,872,833,433 branches # 243.991 M/sec (74.99%)
64,040,465 branch-misses # 3.42% of all branches (74.99%)
20,969,673,428 L1-dcache-loads # 2731.907 M/sec (75.04%)
7,260,544 L1-dcache-load-misses # 0.03% of all L1-dcache hits (75.09%)
4,076,705 LLC-loads # 0.531 M/sec (49.97%)
110,695 LLC-load-misses # 2.72% of all LL-cache hits (49.92%)
7.675396172 seconds time elapsed

./bench_internal
scalar_add: min 0.00980us / avg 0.00984us / max 0.00987us
scalar_negate: min 0.00376us / avg 0.00377us / max 0.00382us
scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0396us
scalar_mul: min 0.0392us / avg 0.0396us / max 0.0403us
scalar_split: min 0.176us / avg 0.177us / max 0.181us
scalar_inverse: min 11.3us / avg 11.4us / max 11.7us
scalar_inverse_var: min 2.65us / avg 2.69us / max 2.93us
field_normalize: min 0.00991us / avg 0.00999us / max 0.0103us
field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00414us
field_sqr: min 0.0153us / avg 0.0154us / max 0.0158us
field_mul: min 0.0172us / avg 0.0173us / max 0.0175us
field_inverse: min 4.17us / avg 4.18us / max 4.20us
field_inverse_var: min 2.62us / avg 2.62us / max 2.65us
field_sqrt: min 4.12us / avg 4.13us / max 4.16us
group_double_var: min 0.121us / avg 0.122us / max 0.123us
group_add_var: min 0.267us / avg 0.268us / max 0.271us
group_add_affine: min 0.249us / avg 0.249us / max 0.252us
group_add_affine_var: min 0.192us / avg 0.193us / max 0.196us
group_jacobi_var: min 0.211us / avg 0.214us / max 0.224us
wnaf_const: min 0.0799us / avg 0.0802us / max 0.0818us
ecmult_wnaf: min 0.528us / avg 0.534us / max 0.574us
hash_sha256: min 0.324us / avg 0.327us / max 0.341us
hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.28us
hash_rfc6979_hmac_sha256: min 6.98us / avg 6.98us / max 7.01us
context_verify: min 5885us / avg 5916us / max 6039us
context_sign: min 30.9us / avg 31.5us / max 35.9us
num_jacobi: min 0.110us / avg 0.111us / max 0.122us
Time spent on ./bench_verify is at 0.81x, going from 31.76bn -> 25.99bn cycles. Most of these gains are hardware-specific though.
I have a few things I'm currently tampering with that get it down to 0.78x. One is a special sqr_inner2 function which takes 3 parameters (r = output, a = input, and a counter). The counter is the number of times you want to square the same number without re-calling the function - the function does the looping on its own. ~10% of the time spent in ./bench_verify is looped field squarings, so this adds up: it takes the inversions (which are mostly repeated squarings) from 4.2us down to 3.9us. The gain is much larger for less optimized sqr functions - the C int128 5x52 sqr implementation goes 20%+ faster when looped internally with a counter.
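The interface is roughly this - a minimal sketch at the fe level, where secp256k1_fe_sqr_times is just a name I'm using here; the real sqr_inner2 does the looping inside the inner function so the limbs never get written back and reloaded between squarings, which is where the actual saving comes from:

/* Sketch of the interface only: square 'a' n times in a row. */
static void secp256k1_fe_sqr_times(secp256k1_fe *r, const secp256k1_fe *a, int n) {
    int i;
    *r = *a;
    for (i = 0; i < n; i++) {
        secp256k1_fe_sqr(r, r);  /* the real version keeps everything in registers across iterations */
    }
}

With that, the repeated-squaring loops in secp256k1_fe_inv / secp256k1_fe_sqrt collapse into single calls instead of calling secp256k1_fe_sqr over and over.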
[1] mulx-asm:
https://github.com/Alex-GR/secp256k1/blob/master/field_5x52_asm_impl.h
[2] reordered group_impl.h (most of the gains are in double_var, and it's probably not hardware-specific since I'm seeing 2% gains even on a 10-year-old quad core; a small illustration of the idea is at the bottom of this post):
https://github.com/Alex-GR/secp256k1/blob/master/group_impl.h
The code is free to use/reuse/take ideas from/implement, etc. - although I don't claim it's heavily tested, or even safe. It does pass the tests, though.
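Regarding [2], I won't paste the diff here, but the general idea (illustration only, not the actual change - the names are made up) is to hoist field operations that don't depend on each other so they sit between dependent ones, instead of leaving the dependent mul/sqr calls back to back:

/* Illustration only, not the real group_impl.h code: t3 doesn't depend on t1
 * or t2, so computing it between the two dependent operations gives the
 * compiler/CPU independent work while t1 is still in flight. */
static void double_var_reorder_sketch(const secp256k1_gej *a) {
    secp256k1_fe t1, t2, t3;
    /* before:
     *   secp256k1_fe_sqr(&t1, &a->x);       t1 = x^2
     *   secp256k1_fe_mul(&t2, &t1, &a->y);  t2 = x^2*y, waits on t1
     *   secp256k1_fe_sqr(&t3, &a->z);       t3 = z^2, independent
     * after: */
    secp256k1_fe_sqr(&t1, &a->x);
    secp256k1_fe_sqr(&t3, &a->z);        /* independent, fills the gap */
    secp256k1_fe_mul(&t2, &t1, &a->y);   /* t2 = x^2*y */
    (void)t2; (void)t3;
}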