Pages:
Author

Topic: Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1) (Read 1259 times)

jr. member
Activity: 42
Merit: 18
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P).  Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. Smiley ) and isn't what is needed here.

Thanks ! Sorry, I paid no attention to field structure. Before Your correction I was surprised that we can use only 7K gates for 256bits mod-mul.



In any case now we have such numbers for 256bits non-modular multiplying:
) school 130K gates
) Karatsuba 44K gates
) Bernstein 35K gates

In this article
https://homes.esat.kuleuven.be/~fvercaut/papers/bar_mont.pdf
I found info that mod-mul Barrett uses 33.15K gates, mod-mul Montgomery uses 33.4K gates.
These numbers related to field with anykind (2^(n−1) ≤ M < 2^n) module, but I am not sure that it's asynchronous(unrolled) schematic.

All this means, that initial j2002ba2 estimations can be lowered dramatically. (mine can be optimized of course too  Smiley Still going. )



Also I found article
http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf
with formulaes to estimate Toom coefficients. But high-grade math terminology used. Could anyone help to estimate best Toom coefficients for 256bits ?!

I feel there's problem with complexity calculations. Usually formulaes are provided to calculate only the general, but not all logic. For example for School-algorithm formulae is O(n2). Using this formulae we can only calculate XOR gates = 65536, while additionaly we have the same number of AND gates.
staff
Activity: 4284
Merit: 8808
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P).  Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. Smiley ) and isn't what is needed here.
jr. member
Activity: 42
Merit: 18
What about toom coefficients, then in libtommath library I found such table:

Split into n Parts         Exponent         Notes
2         1.584962501         This is Karatsuba Multiplication.
3         1.464973520         This is Toom-Cook Multiplication.
4         1.403677461         
5         1.365212389         
10         1.278753601         
100         1.149426538         
1000         1.100270931         
10000         1.075252070         
5.2. MULTIPLICATION 105                  
Figure 5.6: Asymptotic Running Time of Polynomial Basis Multiplication                  

But that's theory, in practice we will have not such good results.

For example, here http://binary.cr.yp.to/m.html
provided info that
school = 130thousands gates
classic Karatsuba = 44thousands gates, while in theory there should 10times smaller than school results...

jr. member
Activity: 42
Merit: 18
I think I found the answer.
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.

So we can use this number for estimations.
jr. member
Activity: 42
Merit: 18
Also I have question related to Toom. Toom can have different coefficients.

In wikipedia provided only these:
Toom-1     = Toom.km(1).kn(1) = classic school multiplication algorithm
Toom-1,5  = Toom.km(2).kn(1)
Toom-2     = Toom.km(2).kn(2) = Karatsuba
Toom-2,5  = Toom.km(3).kn(2)
Toom-3     = Toom.km(3).kn(2) - most frequently used Toom algorithm

But I can imagine we can go further:
Toom-3,5
Toom-4
Toom-4,5
Toom-5
...

And further, and further... Up to 255 splits each 1bit.
But as I understand the more we splitting number - the more carry bits we will have for adding.
So where's the gold balance point ? What are the best Toom coefficients(km, kn) for 256bit numbers ?!
jr. member
Activity: 42
Merit: 18
Toom is likely appropriate at these sizes.

I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder.

In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter.

The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1

This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like.

On doubling:  You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2.

With that there is no need for doubling at all.  Though if you do have a need for doubling you can do it in 3mul 4sqr.

Computing  scalar times point will never have an infinity as an intermediate.  Infinity is functionally 0.   If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum.  E.g. If I want to compute 1984G   4 + 80 + 900 + 1000  ... none of those sums can be zero.  If you consider it, you can see this applies to any number base.   Infinity can only be encountered if the scalar is larger than the group.


Ok, let's forget Montgomery-multiplication. Only two variants left: 511adds or Toom+reduct.

What about infinity, then I just trust You. But what about checking leading to doubleing - I just see it in bitcoin code, so it was good to ask to be sure.
That's very good, that no need to include doubleing calculations.

There are algorithms:
14.3.4 Reduction methods for moduli of special form
(A. Menezes, P. van Oorschot, S. Vanstone. Handbook of Applied Cryptography.)

Algorithm 10.25 Fast reduction for special form moduli
(H. Cohen, G. Frey, R. Avanzi, C. Doche, T. Lange, K. Nguyen, F. Vercauteren.
Handbook of Elliptic and Hyperelliptic Curve Cryptography.)

And I found golang code realizing them with description. I see they splitted secp256k1 modulo in a very interesting way.
https://github.com/btcsuite/btcd/blob/master/btcec/field.go
"   // The secp256k1 prime is equivalent to 2^256 - 4294968273, so it fits
   // this criteria.
   //
   // 4294968273 in field representation (base 2^26) is:
   // n[0] = 977
   // n[1] = 64
   // That is to say (2^26 * 64) + 977 = 4294968273"

So how do You think will it be much better than Barrett ?
While I don't understand this algorithm for special module, but I suppose if it much more efficient for calculating module, then the same algorithm principles can be used for multiplication. Isn't it ?!


And at all, how much more efficient Toom+Reduction than 511adds can be ? Will we save only +-10% gates or much more ?
If not bigger than 10%, then I think prefer in my final estimations to use 511adds, because they are much easier to understand and realize.
staff
Activity: 4284
Merit: 8808
Toom is likely appropriate at these sizes.

I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder.

In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter.

The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1

This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like.

On doubling:  You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2.

With that there is no need for doubling at all.  Though if you do have a need for doubling you can do it in 3mul 4sqr.

Computing  scalar times point will never have an infinity as an intermediate.  Infinity is functionally 0.   If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum.  E.g. If I want to compute 1984G   4 + 80 + 900 + 1000  ... none of those sums can be zero.  If you consider it, you can see this applies to any number base.   Infinity can only be encountered if the scalar is larger than the group.
full member
Activity: 206
Merit: 447

I have much more questions to step 9.
As we see there's checking leading to point doubleing or to infinity.


When the input is sanitized there are no infinity points. Step 9 is skipped.
jr. member
Activity: 42
Merit: 18
we still have a lot of multiplications. no matter affine-affine or affine-Jacobian used.

gmaxwell, You was providing link to PDF related to multiplications. but... it's really too difficult for me to read, high-grade math terminology used, I do not understand at least 70% there.  Undecided

But I already have general understanding in multiplication algorithms. I understand principles of Toom(and subset called Karatsuba.). I do not understand principles of Schonhage–Strassen algorithm, Furer's algorithm and many more, but they are in any case not effective for 2^256 numbers.
After multiplication made we need to reduce by module. I see Barrett reduction is the best suited for secp256k1 case, but I still don't understand principles of such.

Initially, I and j2002ba2 are using binary algorithm for estimations, 1mod-multiplication = 255+256=511mod-additions.

Also there's hybrid algorithm called Montgomery multiplication.


gmaxwell, how do You think, what algorithm is the most effective(=smaller amount of gates) for secp256k1 ?
1) 1mul = 511adds
2) multiply by Toom then reduce by Barrett
3) Montgomery multiplication
4) any more good ?!
jr. member
Activity: 42
Merit: 18
What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully.  I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound.

Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound.

aside,  Pieter has started writing an accessible writeup about how safegcd works.

Today I've programmed logging iterations with quite simple logic - logging largest iter2, iter3, iter4, average summ of them, and largest summ of them.
In fact, largest summ can be interpreted as unrolled iterations count.

In 6hours test-run on random inputs we get as high as 728. I feel running script for some weeks we can get as high as +-760.

Also I carefully(carefully!) re-read article related to 2/ln(k)*ln(A) formulae. It gives result for "a upper bound for the average number of iterations". It sound very interesting, upper bound of average.  Smiley

So........................  Roll Eyes Roll Eyes



Results:
(test#, lrgiter2, lrgiter3, lrgiter4, avgsum, lrgsum):
0 169 183 92 444 444
1 186 183 97 458 466
3 197 183 97 458 477
4 236 183 123 458 542
5 236 190 123 458 549
9 255 190 123 458 568
11 255 191 123 458 569
13 255 197 123 458 575
16 255 197 125 458 577
24 255 197 250 510 702
58 255 204 250 510 709
431 255 205 250 510 710
2952 255 209 250 510 714
8611 255 211 250 510 716
still trying... 2020-12-09 20:00:50.639906
54438 255 214 250 511 719
76276 255 215 250 511 720
325434 255 216 250 517 721
1085191 255 217 250 518 722
2017092 255 222 250 518 727
48955014 255 223 250 524 728
still trying... 2020-12-10 01:56:08.532261
jr. member
Activity: 42
Merit: 18
I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much.

Since there are many multiplexers, it might be beneficial to allow the Transmission gate.

A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors.

Perhaps it would be better to count transistors instead of logic elements.
 NOT - 2
 NOR - 4
NAND - 4
 AND - 6
  OR - 6
 XOR - 8 (6 with PTL)
XNOR - 8
  TG - 2


Yeah, not a big deal.

I have much more questions to step 9.
As we see there's checking leading to point doubleing or to infinity.
The same checking in bitcoin code provided by gmaxwell leading to secp256k1_gej_double_var(r, a, rzr); or to secp256k1_gej_set_infinity(r);.
gmaxwell said, that infinity is not possible, I dumb in ECC, so let's trust to this statement.
And what about doubleing ? Can it be ignored ? If not, then I can't understand how to get smaller than 17mul per one point addition(input hybrid affine+J).

gmaxwell, please, clear up this question.



What about transmission gate and transistors. If to implement secp256k1 in hardware, then it's very good idea..
full member
Activity: 206
Merit: 447
I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much.

Since there are many multiplexers, it might be beneficial to allow the Transmission gate.

A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors.

Perhaps it would be better to count transistors instead of logic elements.
 NOT - 2
 NOR - 4
NAND - 4
 AND - 6
  OR - 6
 XOR - 8 (6 with PTL)
XNOR - 8
  TG - 2
staff
Activity: 4284
Merit: 8808
What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully.  I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound.

Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound.

aside,  Pieter has started writing an accessible writeup about how safegcd works.
jr. member
Activity: 42
Merit: 18
hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated)

(Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).


After lot of thinking I've at last got the point how combining by 10bits and hybrid point addition works.
Funny, but process much-much easier than I was trying to imagine.
And yes, affine+jacobian is just fine.




What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.

staff
Activity: 4284
Merit: 8808
hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated)

(Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).
jr. member
Activity: 42
Merit: 18
the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr.


Yes, bitcoin code. But as I understand bitcoin uses hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.

In fact, ECC is difficult for me, my brain melts.

I don't understand why, but I feel problem to calculate privkey->pubkey using hybrid formulae if there's much leading same bits.
But the more I think about it - the more my brain is going to infinite loop.  Sad



What about iterations of GCD, I have some ideas to verify in Python, so I will write later.
staff
Activity: 4284
Merit: 8808
the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr.
jr. member
Activity: 42
Merit: 18
jr. member
Activity: 42
Merit: 18

I still can't understand how You optimizing gates count by combining bits.


The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.




1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.

For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.


Big thanks for detailed answer !

Yeah, I understand Your idea. When I was constructing my trillionized scheme I was estimating such grouping.
I got:
group by 1 = 256 pre-computations
group by 2 = about 32768 pre-computations
group by 3 = about 2.7mln pre-computation
after seen agressively exponential grow I understood that it's bad idea.

Now I understand that I've made a big combinatorical logic mistake.

After mistake detected(thanks to You), now I can say, of course such grouping is great idea, and yes, more than likely optimal grouping is by ~10bits.


I found "Algorithm 3.22". I see it's hybrid for affine+Jacobian. But I can't understand one thing.
Ok, let's pre-compute points grouping up to 11bits.
Then let's take privkey 111111111111111111 (2+16 leading ones) {skipped some bits} 100000000000000001 (16 leeding zeros)
How can be pubkey calculated using this algorithm without re-converts affine<->Jacobian ?!

(from other side, I can agree, that using privkey with more than 11 leading same bits is not good idea, but...)


Now what about iterations. Ouch... my test-runs shows about 384iterations. but... for such multi-conditional multi-loop algorith it's quite complex to count iterations.
My idea is that I need to place special triggers on each loop, then log results to array, then analyse array in detecting consequental and-or parallel triggers switching.
But I am not too good in Python or Ruby, so it will took a time to implement...

As result I suppose that there will be either 384 or 2*384=768 max iterations( if max input number = 2^256).
of, course in secp256k1 module is slightly smaller than 2^256, so there can be sligthly smaller than 768iters...
...but should be 512 !  Cool

full member
Activity: 206
Merit: 447

I still can't understand how You optimizing gates count by combining bits.


The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.

i.e. at lsb position
bits
00 - 0*G
01 - 1*G
10 - 2*G
11 - 3*G

next group would be
00 - 0*G
01 - 4*G
10 - 8*G
11 - 12*G

and so on.

If we use groups of 4 bits - 24 points memory per group.

It turns out that with affine points best is to use groups of 11 bits, plus 3 bits for the last group.
For Jacobian+affine point additions best is groups of 10 bits, plus 6 bits for the last group.


Some more questions.
1) what is Your formulae for affine point addition with only two multiplications ?
the most optimized version I found has multiplication+multiplication+squaring=3multiplications.
2) what is Your formulae for Jacobian point addition with only 11multiplications ?
the most optimized version I found has 17multiplications.

Also I would like to note that by test-runs for affine we have +-384iterations, but not +-760.


1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.

For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.

Pages:
Jump to: