Author

Topic: Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1) (Read 1279 times)

jr. member
Activity: 42
Merit: 18
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P).  Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. Smiley ) and isn't what is needed here.

Thanks ! Sorry, I paid no attention to field structure. Before Your correction I was surprised that we can use only 7K gates for 256bits mod-mul.



In any case now we have such numbers for 256bits non-modular multiplying:
) school 130K gates
) Karatsuba 44K gates
) Bernstein 35K gates

In this article
https://homes.esat.kuleuven.be/~fvercaut/papers/bar_mont.pdf
I found info that mod-mul Barrett uses 33.15K gates, mod-mul Montgomery uses 33.4K gates.
These numbers related to field with anykind (2^(n−1) ≤ M < 2^n) module, but I am not sure that it's asynchronous(unrolled) schematic.

All this means, that initial j2002ba2 estimations can be lowered dramatically. (mine can be optimized of course too  Smiley Still going. )



Also I found article
http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf
with formulaes to estimate Toom coefficients. But high-grade math terminology used. Could anyone help to estimate best Toom coefficients for 256bits ?!

I feel there's problem with complexity calculations. Usually formulaes are provided to calculate only the general, but not all logic. For example for School-algorithm formulae is O(n2). Using this formulae we can only calculate XOR gates = 65536, while additionaly we have the same number of AND gates.
staff
Activity: 4326
Merit: 8951
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P).  Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. Smiley ) and isn't what is needed here.
jr. member
Activity: 42
Merit: 18
What about toom coefficients, then in libtommath library I found such table:

Split into n Parts         Exponent         Notes
2         1.584962501         This is Karatsuba Multiplication.
3         1.464973520         This is Toom-Cook Multiplication.
4         1.403677461         
5         1.365212389         
10         1.278753601         
100         1.149426538         
1000         1.100270931         
10000         1.075252070         
5.2. MULTIPLICATION 105                  
Figure 5.6: Asymptotic Running Time of Polynomial Basis Multiplication                  

But that's theory, in practice we will have not such good results.

For example, here http://binary.cr.yp.to/m.html
provided info that
school = 130thousands gates
classic Karatsuba = 44thousands gates, while in theory there should 10times smaller than school results...

jr. member
Activity: 42
Merit: 18
I think I found the answer.
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.

So we can use this number for estimations.
jr. member
Activity: 42
Merit: 18
Also I have question related to Toom. Toom can have different coefficients.

In wikipedia provided only these:
Toom-1     = Toom.km(1).kn(1) = classic school multiplication algorithm
Toom-1,5  = Toom.km(2).kn(1)
Toom-2     = Toom.km(2).kn(2) = Karatsuba
Toom-2,5  = Toom.km(3).kn(2)
Toom-3     = Toom.km(3).kn(2) - most frequently used Toom algorithm

But I can imagine we can go further:
Toom-3,5
Toom-4
Toom-4,5
Toom-5
...

And further, and further... Up to 255 splits each 1bit.
But as I understand the more we splitting number - the more carry bits we will have for adding.
So where's the gold balance point ? What are the best Toom coefficients(km, kn) for 256bit numbers ?!
jr. member
Activity: 42
Merit: 18
Toom is likely appropriate at these sizes.

I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder.

In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter.

The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1

This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like.

On doubling:  You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2.

With that there is no need for doubling at all.  Though if you do have a need for doubling you can do it in 3mul 4sqr.

Computing  scalar times point will never have an infinity as an intermediate.  Infinity is functionally 0.   If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum.  E.g. If I want to compute 1984G   4 + 80 + 900 + 1000  ... none of those sums can be zero.  If you consider it, you can see this applies to any number base.   Infinity can only be encountered if the scalar is larger than the group.


Ok, let's forget Montgomery-multiplication. Only two variants left: 511adds or Toom+reduct.

What about infinity, then I just trust You. But what about checking leading to doubleing - I just see it in bitcoin code, so it was good to ask to be sure.
That's very good, that no need to include doubleing calculations.

There are algorithms:
14.3.4 Reduction methods for moduli of special form
(A. Menezes, P. van Oorschot, S. Vanstone. Handbook of Applied Cryptography.)

Algorithm 10.25 Fast reduction for special form moduli
(H. Cohen, G. Frey, R. Avanzi, C. Doche, T. Lange, K. Nguyen, F. Vercauteren.
Handbook of Elliptic and Hyperelliptic Curve Cryptography.)

And I found golang code realizing them with description. I see they splitted secp256k1 modulo in a very interesting way.
https://github.com/btcsuite/btcd/blob/master/btcec/field.go
"   // The secp256k1 prime is equivalent to 2^256 - 4294968273, so it fits
   // this criteria.
   //
   // 4294968273 in field representation (base 2^26) is:
   // n[0] = 977
   // n[1] = 64
   // That is to say (2^26 * 64) + 977 = 4294968273"

So how do You think will it be much better than Barrett ?
While I don't understand this algorithm for special module, but I suppose if it much more efficient for calculating module, then the same algorithm principles can be used for multiplication. Isn't it ?!


And at all, how much more efficient Toom+Reduction than 511adds can be ? Will we save only +-10% gates or much more ?
If not bigger than 10%, then I think prefer in my final estimations to use 511adds, because they are much easier to understand and realize.
staff
Activity: 4326
Merit: 8951
Toom is likely appropriate at these sizes.

I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder.

In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter.

The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1

This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like.

On doubling:  You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2.

With that there is no need for doubling at all.  Though if you do have a need for doubling you can do it in 3mul 4sqr.

Computing  scalar times point will never have an infinity as an intermediate.  Infinity is functionally 0.   If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum.  E.g. If I want to compute 1984G   4 + 80 + 900 + 1000  ... none of those sums can be zero.  If you consider it, you can see this applies to any number base.   Infinity can only be encountered if the scalar is larger than the group.
full member
Activity: 206
Merit: 450

I have much more questions to step 9.
As we see there's checking leading to point doubleing or to infinity.


When the input is sanitized there are no infinity points. Step 9 is skipped.
jr. member
Activity: 42
Merit: 18
we still have a lot of multiplications. no matter affine-affine or affine-Jacobian used.

gmaxwell, You was providing link to PDF related to multiplications. but... it's really too difficult for me to read, high-grade math terminology used, I do not understand at least 70% there.  Undecided

But I already have general understanding in multiplication algorithms. I understand principles of Toom(and subset called Karatsuba.). I do not understand principles of Schonhage–Strassen algorithm, Furer's algorithm and many more, but they are in any case not effective for 2^256 numbers.
After multiplication made we need to reduce by module. I see Barrett reduction is the best suited for secp256k1 case, but I still don't understand principles of such.

Initially, I and j2002ba2 are using binary algorithm for estimations, 1mod-multiplication = 255+256=511mod-additions.

Also there's hybrid algorithm called Montgomery multiplication.


gmaxwell, how do You think, what algorithm is the most effective(=smaller amount of gates) for secp256k1 ?
1) 1mul = 511adds
2) multiply by Toom then reduce by Barrett
3) Montgomery multiplication
4) any more good ?!
jr. member
Activity: 42
Merit: 18
What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully.  I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound.

Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound.

aside,  Pieter has started writing an accessible writeup about how safegcd works.

Today I've programmed logging iterations with quite simple logic - logging largest iter2, iter3, iter4, average summ of them, and largest summ of them.
In fact, largest summ can be interpreted as unrolled iterations count.

In 6hours test-run on random inputs we get as high as 728. I feel running script for some weeks we can get as high as +-760.

Also I carefully(carefully!) re-read article related to 2/ln(k)*ln(A) formulae. It gives result for "a upper bound for the average number of iterations". It sound very interesting, upper bound of average.  Smiley

So........................  Roll Eyes Roll Eyes



Results:
(test#, lrgiter2, lrgiter3, lrgiter4, avgsum, lrgsum):
0 169 183 92 444 444
1 186 183 97 458 466
3 197 183 97 458 477
4 236 183 123 458 542
5 236 190 123 458 549
9 255 190 123 458 568
11 255 191 123 458 569
13 255 197 123 458 575
16 255 197 125 458 577
24 255 197 250 510 702
58 255 204 250 510 709
431 255 205 250 510 710
2952 255 209 250 510 714
8611 255 211 250 510 716
still trying... 2020-12-09 20:00:50.639906
54438 255 214 250 511 719
76276 255 215 250 511 720
325434 255 216 250 517 721
1085191 255 217 250 518 722
2017092 255 222 250 518 727
48955014 255 223 250 524 728
still trying... 2020-12-10 01:56:08.532261
jr. member
Activity: 42
Merit: 18
I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much.

Since there are many multiplexers, it might be beneficial to allow the Transmission gate.

A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors.

Perhaps it would be better to count transistors instead of logic elements.
 NOT - 2
 NOR - 4
NAND - 4
 AND - 6
  OR - 6
 XOR - 8 (6 with PTL)
XNOR - 8
  TG - 2


Yeah, not a big deal.

I have much more questions to step 9.
As we see there's checking leading to point doubleing or to infinity.
The same checking in bitcoin code provided by gmaxwell leading to secp256k1_gej_double_var(r, a, rzr); or to secp256k1_gej_set_infinity(r);.
gmaxwell said, that infinity is not possible, I dumb in ECC, so let's trust to this statement.
And what about doubleing ? Can it be ignored ? If not, then I can't understand how to get smaller than 17mul per one point addition(input hybrid affine+J).

gmaxwell, please, clear up this question.



What about transmission gate and transistors. If to implement secp256k1 in hardware, then it's very good idea..
full member
Activity: 206
Merit: 450
I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much.

Since there are many multiplexers, it might be beneficial to allow the Transmission gate.

A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors.

Perhaps it would be better to count transistors instead of logic elements.
 NOT - 2
 NOR - 4
NAND - 4
 AND - 6
  OR - 6
 XOR - 8 (6 with PTL)
XNOR - 8
  TG - 2
staff
Activity: 4326
Merit: 8951
What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully.  I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound.

Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound.

aside,  Pieter has started writing an accessible writeup about how safegcd works.
jr. member
Activity: 42
Merit: 18
hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated)

(Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).


After lot of thinking I've at last got the point how combining by 10bits and hybrid point addition works.
Funny, but process much-much easier than I was trying to imagine.
And yes, affine+jacobian is just fine.




What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.

staff
Activity: 4326
Merit: 8951
hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated)

(Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).
jr. member
Activity: 42
Merit: 18
the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr.


Yes, bitcoin code. But as I understand bitcoin uses hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.

In fact, ECC is difficult for me, my brain melts.

I don't understand why, but I feel problem to calculate privkey->pubkey using hybrid formulae if there's much leading same bits.
But the more I think about it - the more my brain is going to infinite loop.  Sad



What about iterations of GCD, I have some ideas to verify in Python, so I will write later.
staff
Activity: 4326
Merit: 8951
the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr.
jr. member
Activity: 42
Merit: 18
jr. member
Activity: 42
Merit: 18

I still can't understand how You optimizing gates count by combining bits.


The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.




1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.

For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.


Big thanks for detailed answer !

Yeah, I understand Your idea. When I was constructing my trillionized scheme I was estimating such grouping.
I got:
group by 1 = 256 pre-computations
group by 2 = about 32768 pre-computations
group by 3 = about 2.7mln pre-computation
after seen agressively exponential grow I understood that it's bad idea.

Now I understand that I've made a big combinatorical logic mistake.

After mistake detected(thanks to You), now I can say, of course such grouping is great idea, and yes, more than likely optimal grouping is by ~10bits.


I found "Algorithm 3.22". I see it's hybrid for affine+Jacobian. But I can't understand one thing.
Ok, let's pre-compute points grouping up to 11bits.
Then let's take privkey 111111111111111111 (2+16 leading ones) {skipped some bits} 100000000000000001 (16 leeding zeros)
How can be pubkey calculated using this algorithm without re-converts affine<->Jacobian ?!

(from other side, I can agree, that using privkey with more than 11 leading same bits is not good idea, but...)


Now what about iterations. Ouch... my test-runs shows about 384iterations. but... for such multi-conditional multi-loop algorith it's quite complex to count iterations.
My idea is that I need to place special triggers on each loop, then log results to array, then analyse array in detecting consequental and-or parallel triggers switching.
But I am not too good in Python or Ruby, so it will took a time to implement...

As result I suppose that there will be either 384 or 2*384=768 max iterations( if max input number = 2^256).
of, course in secp256k1 module is slightly smaller than 2^256, so there can be sligthly smaller than 768iters...
...but should be 512 !  Cool

full member
Activity: 206
Merit: 450

I still can't understand how You optimizing gates count by combining bits.


The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.

i.e. at lsb position
bits
00 - 0*G
01 - 1*G
10 - 2*G
11 - 3*G

next group would be
00 - 0*G
01 - 4*G
10 - 8*G
11 - 12*G

and so on.

If we use groups of 4 bits - 24 points memory per group.

It turns out that with affine points best is to use groups of 11 bits, plus 3 bits for the last group.
For Jacobian+affine point additions best is groups of 10 bits, plus 6 bits for the last group.


Some more questions.
1) what is Your formulae for affine point addition with only two multiplications ?
the most optimized version I found has multiplication+multiplication+squaring=3multiplications.
2) what is Your formulae for Jacobian point addition with only 11multiplications ?
the most optimized version I found has 17multiplications.

Also I would like to note that by test-runs for affine we have +-384iterations, but not +-760.


1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.

For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.

jr. member
Activity: 42
Merit: 18

In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p
  = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769


Some more questions.
1) what is Your formulae for affine point addition with only two multiplications ?
the most optimized version I found has multiplication+multiplication+squaring=3multiplications.
2) what is Your formulae for Jacobian point addition with only 11multiplications ?
the most optimized version I found has 17multiplications.

of course, there are Jacobian formulaes with smaller multiplications amount, but they are limited by Z1=1, so can be used only to add pre-computated or re-converted affine<->Jacobian points.

Also I would like to note that by test-runs for affine we have +-384iterations, but not +-760.
jr. member
Activity: 42
Merit: 18
Point addition


Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit  PA   256-bit-muxes

1: 255   2*256 + 256           1,676,343,279

10:  25   2*(1024*25 + 64) + 26   219,403,033




Using mixed Jacobian-affine coordinates,
adding in groups of 10 bits (25 1024-point memories plus 64-point memory),
we get 219,403,033 gates.



Thanks for fresh info !

I still can't understand how You optimizing gates count by combining bits.
In Your initial post You was using such technique for additions, now I see You are using it for any operations.

Please, explain me in a very simple steps.

For example we have two 32bits numbers to summarize.

00111111111111111111111111111111
00000000000000000000000000000001

And the same numbers splitted to 10bit limbs.

1111111111 1111111111 1111111111
0000000000 0000000000 0000000001

Let's say we are using classical cascaded full-adder(ripple-carry adder)
How amount of gates can be lowered due to splitting ?!


Either... do You mean using not 2input, but 10input gates and summarize in parallel not two, but ten numbers ? But it's against my initial limitations.

Or is this technique in some kind similiar to Karatsuba algorithm ?!
full member
Activity: 206
Merit: 450
Point addition

In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p
  = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769

Affine coordinates:
bit  PA   256-bit-muxes
 1: 255   2*256             4,421,791,553
 2: 127   2*4*128           2,202,816,961
 4:  63   2*16*64           1,093,920,257
 8:  31   2*256*32            550,102,561
16:  15   2*65536*16        1,872,792,113
 9:  28   2*(512*28 + 16)     507,560,196
10:  25   2*(1024*25 + 64)    472,941,607
11:  23   2*(2048*23 + 8 )    471,251,001
12:  21   2*(4096*21 + 16)    496,432,331


Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit  PA   256-bit-muxes
 1: 255   2*256 + 256           1,676,343,279
 2: 127   2*4*128 + 128           844,170,607
 4:  63   2*16*64 + 64            428,674,863
 8:  31   2*256*32 + 32           231,557,647
16:  15   2*65536*16 + 16       1,727,597,439
 9:  28   2*(512*28 + 16) + 29    221,518,452
10:  25   2*(1024*25 + 64) + 26   219,403,033
11:  23   2*(2048*23 + 8 ) + 24   239,381,207



Using mixed Jacobian-affine coordinates,
adding in groups of 10 bits (25 1024-point memories plus 64-point memory),
we get 219,403,033 gates.

full member
Activity: 206
Merit: 450
Multiplication is cheaper since p is fixed.

1-bit full-add: 2*XOR + 2*AND + 1*OR
1-bit half-add: 1*XOR + 1*AND
1-bit full-add-zero: 1*XOR + 1*AND
1-bit full-add-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit half-add-one: 1*NOT + wire

256-bit-add: 255*full-add + 1*half-add
  = 511*XOR + 511*AND + 1*OR
256-bit sub: 255*full-subtract + 1*half-subtract
  = 511*XOR + 511*AND + 255*OR + 511*NOT
256-bit-add-p: 249*full-add-one + 1*half-add-one + 6*full-add-zero
  = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit-add-minus-p: 249*full-add-zero + 1*half-add-one + 6*full-add-one
  = 249*XOR + 249*AND + 1*NOT + 6*XNOR + 6*OR
256-bit mux: 768*NAND + 1*NOT

add-mod-p: 256-bit-add + 256-bit-add-minus-p + 256-bit-mux
 = 6*XNOR + 760*XOR + 760*AND + 7*OR + 768*NAND + 2*NOT
sub-mod-p: 256-bit-sub + 256-bit-add-p + 256-bit-mux
 = 249*XNOR + 517*XOR + 517*AND + 504*OR + 768*NAND + 513*NOT
mul-mod-p: 256*add-mod-p
 = 0x600*XNOR + 0x2f800*XOR + 0x2f800*AND + 0x700*OR + 0x30000*NAND + 0x200*NOT

mul-mod-p is 589,568 gates; this makes it ~27 times cheaper than inversion

jr. member
Activity: 42
Merit: 18
Here we go !
One more test code. no over 256bits bug, returns also symmetric by module result. a little bit fuzzy conditions checking logic, I can call it quasi-recursive. it's binary in it's nature, just realized in a kind strange way.

Part of results:

test# calls# in [out, mod symm out]
0 364 75316636100529471829340583998430204208230490280793297630878645139432980850156 [-21481456315583653538655726848489351884800251925024345069131289520822185938657, 33025674478099610959444009195219596039238050737410647384245105885135854051132]
1 385 39888426508246177183753123701414070167668113673592482698911298409940354217536 [-29960760575314177163053332470999754258247890690156906035102057524026276070769, 86973073792154943490949046208733472044349727771710623997020383424312036814343]
2 360 14620752759159978489146343667071343355650196237127974340021988638599091328303 [12174829735737702474779493260773244294171916983485771075179095797017886235155, -96421093662667652633700306227539987869225167103772496913502848083575395070788]
3 383 102729399071985721675035906741175076856965109198477353566672326776001633036975 [70580439539010164434200376999720640409016804834521829709981617744376762996302, -79555186999421925612988503851178736579078677622324526318628793408380057609471]
4 370 46339676628294076970974548809934900743231066044015694600462844368554430445143 [2539235054366037579706180091206398081627721978267432321188130108233551671292, -6344958648894604413478372293328991555118887139277093896810498521216296850165]
5 368 26660698059602402176427980216008131458879504029132938908230058389025515567933 [16540677529517895161096648910563987919480254183872798943295671897534551358308, -71839064538438794363947864942753805375404185837480918551631678838489697347191]
6 354 594851748071041868996674850995031382743985475622021247651263573480083167890 [-64638328581828718437634047594178337042202536377421456333765449858808997403, 12582306659716172481129154954429736186019811569145338237994747628567568503971]
7 359 80681079014327548209785902873606497791641955222184391014151038138305717861451 [44686311015859290531071619784617096434102136509364594238047222505744964738512, -64133020728639186296951506401894836855168854511276042070060900652427696988205]
8 372 82669520055979716581712547816646165489596227017635063706256738823534854476311 [45051940896140100306048968032786700593661378537963560664532177163878090793890, -63102560133743198571360833063775568154007385207992635050991710591879861205579]
9 360 55943743540139384302837049042449505019549115662377158567347079861108763368228 [-21482989564914934180757589310789560187172090496876602635626422561634142769637, 44465387680038776841839359752521363942810960966395512063353292864302945775019]
10 349 66724607627136654255694803441709862788832479174235937080362394192249729343898 [29069806528358354934610693165080779089541395635574791746246411666180327081765, -50446960294663850907746139715414905993368124572672804919153802516270964965753]
#
#
#

As in previous code iterations count close to 384number.
But it's not good iterations indicator. I have idea how to count real true unrolled iterations, but need some time to implement.


Code:
#code by hakabit supraspecially for gmaxwell
#algorithm source: https://dspace.cvut.cz/bitstream/handle/10467/77227/F3-DP-2018-Zhumaniezov-Alisher-priloha-Testing.py

import random

def bin_ext_gcd(a, b):

    global cc
    cc = cc + 1

    if a == 0:
        #print(a, b, 0, 1)
        return [0, 1]

    if b == 0:
        #print(a, b, 1, 0)
        return [1, 0]

    if a == b:
        #print(a, b, 1, 0)
        return [1, 0]

    if (a % 2 == 0) & (b % 2 == 0):
        x, y = bin_ext_gcd(a // 2, b // 2)
        #print(a, b, x, y)
        return [x, y]

    if a % 2 == 0:
        x, y = bin_ext_gcd(a // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y
        else:
            x, y = (x + b) // 2, y - a // 2
        #print(a, b, x, y)
        return [x, y]

    if b % 2 == 0:
        x, y = bin_ext_gcd(a, b // 2)
        if y % 2 == 0:
            x, y = x, y // 2
        else:
            x, y = x - b // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

    if a > b:
        x, y = bin_ext_gcd((a - b) // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y - x // 2
        else:
            x, y = (x + b) // 2, y - (a - b) // 2 - (x + b) // 2
        #print(a, b, x, y)
        return [x, y]

    if a < b:
        x, y = bin_ext_gcd(a, (b - a) // 2)
        if y % 2 == 0:
            x, y = x - y // 2, y // 2
        else:
            x, y = x - (b - a) // 2 - (y + a) // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

cc = 0

p = 2**256 - 2**32 - 977


#cc=0
#a=12
#res=bin_ext_gcd(p,a)
#print ("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print (res,cc)



print ("test# calls# in [out, mod symm out]")
for i in range(100):

    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p

    if (ii < 2):
        ii = ii + 2

    cc = 0
    res = bin_ext_gcd(p, ii)
    print(i,cc,ii,res)

#end
jr. member
Activity: 42
Merit: 18
So here is my python code to test iterations.(see at the end)

And some part of test-results:

test# in (out, iterx, iter1, iter2, iter3, itersum):
0 51387622388232169033260144332317924780174711222857253999722512827559407809221 (42194227499242744317322493862270938245042350176722270391089169414969189792505, 173, 0, 177, 193, 370)
1 49303630616562965676786885508349510956169277855033004449417649058314657299485 (47016891994125063651541437063141305860063754426223854778073828393151612275997, 178, 0, 192, 172, 364)
2 60564124354985471474884769266985666656540021107491550318511011936878151315223 (96860706416467978990701996056838683430055307461271246547818608610050676002308, 181, 0, 175, 187, 362)
3 26876395035512405675280385359306825136587569085110256960487392157911296412919 (68790574322733353789092487174594866418670893295029170296485893186408773988152, 182, 0, 176, 183, 359)
4 11023636690140810652109138350092873223350285162460178239756198575491250070806 (79574857778441618190374605398778396331605356310850458929090676313728623789452, 186, 0, 163, 191, 354)
5 48673899963758261337290584766356417468862845552291965980487962550310458587722 (43493957935191599280071520123333639912283544665443371757241332743464486359707, 183, 0, 177, 184, 361)
6 13375659417630302272707017036247711675902473928798522670864387374916032135255 (91371322322092500654928225171492304811395845400276376022117681208712975765739, 174, 0, 178, 179, 357)
7 18933447562660498799151803539411934730546092368277561544957971138167446797703 (58387591255298048564603862915862446716810659550968633559759220573583656442311, 185, 0, 195, 167, 362)
#
#
#

Test-runs shows that:
1) step5 cannot be skipped as it was suggested by j2002ba2, because usually it will lead to infinite loop, especially for small numbers. try to send as input 12.
2) algortihm has internal bug leading to output results bigger than 256bits. try to send as input 10321180907436651991825056156721508500075068309065038805868074283166084193390. I've placed a fix - to send back moduled resuts.
3) itersum shows that it's quite impossible to get more than 512 iterations, I feel 384 is the biggest possible.
keep in mind that iter2 and iter3 loops are not stand-alone, they are running internally in iterx loop.

In fact I am not sure if it's good idea to use itersum=iter2+iter3 as iterations indicator. possibly it's more correct to pay attention only to iterx, or to largest iter.


Code:
#code by hakabit supraspecially for gmaxwell
#algorithm source: http://cacr.uwaterloo.ca/hac/about/chap14.pdf, page 608, section "14.61 Algorithm Binary extended gcd algorithm"

import random
#print(random.getrandbits(256))

def ext_binary_gcd(x, y):

    g = 1

    iterx = 0
    iter1 = 0
    iter2 = 0
    iter3 = 0

    while (x % 2 == 0) and (y % 2 == 0) :    # yeah, funny cycle when one number is prime :-)

        iter1 = iter1 + 1

        x = x // 2
        y = y // 2
        g = 2 * g

    u = x
    v = y
    A = 1
    B = 0
    C = 0
    D = 1

    while (u != 0) :

        iterx = iterx + 1

        #step4
        while (u % 2 == 0) :

            iter2 = iter2 + 1

            u = u // 2

            if (A % 2 == 0) and (B % 2 == 0) :

                A = A // 2
                B = B // 2

            else :

                A = (A + y) // 2
                B = (B - x) // 2

        while (v % 2 == 0) :

            iter3 = iter3 + 1

            v = v // 2

            if (C % 2 == 0) and (D % 2 == 0) :

                C = C // 2
                D = D // 2

            else :

                C = (C + y) // 2
                D = (D - x) // 2

        if (u >= v) :

            u = u - v
            A = A - C
            B = B - D

        else :

            v = v - u
            C = C - A
            D = D - B

    #a = C
    #b = D

    pp = 2 ** 256 - 2 ** 32 - 977
    b = D % pp   #BLM-rig

    #return (a, b, g*v)
    return (b, iterx, iter1, iter2, iter3, iter2 + iter3)

p = 2**256 - 2**32 - 977

#a=12
#res=ext_binary_gcd(p,a)
#print ("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print ("in (out, iterbig, iter1, iter2, iter3): ",a,res)



print("test# in (out, iterx, iter1, iter2, iter3, itersum): ")
for i in range(1000):

    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p

    if (ii < 2):
        ii = ii + 2

    res = ext_binary_gcd(p, ii)
    print(i, ii, res)

#end
staff
Activity: 4326
Merit: 8951
760 cycle: 16,035,806 gates
765 cycle: 16,141,271 gates

this is close to 8.5 times more gates than the wrong estimate
This sounds more reasonable.
staff
Activity: 4326
Merit: 8951
In any case I found proofs how to calculate bound limits both for non-binary and binary GCD.
While I see that You was writing that "AFAIK no one knows how to compute an tight bound for this kind of algorithm, but useful upper limits exist which is all you need."
(BTW, if You are intrested in proof details, then I will try to find initial Ishmukhametov's articles and post them here. Just let me know.)

Calculations for 256bits numbers shows that "worst-scenario cannot be bigger than 350 iterations for non-binary GCD"
and that "worst-scenario should be no smaller than 512 iterations for binary GCD."
As result it means that "it is guaranteed to get result in no more than 512 iterations for worst-scenario (both for binary both for non-binary GCD)".
By a "tight bound" I mean the exact number of iterations which is enough for the worst input, but no longer.  Instead, what we can compute is a loose bound: a number which is large enough for any input but might be too large.

"worst-scenario should be no smaller than 512 iterations for binary GCD." might well be true, but "no smaller than" is not useful to you. For determining the circuit size you need to know that the worst case is no larger than some size.

As an aside, I found some random binary GCD from an internet search and it typically needed around a bit over 1000 iterations for random inputs and the secp256k1 field prime. (I would have tried the one from j2002ba2's earlier post but it gave wrong results when I tested it, though perhaps I made an error while transliterating it).

Quote
You are speaking about GCD algorithm with 735iterations. Why do we need 735 if 512 enough ? My answer - lot of them are dummy(fake).
They deliberately(willfully) added dummy iterations generated by quasi-random algorithm as technique to fight against side-channel attacks.
Nope that is just incorrect.

There aren't any dummy iterations except in the sense that on some numbers it takes fewer than the maximum so there will be some iterations that don't do anything useful-- but those iterations will still be there in the logic implementation. ... also in the sense that the known maximum bound isn't tight, even on the worst number it probably takes a few less.

The algorithm of the safegcd paper takes more iterations than some other constant time binary egcds because it is a least-significant bit algorithm:  each iteration only looks at and updates the least significant bits of their numbers.  In software with 32/64 bit numbers this ends up being a LOT faster than versions that have to update the whole 256 bit numbers at each step.

Implemented as raw logic I suspect a MSB binary egcd will be more efficient because you can shrink the numbers that it works on a bit at a time as their magnitude is guaranteed to be decreased by the algorithm.
full member
Activity: 206
Merit: 450
For binary euclidean gcd the maximum number of iterations is <=760.

To achieve it let u is zero with msb=1. Then 6 does 255 cycles (the number of zeroes), and we end with u=1. Next 21 and 12 are alternated, except for the few zero bits in p. 12 is executed 255 times, and 21 - 249. Finally u and v are both 1, and 18 is executed.

Another way to look at it is both 6 and 12 have to be executed 255 times max. This limits either 18 or 21 to be executed at most 255 times. This gives <=765 cycles.

Even A tends towards zero, odd A tends towards p. After every execution of 18 u is even and 6 is executed at least once. Same for 21 and 12. Then A and C are between -2p and 2p (after 18/21). 256+1 bits plus 1 sign bit, for toal 258 bits.


1-bit full-adder: 2*XOR + 2*AND + 1*OR
1-bit half-adder: 1*XOR + 1*AND
1-bit full-add-to-zero: 1*XOR + 1*AND
1-bit full-add-to-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-zero: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-one: 1*XNOR + 1*OR + 1*NOT
1-bit half-add-to-one: 1*NOT + wire
1-bit half-add-to-one-drop-lsb: wire

258-bit add to p drop lsb (A+p)/2: 249*full-add-to-one + 1*half-add-to-one-drop-lsb + 8*full-add-to-zero
  = 249*XNOR + 249*OR + 8*XOR + 8*AND
256-bit sub: 255*full-subtract + 1*half-subtract
  = 511*XOR + 511*AND + 255*OR + 511*NOT
258-bit sub: 258*full-subtract
  = 516*XOR + 516*AND + 258*OR + 516*NOT
256-bit mux: 768*NAND + 1*NOT
258-bit mux: 774*NAND + 1*NOT
256-bit compare ge: 256*XNOR + 256*AND + 256*NOT +   255*AND +   255*AND + 255*NOT + 255*AND
  = 256*XNOR + 1021*AND + 511*NOT

We have the following possible operations in the cycle:
1. 256-bit check for zero: 255*OR (+ 1*NOT)   <- not omitted
2. 256-bit >=: 256*XNOR + 1021*AND + 511*NOT
3. 256-bit subtraction: 511*XOR + 511*AND + 255*OR + 511*NOT
4. 258-bit add p, div 2: 249*XNOR + 249*OR + 8*XOR + 8*AND
5. 258-bit subtraction: 516*XOR + 516*AND + 258*OR + 516*NOT
3,4,5 are done twice

Control:
u1 = not 1. & not u[0]              - 1*AND + 1*NOT
v1 = not 1. & u[0] & not v[0]       - 2*AND + 1*NOT
u2 = not 1. & u[0] & v[0] & 2.      - 3*AND
v2 = not 1. & u[0] & v[0] & not 2.  - 1*AND   (repeat from up)
u = not (u1 | u2)                   - 1*OR + 1*NOT
v = not (v1 | v2)                   - 1*OR + 1*NOT
A1 = u1 & not A[0]                  - 1*AND + 1*NOT
A2 = u1 & A[0]                      - 1*AND
A3 = u2
A = u
C1 = v1 & not C[0]                  - 1*AND + 1*NOT
C2 = v1 & C[0]                      - 1*AND
C3 = v2
C = v
= 11*AND + 2*OR + 6*NOT

One cycle:
1. 255*OR
2. 256*XNOR + 1021*AND + 511*NOT
3. 1022*XOR + 1022*AND + 510*OR + 1022*NOT
4. 498*XNOR + 498*OR + 16*XOR + 16*AND
5. 1032*XOR + 1032*AND + 516*OR + 1032*NOT
ctrl. 11*AND + 2*OR + 6*NOT
mux. 10800*NAND + 14*NOT
= 754*XNOR + 2070*XOR + 3102*AND + 1782*OR + 10800*NAND + 2585*NOT

760 cycles: 573040*XNOR + 1573200*XOR + 2357520*AND + 1354320*OR + 8208000*NAND + 1964600*NOT
765 cycles: 576810*XNOR + 1583550*XOR + 2373030*AND + 1363230*OR + 8262000*NAND + 1977525*NOT

---

C mod p:
-2p <= C <= 2p, hence two times add to p, and two times add to -p
256-bit add to p: 249*full-add-to-one + 1*half-add-to-one + 6*full-add-to-zero
  = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit add to -p: 248*full-add-to-zero + 1*half-add-to-one + 7*full-add-to-one:
  = 248*XOR + 248*AND + 1*NOT + 7*XNOR + 7*OR

selecting the output:
257-bit mux: 771*NAND + 1*NOT
256-bit mux: 768*NAND + 1*NOT

control the output:
omitted, <16 elements

= 512*XNOR + 508*XOR + 508*AND + 512*OR + 3078*NAND + 8*NOT

---

total:
760 cycle: 573552*XNOR + 1573708*XOR + 2358028*AND + 1354832*OR + 8211078*NAND + 1964608*NOT
765 cycle: 577322*XNOR + 1584058*XOR + 2373538*AND + 1363742*OR + 8265078*NAND + 1977533*NOT

760 cycle: 16,035,806 gates
765 cycle: 16,141,271 gates

this is close to 8.5 times more gates than the wrong estimate
jr. member
Activity: 42
Merit: 18
In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course constant-time GCD algorithm can contain 735 and even more iterations - most of them are dummy(fake, useless).

In the case of an unrolled logic implementation, *all* candidates will be constant time.  Algorithms like binary gcds are made constant time by unrolling to their maximum number of operations.

If you allow memory and looping then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for them memory for a point, the input, a couple hundred bits of temporaries, and the program itself)

Quote
I mean (and had to write) that binary version cannot be worstier than non-binary.
That is how I understood what you wrote. But all of the binary gcd variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical gcd to reason about the iteration counts of some binary gcd.


Now I need to agree that I understand GCD algos in not so good way as I was thinking. Need some time to analyse them additionally.

In any case I found proofs how to calculate bound limits both for non-binary and binary GCD.
While I see that You was writing that "AFAIK no one knows how to compute an tight bound for this kind of algorithm, but useful upper limits exist which is all you need."
(BTW, if You are intrested in proof details, then I will try to find initial Ishmukhametov's articles and post them here. Just let me know.)

Calculations for 256bits numbers shows that "worst-scenario cannot be bigger than 350 iterations for non-binary GCD"
and that "worst-scenario should be no smaller than 512 iterations for binary GCD."
As result it means that "it is guaranteed to get result in no more than 512 iterations for worst-scenario (both for binary both for non-binary GCD)".

You are speaking about GCD algorithm with 735iterations. Why do we need 735 if 512 enough ? My answer - lot of them are dummy(fake).
They deliberately(willfully) added dummy iterations generated by quasi-random algorithm as technique to fight against side-channel attacks.
Either they was unrolling cycles, but unrolled them in non-correct way, adding useless iterations without special will for doing that.

And what do You think ?




P.S.
Of course, any algorithm can be calculated using only one NAND(or NOR) logic gate + a lot of memory and muxes.

I would like to write much about unrolled logic and constant-time, but prefer to wait Your answer related to iterations overcounting.
staff
Activity: 4326
Merit: 8951
In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course constant-time GCD algorithm can contain 735 and even more iterations - most of them are dummy(fake, useless).

In the case of an unrolled logic implementation, *all* candidates will be constant time.  Algorithms like binary gcds are made constant time by unrolling to their maximum number of operations.

If you allow memory and looping then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for them memory for a point, the input, a couple hundred bits of temporaries, and the program itself)

Quote
I mean (and had to write) that binary version cannot be worstier than non-binary.
That is how I understood what you wrote. But all of the binary gcd variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical gcd to reason about the iteration counts of some binary gcd.
jr. member
Activity: 42
Merit: 18
I'mmm sorrry.

Calculating 2/ln(k)*ln(A) I've made kid's mistake. Not divided 2 to, but multiplied to 2.

So... 2/(0.7) * 177 = 506 max iterations.  I'm a little bit unobservant. Embarrassed Embarrassed

2/ln(2)*ln(2^256) = (2/0,6931471805599453)*177,445678223346 = 512,00000000000000923324826168937

So... 512 max iterations.

 Roll Eyes Roll Eyes Roll Eyes Roll Eyes Roll Eyes

But now I am a little bit puzzled. 390 max iterations for non-binary and 506 512 max iterations for binary.
I hope maximizing iterations is in interchange with lowering logic gates(operations).
jr. member
Activity: 42
Merit: 18
However I was thinking that worstiest EGCD(or BEGCD) scenario cannot be worstier than GCD.
No, that is entirely untrue, unfortunately.

Binary GCD is faster to use not because it makes fewer iterations but because there isn't a very expensive division in each iteration.


I am sorry, I messed up with abbreviations.
I mean (and had to write) that binary version cannot be worstier than non-binary.

I provided proof that non-binary version has max 390 iterations. So I was trying to say, that there are no needs to process 735iterations for binary version.
But 390 - was upper(but not lower) bound limiter. Today I provided proof for 248 max iterations. (and as I understand that's lower bound limiter ?!)

What about 735... in other post You provided URL to PDF on cr.yp.to website. This PDF named "Fast constant-time
gcd computation and modular inversion".

In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course constant-time GCD algorithm can contain 735 and even more iterations - most of them are dummy(fake, useless).
staff
Activity: 4326
Merit: 8951
However I was thinking that worstiest EGCD(or BEGCD) scenario cannot be worstier than GCD.
No, that is entirely untrue, unfortunately.

Binary GCD is faster to use not because it makes fewer iterations but because there isn't a very expensive division in each iteration.
jr. member
Activity: 42
Merit: 18
The algorithm you are discussing is a binary (extended) gcd, not the "original" Euclid's algorithm.  The original Euclid's has an arbitrary division (well a modular reduction, but same deal) in the inner loop.



Yes, I know difference. Even for same input EGCD(or BEGCD) probably have different amount of iterations compared to GCD.
However I was thinking that worstiest EGCD(or BEGCD) scenario cannot be worstier than GCD.

Here
https://tuengr.com/V10/001.pdf
I found formulae(based on proved by Ishmukhametov formulaes) for EGCD.
2/ln(k)*ln(A)
it gives 2*0.7*177 = 248max iterations(2 to 256 bits numbers).



Also here
http://www.joppebos.com/files/CTInversion.pdf
https://link.springer.com/content/pdf/10.1007%2F3-540-36400-5_6.pdf
https://www.researchgate.net/publication/3044233_The_Montgomery_modular_inverse_-_Revisited
https://www.researchgate.net/publication/3387259_Improved_Montgomery_modular_inverse_algorithm

I found one more algo called "Montgomery Inverse" with additional phase to calculate back from montgomery-domain.
Proved(?!) fomulae: log2(a) < 2log2(a)
it gives min 256 and max 512 iterations for each phase(if 256bits numbers).

Each iteration contain 5*256bits additions with signed numbers.(substraction ~= addition)

However each second iteration amount of logical gates can be lowered by 1bit.

So in total we have 512 * 5 * half = 1280 additions for 1inverse. (not included muxes)

I think that's ideal algo for point inverse if we are planning to calculate secp256k1 using millions, but not trillions logical operations.
staff
Activity: 4326
Merit: 8951
The algorithm you are discussing is a binary (extended) gcd, not the "original" Euclid's algorithm.  The original Euclid's has an arbitrary division (well a modular reduction, but same deal) in the inner loop.
jr. member
Activity: 42
Merit: 18
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

If p is odd - and it should be, because if were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite factor gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary. At any rate, less than 1+p.

So I think we can put a ceiling for the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime >2].

The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?

but I found one more idea on https://en.wikipedia.org/wiki/Euclidean_algorithm#Worst-case

"Worst-case
If the Euclidean algorithm requires N steps for a pair of natural numbers a > b > 0, the smallest values of a and b for which this is true are the Fibonacci numbers FN+2 and FN+1, respectively.[95] More precisely, if the Euclidean algorithm requires N steps for the pair a > b, then one has a ≥ FN+2 and b ≥ FN+1. This can be shown by induction.[96] If N = 1, b divides a with no remainder; the smallest natural numbers for which this is true is b = 1 and a = 2, which are F2 and F3, respectively. Now assume that the result holds for all values of N up to M − 1. The first step of the M-step algorithm is a = q0b + r0, and the Euclidean algorithm requires M − 1 steps for the pair b > r0. By induction hypothesis, one has b ≥ FM+1 and r0 ≥ FM. Therefore, a = q0b + r0 ≥ b + r0 ≥ FM+1 + FM = FM+2, which is the desired inequality. This proof, published by Gabriel Lamé in 1844, represents the beginning of computational complexity theory,[97] and also the first practical application of the Fibonacci numbers.[95]

This result suffices to show that the number of steps in Euclid's algorithm can never be more than five times the number of its digits (base 10).[98] For if the algorithm requires N steps, then b is greater than or equal to FN+1 which in turn is greater than or equal to φN−1, where φ is the golden ratio. Since b ≥ φN−1, then N − 1 ≤ logφb. Since log10φ > 1/5, (N − 1)/5 < log10φ logφb = log10b. Thus, N ≤ 5 log10b. Thus, the Euclidean algorithm always needs less than O(h) divisions, where h is the number of digits in the smaller number b."



in secp256k1 we are working with module m=
115792089237316195423570985008687907853269984665640564039457584007908834671663

and 256bits 2^256=
115792089237316195423570985008687907853269984665640564039457584007913129639936

in any case that's the maximaiest numbers. they both contain 78digits.

So 78*5= 390 maximum iterations.

staff
Activity: 4326
Merit: 8951
The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?
As I mentioned, it's not very simple.

https://gcd.cr.yp.to/safegcd-20190413.pdf  Covers doing this for a somewhat different GCD algorithm (a variant which is designed to work only on the LSBs of values which makes it much faster in software).

The bounds analysis in the paper is extremely complicated.

Pieter came up with a simpler to understand approach that also produces a somewhat tighter bound (for safegcd): The idea comes from the fact that the steps of the algorithm are all affine transformations on the inputs.  This means that if instead feeding a value into a step of the algorithm you feed in a convex hull that captures all the possible inputs  andthe result is a convex hull that contains all the possible outputs of that step (and maybe a bit more). You can 'feed in' a hull by just sending its vertices through because it always performs an affine transformation.  You handle the branches at each iteration by executing all of them and taking the union of the resulting hulls. Keep running until the hull shrinks below 1 which means that every input would have finished by then. It's kind of like you're concurrently running the algorithm on all inputs at once. At least for safegcd this actually converges.

We know from testing with small inputs that the result is still somewhat conservative rather than exact.  AFAIK no one knows how to compute an tight bound for this kind of algorithm, but useful upper limits exist which is all you need.

You could, of course, just use safegcd... though I think a MSB binary GCD will be more efficient implemented in raw logic.  I don't know a good cite for one off  the top of my head, there probably is a cite in the safegcd paper.  IIRC, GMP has a constant time GCD that implements one of the binary MSB algorithms.

Depending on the application it might be acceptable to even make it sometimes be incorrect. Smiley

E.g. if this were being used for mining you could pick a bound that is wrong less than 1:1000000 times based on experimental testing and that would probably be fine. Smiley


Quote
Now, You are saying that 256 iterations not enough, and we need 735.
No, I was saying a different related algorithm needed 735.  I don't know what the bound is on that particular one. I just know it's going to be greater than 256.   (if you look at the general structure though you should see that it shouldn't usually shave off more than one bit per iteration (only when the values are extremely close will it shave off more)-- but it's not guaranteed to even do that, so that should give you an intuition as to why 256 is far too low).

Quote
You see, Jacobian has 17mul

Well, don't use needlessly inefficient algorithms.  Cheesy  One of your inputs is always affine and the group law can be algebraically simplified for secp256k1.

https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h#L389

(for point*scalar you will never get an infinity in the calculation either.)

For optimizing classical(non-modular) multiplications I've heard about Karatsuba algorithm.

Sure, and toom-cook. Some useful overview in https://cr.yp.to/lineartime/multapps-20080515.pdf.

(squaring is faster than multiplying for pretty much the same reason that karatsuba is a speedup)

Montgomery helps when you have consecutive squaring or multiplying -- saving you from having to reduce at the expense of doubling the size of your numbers. But you need a conversion to/from Montgomery before performing addition/subtraction -- so I wouldn't expect Montgomery to be a win for point addition.

The secp256k1 field order is somewhat more efficient to modular reduce by than an arbitrary number.  There is a complicated implementation in libsecp256k1 though it's optimized for processors and doesn't make the right tradeoffs for dumb logic.

As far as jacobian vs affine.  If you're able to operate on multiple points in parallel and share work in the inversion then affine will be faster for sure. If you cannot, then I remain sceptical that you're not miscounting something.
jr. member
Activity: 42
Merit: 18
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

If p is odd - and it should be, because if were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite factor gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary. At any rate, less than 1+p.

So I think we can put a ceiling for the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime >2].

The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?

on https://math.stackexchange.com/questions/1483118/proving-the-number-of-iterations-in-the-euclidean-algorithm
I found great post


"In two steps we move from (m,n) to (m′=n,n′=mmodn) and then to (m′′=n′,n′′=m′modn′). Assume n≤2k. Then m′,n′≤2k and n′′≤min{m′,n′,|m′−n′|}. Henc n′′>2k−1 could only happen if both |m′−n′|>2k−1 and min{n′,m′}>2k−1, which implies max{m′,n′}>2k−1+2k−1=2k, contradiction. Hence after two steps n≤2k becomes n′′≤2k−1
Hence if n≤2k initially, then after 2k steps we arrive at a situation with n≤20=1. Since it takes one additional step to reach n=0 and terminate, and since initially we may have n>m, it seems we may need up to 2k+2 rounds.

Indeed, the desired bounds are not fully correct, as for example m=1, n=42 (so k=0) requires 2 instead of just 2⋅0=0 steps: (1,42)↦(42,1)↦(1,0).

Remark: A deeper analysis should exhibit that the behaviour is governed by powers of the golden ration ϕ≈1.618 rather than 2–√≈1.414."



In secp256k1 we are working with field where k=256. so result should be maximum 2*256+2 iterations = 514.

But also topicstarter speaks about golden ratio and I don't understand do we need 514 to multiply to 1.618(=832) or to divide with 1.618(=316) ?!

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

If p is odd - and it should be, because if were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite factor gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary. At any rate, less than 1+p.

So I think we can put a ceiling for the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime >2].

The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?
full member
Activity: 206
Merit: 450
From "SPA vulnerabilities of the binary extended Euclidean algorithm":
Binary extended Euclidean algorithm
Inputs: Integers k and p such that gcd(k, p) = 1
Output: k−1 mod p
 1: u = k
 2: v = p
 3: A = 1
 4: C = 0
 5: while u != 0 do
 6:   while even(u) do
 7:     u = u/2
 8:     if even(A) then
 9:       A = A/2
10:     else
11:       A = (A+p)/2
12:   while even(v) do
13:     v = v/2
14:     if even(C) then
15:       C = C/2
16:     else
17:       C = (C+p)/2
18:   if u ≥ v then
19:     u = u-v
20:     A = A-C
21:   else
22:     v = v-u
23:     C = C-A
24: return C mod p


After unrolling the cycles, we could at each iteration do either of
* u = u/2, A = A/2 or (A+p)/2
* v = v/2, C = C/2 or (C+p)/2
* u = u-v, A = A-C
* v = v-u, C = C-A

So, a single loop cycle would be:
1. In parallel:
* Check if u = 0
* Check if u >= v
* u1 = u/2 (drop lsb)
* u2 = u-v (256-bit subtraction)
* v1 = v/2
* v2 = v-u
* A1 = A/2
* A2 = (A+p)/2 (n-bit addition, dropping lsb)
* A3 = A-C     (n-bit subtraction)
* C1 = C/2
* C2 = (C+p)/2
* C3 = C-A
For u and v we need 3 muxes each - passthrough (old value), u1, u2
For A and C we need 4 muxes each - passthrough, A1, A2, A3

2. Select what happens in this cycle, default being passthrough, highest priority first:
* u = 0 : all passthrough
* u is even : u1, if A is even A1 else A2
* v is even : v1, if C is even C1 else C2
* u >= v : u2, A3
* else : v2, C3

What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

full member
Activity: 206
Merit: 450
Most probably I made a mistake in binary extended gcd. It should be slower than mul-mod-p.

I'll post a detailed algorithm -> logic tomorrow.
jr. member
Activity: 42
Merit: 18
However, j2002ba2 provided such great algorithm that there's almost no advantages in using Jacobian coordinates.
I am puzzled by the fact that you wrote this but then pasted code that uses Jacobian addition law as your example. What is your point?

Aside, the extended gcd is not guaranteed to complete in 256 steps (in fact, I expect it seldom completes in 256 steps)-- it can take much more.  The constant time LSB euclidean we have for libsecp256k1 is only guaranteed to complete in 735 steps, and although non-LSB versions have somewhat smaller bounds they're still a lot larger than 256.

It's a little annoying to prove an upper bound on the operations of these GCD algorithms.

I'm surprised to hear the claim that when working with basic logic the inversion could be cheaper than a field multiply. In fact, I don't believe it: I think this is an artifact of radically over-costing the modular multiply and underestimating the number of iterations needed.

Even if inversion is slow, depending on what you're doing it still makes sense to use affine logic because inversions can be batched (though batching only makes sense if inversion is at least three times the cost of a modular multiplication).

gmaxwell,
I have a very base knowledge in cryptography, some knowledge in programming and in electronic circuits, and no knowledge in high grade mathematica.
I am doing my research from scratch. I am sorry in advance if I am sayingh something really wrong.

My point is that
0) (my first version using Euler's theorem algorithm uses ~511mul for each round. no way to discuss.)

1) Jacobian algorithm uses 17mod-multiplications(mul) and 6mod-substractions(sub) for each round.
Preliminary 1mul ~= 511mod-additions(add), and 1sub ~= 1add.
So in total 17mul + 6sub = 17*511add + 6add = 8693add.

2) And binarian Euclidean algorithm uses 256*(1add+3sub+2mux) for each round inverse and 3mul + 6sub for additional calculations.
Preliminary 1mux ~= 1add.
So in total 256*(1add+3sub+2mux) +(3mul+6sub) = 256*(6add) + 3*511add + 6add = 1536 + 1533 + 6 = 3075add


So 3075add < 8693add.

Now, You are saying that 256 iterations not enough, and we need 735. (let's round up to 768)

Ok. (768/256 - 256/256) * 1536add = 2 * 1536add = 3072add need to be added.

3075add + 3072add = 6147add.

Still 6147add < 8693add.



What about modular multiplication optimization.
You see, Jacobian has 17mul and Euclidean 3mul. So any mod-mul optimizations will be useful for both algorithms. However, of course in relative numbers it is more helpful for Jacobian, but not the fact that in absolute numbers Jacobian overbeat.

For optimizing classical(non-modular) multiplications I've heard about Karatsuba algorithm.
For optimizing modular multiplications usually mentioned Montgomery algorithm.
At this moment I absolutely do not understand how do they work both of them. :-)


Also, I am not sure,
gmaxwell, so do You feel affine or Jacobian algo should use smaller amount of logical operations ?!
staff
Activity: 4326
Merit: 8951
However, j2002ba2 provided such great algorithm that there's almost no advantages in using Jacobian coordinates.
I am puzzled by the fact that you wrote this but then pasted code that uses Jacobian addition law as your example. What is your point?

Aside, the extended gcd is not guaranteed to complete in 256 steps (in fact, I expect it seldom completes in 256 steps)-- it can take much more.  The constant time LSB euclidean we have for libsecp256k1 is only guaranteed to complete in 735 steps, and although non-LSB versions have somewhat smaller bounds they're still a lot larger than 256.

It's a little annoying to prove an upper bound on the operations of these GCD algorithms.

I'm surprised to hear the claim that when working with basic logic the inversion could be cheaper than a field multiply. In fact, I don't believe it: I think this is an artifact of radically over-costing the modular multiply and underestimating the number of iterations needed.

Even if inversion is slow, depending on what you're doing it still makes sense to use affine logic because inversions can be batched (though batching only makes sense if inversion is at least three times the cost of a modular multiplication).
jr. member
Activity: 42
Merit: 18
Today I was trying to find more effective ways to ECC point additions.
I paid attention to Jacobian coordinate system.

I found some Python code on internet, then mixed them to my test-code.(see at the end of post)

On my PC with Python2 it gives answer "(36422191471907241029883925342251831624200921388586025344128047678873736520530L,
 20277110887056303803699431755396003735040374760118964734768299847012543114150L)"
and that is correct answer.

We need to pay attention that it still needs modular inversion, but that is done only for last point addition from all 256 in secp256k1.

However, j2002ba2 provided such great algorithm that there's almost no advantages in using Jacobian coordinates.

Here is list of operations:
    U1 = (Xp * Zq ** 2) % P
    U2 = (Xq * Zp ** 2) % P
    S1 = (Yp * Zq ** 3) % P
    S2 = (Yq * Zp ** 3) % P
    H = U2 - U1
    R = S2 - S1
    H2 = (H * H) % P
    H3 = (H * H2) % P
    U1H2 = (U1 * H2) % P
    nx = (R ** 2 - H3 - 2 * U1H2) % P
    ny = (R * (U1H2 - nx) - S1 * H3) % P
    nz = (H * Zp * Zq) % P

In fact, 17mod-multiplications and 6mod-substractions for each round.



Code:
def inv(a, m):
    if a < 0 or m <= a: a = a % m
    c, d = a, m
    uc, vc, ud, vd = 1, 0, 0, 1
    while c != 0:
        q, c, d = divmod(d, c) + (c,)
        uc, vc, ud, vd = ud - q*uc, vd - q*vc, uc, vc
    if ud > 0: return ud
    else: return ud + m

def to_jacobian((Xp, Yp)):
    """
    Convert point to Jacobian coordinates

    :param (Xp,Yp,Zp): First Point you want to add
    :return: Point in Jacobian coordinates
    """
    return (Xp, Yp, 1)

def from_jacobian((Xp, Yp, Zp), P):
    """
    Convert point back from Jacobian coordinates

    :param (Xp,Yp,Zp): First Point you want to add
    :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    :return: Point in default coordinates
    """
    z = inv(Zp, P)
    return ((Xp * z**2) % P, (Yp * z**3) % P)

def jacobian_add((Xp, Yp, Zp), (Xq, Yq, Zq), A, P):
    """
    Add two points in elliptic curves

    :param (Xp,Yp,Zp): First Point you want to add
    :param (Xq,Yq,Zq): Second Point you want to add
    :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    :param A: Coefficient of the first-order term of the equation Y^2 = X^3 + A*X + B (mod p)
    :return: Point that represents the sum of First and Second Point
    """
    if not Yp:
        return (Xq, Yq, Zq)
    if not Yq:
        return (Xp, Yp, Zp)
    U1 = (Xp * Zq ** 2) % P
    U2 = (Xq * Zp ** 2) % P
    S1 = (Yp * Zq ** 3) % P
    S2 = (Yq * Zp ** 3) % P
    if U1 == U2:
        if S1 != S2:
            return (0, 0, 1)
        return jacobian_double((Xp, Yp, Zp), A, P)
    H = U2 - U1
    R = S2 - S1
    H2 = (H * H) % P
    H3 = (H * H2) % P
    U1H2 = (U1 * H2) % P
    nx = (R ** 2 - H3 - 2 * U1H2) % P
    ny = (R * (U1H2 - nx) - S1 * H3) % P
    nz = (H * Zp * Zq) % P
    return (nx, ny, nz)

def fast_add(a,b, A, P):
    return from_jacobian(jacobian_add(to_jacobian(a), to_jacobian(b), A, P), P)

P = 2**256 - 2**32 - 977
A=0

a=39006303077722201472019215118849942292522572644014512222183524570138876820125,67456145645462716033455438706152384917972612258004579542185540136844047420439

b=4032983015753143990395647783770666587927265353624430905763286836981504199392,44353125519324157186344456159742269880631179110473143840214086765587351124293

g2=fast_add(a,b,A,P)

print g2
jr. member
Activity: 42
Merit: 18
I am sorry, after some thinking, I think I've made mistake saying that j2002ba2 is ignoring recursion.

He was writing "inversion: 0x100*add + 0x300*sub + 0x200*multiplex" (but that's in HEX)
In DEC it's "inversion: 256*add + 768*sub + 512*multiplex"

So it coincide with 256iterations.



...and still trying to understand how he got optimizing from 1,598,167,296 gates to 182,767,728 gates...
jr. member
Activity: 42
Merit: 18
or smth similiar to memory relays used.
Not using memory will brutally hurt your performance.

Quote
This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.
Well 4 trillion sounds wrong but a ryzen 9 cpu takes a great many clock cycles to complete one operation. It wouldn't be shocking to me if an unrolled implementation had a gate count comparable to a big cpu.

... though also most of that cpu's gates are in sram, and you've excluded memory sooo...


but my question is theoretical.  Roll Eyes so performance is not important.
you see, previously someone had interest to sha256, now I have interest to secp256k1.
I think for many bitcointalkers it's very interesting what the hell is amount of logical operations needed to calculate secp256k1.

yeah, 4trillion transistors and for me sounds absurdically large. so that's why I am looking for help on forum.
Now I almost for sure that I've made mistake saying that 256bits modular adder consist of 2000 NOT, 3000 AND, 7000 XOR logic gates.
It's more correct to say that 200 NOT, 300 AND, 700 XOR logic gates.
So in fact we will have not 4trillion transistors, but 400billion transistors. But that's still is unacceptably large amount.


j2002ba2 provided result total: 0x5f421900= 1,598,167,296 gates.
But I feel he just skipped that fact that Euclidean algorithm is recursive. His final result can be even larger than mine.
jr. member
Activity: 42
Merit: 18

So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?

With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.




A minimal example:       there might be mistakes

256-bit adder: 255*(2*XOR + 2*AND +1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT

add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex
              = 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex
              = 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT

Looks like inversion is less costly than mul-mod-p. (There might be more efficient mul-mod-p method though.)
We'll use affine coordinates then.
In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT

256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT

some missing logic for the leading zero bits: less than 4K
missing inversion logic: less than 1M

total: 0x5f421900= 1,598,167,296 gates.






I see we are getting almost identical results for 1round, but using slightly different terminology.

My version is "So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM".
Your version is "In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p"

SUBM = sub-mod-p
MULM = mul-mod-p

My POWM means that I am using POWM as of Euler's theorem for modular inversion calculations.
Your "inversion" leads to "Euclidean Binary extended gcd algorithm".



Now what about differences:

1) As I see for SUMM(add-mod-p) I provided: 2000 NOT, 3000 AND, 7000 XOR logic gates.
You provide: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT (in HEX)
Or 1024*XOR + 1536*AND + 769*OR + 1*NOT(in DEC)

May be I have memory faults and there was 200 NOT, 300 AND, 700 XOR logic gates.
In any case I need to make fresh research to count how many gates need to be used for 256bits modular full-adder.

2) As I understand I forgot to include multiplexors as switchers to pre-computated points.

3) In "14.61 Algorithm Binary extended gcd algorithm" step 6. I see "If u ≥ v then". This means we need to use 256bits comparator. As I understand You forgot to include it.

By the way, 256bits comparator can consume even more logical gates than 256bits full adder. (at this moment I am doing some research).
That's because adder uses only 1bit(1wire) for carry-out and carry-in,
while comparator uses either aggregated 2bit(2wires) or segregated 3bits(3wires) quasi-carry for carry-in and carry-out.

4) In "14.61 Algorithm Binary extended gcd algorithm" step 7. I see " otherwise, go to step 4.".
This means that algorithm is recursive. How many maximum recursions there can be ?! I suppose maximum 256.
This means we need to make dummy cycle with 256 iterations,
each iteration should contain all(!) "14.61 Algorithm Binary extended gcd algorithm" calculations except step 2. and step 5. (and also exclude step 1. 3. oops)

I feel after including 256iterations the result will be that Euclidean algorithm will consume almost the same amount of logic gates as for algorithm using Euler's theorem.

It's interesting note, that v=y=1. Yes, v=1, actually. What about y=1 it's not obvious for me from first sight, but I feel You are correct.



And here are some more comments:

1) "14.61 Algorithm Binary extended gcd algorithm" on input uses unsigned integers, but all futher operations are with signed integers. Operations with signed integers consumes 2-3 times more logic gates than operations with unsigned integers.

2) On step 4.2 I see "If A ≡ B ≡ 0 (mod 2) then". If A>0 and B>0 then it is very light operation which equal to "if A and B are both even" - we just need to verify if last bit=0.
But... there can be situations when A<0 and/or B<0. Is it enough to verify that last bit=1 for negative integers ?! At this moment I am not sure. I need some time to think.

3) You are saying "If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers."

Do You mean that privkey not always looks like

1101111101111011 {skipped some bits} 111110111110111

But also can looks like
1001111001110011 {skipped some bits} 111110011100111 (reason for grouping powers of G by 2)

Or like
1000110001100011 {skipped some bits} 111100011000111 (reason for grouping powers of G by 3)

If I understand Your idea correctly, then my opinion is that such grouping has many pros and cons.
I don't know if it less costly. I think it's needed to make many-many test-iterations to measure pros vs. cons.

4) I don't understand Your idea about "Combining several bits" and table provided with "0: 26 PA +  1024*26 mux = 182,767,728".

Do You mean that there will be stand-alone 10bits full-adder, with 10bits+10bits input and 10bits output, so we can seqeuntially passing to it A and B, and getting back result C=A+B ?

If yes, then it is against limitation "only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.".
You see, such trick will lower number of gates, but not lower number of logic operations, also such trick will make estimation of logic operations much harder.

Or do you mean logic gates with not 2input but with 10inputs ? Then it is against limitation "only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used."
Using not 2input gates make estimation of logic operations a little bit harder.

Or do You mean smth different ?!




Waiting Your answers and comments to proceed to final estimations.
staff
Activity: 4326
Merit: 8951
or smth similiar to memory relays used.
Not using memory will brutally hurt your performance.

Quote
This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.
Well 4 trillion sounds wrong but a ryzen 9 cpu takes a great many clock cycles to complete one operation. It wouldn't be shocking to me if an unrolled implementation had a gate count comparable to a big cpu.

... though also most of that cpu's gates are in sram, and you've excluded memory sooo...

jr. member
Activity: 42
Merit: 18
So I had estimations draft, but was too shy to include it in my first post.

Now I include it:

Here is my ruby code to calculate privkey->pubkey ECC(secp256k1).(see at end of post)

As we see calculations consists of 256round.

Each round(if points are pre-computated) consists of:
256bit modular summarize(SUMM): 0 operations
256bit modular substraction(SUBM): 7 operations
256bit modular multiply(MULM): 3 operations
256bit modular exponentional power(POWM): 1 operation

So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM

It can be said that SUBM is almost identical in using logic gates as SUMM,
so to simplify our estimations let's agree that SUBM ~= SUMM.
By using binarian algorithm MULM can be represented as 511*SUMM, and POWM as 511*MULM.

So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM
= 7SUMM + 3*MULM + 1POWM
= 7SUMM + 3*511SUMM + 1*511MULM
= 7SUMM + 3*511SUMM + 1*511*511SUMM
= 262661SUMM

256rounds = 256*262661SUMM = 67241216SUMM

Some years ago I was investigating electrical circuits.
As I remember 256bit modular summarizator preliminary consist of:
2000 NOT, 3000 AND, 7000 XOR logic gates.

However there are a lot of other ways how such summarizator can be represented using other combinations of logic gates. I feel(I hope?!) it can be optimized much.

1round = 262661SUMM * (2000 NOT, 3000 AND, 7000 XOR) = 525322000 NOT, 787983000 AND, 1838627000 XOR
256rounds = 256 * 262661SUMM * (2000 NOT, 3000 AND, 7000 XOR) = 134482432000 NOT, 201723648000 AND, 470688512000 XOR

Hm... 134 billions NOT gates, 201 billions AND gates, 470billions XOR gates.. or equal to almost 1trillion NAND gates ?!...

***

Code:
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

Gy= 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

P = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
 
#N=2**256 - 2**32 - 977
 

def powm(n, e, mod)
    fail ArgumentError, 'negative exponent' if e < 0
    prod = 1
    base = n % mod
    until e.zero?
      prod = (prod * base) % mod if e.odd? #e.odd can give error on newest ruby versions
      e >>= 1
      base = (base * base) % mod
    end
    prod
  end
   
    def add(x1,y1,x2,y2)   
        px, py = x1,y1
        qx, qy = x2,y2
       
   if px == qx && py==qy then       
        lam = (3 * px * px) * powm(2 * py, P - 2, P)     #can be ignored in estimations
   else
        lam = (qy - py) * powm(qx - px, P - 2, P)
   end
   
        rx = lam*lam - px - qx
        ry = lam * (px - rx) - py

      [rx % P, ry % P]
 end
   
       
privkey=0x18E14A7B6A307F426A94F8114701E7C8E774E7F9A47E2C2035DB29A206321725

          tx=Gx     
          ty=Gy
   
       rx=0
       ry=0   
   
for i in 0..255

             if privkey[i]==1 then
         
               if rx == 0 && ry==0  then                             
                  rx,ry=tx,ty
               else
                   rx,ry = add(rx, ry,tx,ty)
               end         
                           
             end
             
            tx,ty = add(tx, ty,tx,ty)       
     
   end
   
   
puts rx,ry
         



     
 
jr. member
Activity: 42
Merit: 18
pooya87, no way !
any math formulae can be represented using combination of AND,OR,NOT gates.
then, using Morgan's law, can be converted to AIGER(and-inverted graph, only AND and NOT gates) or to OIGER(or-inverted graph, only OR and NOT gates).
then can be converted to only-NAND or to only-NOR gates schematics. however such convertations leads to growing amount of transistors, so to minimize transistors count usually it is worth to use mix of AND,OR,NOT gates, also worth adding XOR gates.

NotATether, You are speaking about slightly different thing. Yous speak about signing messages using ECC, I speak about privkey-to-pubkey calculations using ECC.

j2002ba2, Your answer is brilliant ! I see You have excellent knowledgebase and practicebase in ECC. Bravissimo !
It took me hours to carefully read and analyse Your message. I was analysing my calculations against Your calculations, also analysing Binary extended gcd algorithm and so on...

I will write more in next my messages...
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
From your gates construct add-mod-p, sub-mod-p, mul-mod-p, binary-extended-gcd-algorithm. Mix them appropriately.

add-mod-p would be 256+256-to-257 bit adder, 257-bit adder, 256-bit multiplexer
sub-mod-p would be 256 bit not, 256+256+1-to-257 bit adder, 257-bit adder, 256-bit multiplexer
mul-mod-p would use add-mod-p + multiplexor (+ sub-mod-p for doubling) up to 256 times, depending on the number we multiply to
binary-extended-gcd can be found in chapter 14 of Handbook of Applied Cryptography, available online at http://cacr.uwaterloo.ca/hac/
Inversion is done only once via b-e-gcd: since v=y=1 skip steps 2 and 5.

Since G is fixed, you should precompute 20-255G, and use them via multiplexer.

If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers.

This all counts the gates for just the operations after the point multiplication step right?

For the point multiplication there's an algorithm here (https://eprint.iacr.org/2014/427.pdf, section 3, Algorithm 1) that uses Montgomery ladders to implement the point multiplication for Koblitz curves. We need about n-1 add-mod-p gates, and n is a random int between 1 and group order-1 and likely extremely large. I'm not sure how you'd fit all those gates, unless you construct some kind of loop for the add-mod-p gate (is this possible with multiplexers?)
full member
Activity: 206
Merit: 450
But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.

This is a ridiculous statement. All math on computer is done via logical elements. Always has been.


So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?

With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.


From your gates construct add-mod-p, sub-mod-p, mul-mod-p, binary-extended-gcd-algorithm. Mix them appropriately.

add-mod-p would be 256+256-to-257 bit adder, 257-bit adder, 256-bit multiplexer
sub-mod-p would be 256 bit not, 256+256+1-to-257 bit adder, 257-bit adder, 256-bit multiplexer
mul-mod-p would use add-mod-p + multiplexor (+ sub-mod-p for doubling) up to 256 times, depending on the number we multiply to
binary-extended-gcd can be found in chapter 14 of Handbook of Applied Cryptography, available online at http://cacr.uwaterloo.ca/hac/
Inversion is done only once via b-e-gcd: since v=y=1 skip steps 2 and 5.

Since G is fixed, you should precompute 20-255G, and use them via multiplexer.

If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers.

A minimal example:       there might be mistakes

256-bit adder: 255*(2*XOR + 2*AND +1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT

add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex
              = 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex
              = 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT

Looks like inversion is less costly than mul-mod-p. (There might be more efficient mul-mod-p method though.)
We'll use affine coordinates then.
In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT

256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT

some missing logic for the leading zero bits: less than 4K
missing inversion logic: less than 1M

total: 0x5f421900= 1,598,167,296 gates.

Combining several bits:

 2: 128 PA +   4*128 mux = 799,378,944
 4: 64 PA +    16*64 mux = 400,280,064
 7: 37 PA +   128*37 mux = 234,598,648
 8: 32 PA +   256*32 mux = 206,045,952
 9: 29 PA +   512*29 mux = 192,438,200
10: 26 PA +  1024*26 mux = 182,767,728
11: 24 PA +  2048*24 mux = 187,607,616
16: 16 PA + 65536*16 mux = 906,228,096


Looks like the best combination is to do additions in groups of 10 bits. 182,767,728 gates (plus maybe a million more).

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.

We could make a rough estimate if we look at an ECDSA implementation that uses 256-bit wide keys (like the secp256k1 curve used in bitcoin). As the key size changes, the size of the base point and order also changes which I think will increase the number of logic gates correspondingly.



So according to this https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm#Signature_generation_algorithm, we need to find the number of logic gates needed to:

- make a sha256 hash
- do a right shift of N bits (to get leftmost Ln bits)
- generate a random number
- do a curve multiplication
- do a modulus
- do a regular multiplication, an addition, the modular inverse of said random number, and then a second modulus. All in 256-bits.

So I think a transistor is needed for each of 256-bit addition/multiplication/modinverse/modulus and the number of logic gates in those can be found independently. The number of logic gates for right-shift should also be simple to get. And OP already has a rough idea for SHA256.  Curve multiplication still needs to be accounted for, as does random number generation.
legendary
Activity: 3472
Merit: 10611
But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.
jr. member
Activity: 42
Merit: 18
Greetings to bitcointalkers !

On bitcointalk forum I've found theme
"Theoretical minimum # of logic operations to perform double iterated SHA256?"
https://bitcointalksearch.org/topic/theoretical-minimum-of-logic-operations-to-perform-double-iterated-sha256-1029536

I have the same question, but related not to sha256, but to secp256k1.

So, What is the theoretical minimum number of logical operations an ASIC needs
to perform to compute privkey->pubkey secp256k1 ?

With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.

I was trying to make some estimations myself, but getting absurdically large results.
Smth about 1trillion(not billion!) NAND gates. This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.

Thanks in advance for real advices and help. 8-)
Jump to: