Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1) - page 3.

j2002ba2

full member

Activity: 206

Merit: 450

From "SPA vulnerabilities of the binary extended Euclidean algorithm":Binary extended Euclidean algorithm Inputs: Integers k and p such that gcd(k, p) = 1 Output: k⁻¹ mod p 1: u = k 2: v = p 3: A = 1 4: C = 0 5: while u != 0 do 6: while even(u) do 7: u = u/2 8: if even(A) then 9: A = A/2 10: else 11: A = (A+p)/2 12: while even(v) do 13: v = v/2 14: if even(C) then 15: C = C/2 16: else 17: C = (C+p)/2 18: if u ≥ v then 19: u = u-v 20: A = A-C 21: else 22: v = v-u 23: C = C-A 24: return C mod p

After unrolling the cycles, we could at each iteration do either of
* u = u/2, A = A/2 or (A+p)/2
* v = v/2, C = C/2 or (C+p)/2
* u = u-v, A = A-C
* v = v-u, C = C-A

So, a single loop cycle would be:
1. In parallel:
* Check if u = 0
* Check if u >= v
* u1 = u/2 (drop lsb)
* u2 = u-v (256-bit subtraction)
* v1 = v/2
* v2 = v-u
* A1 = A/2
* A2 = (A+p)/2 (n-bit addition, dropping lsb)
* A3 = A-C (n-bit subtraction)
* C1 = C/2
* C2 = (C+p)/2
* C3 = C-A
For u and v we need 3 muxes each - passthrough (old value), u1, u2
For A and C we need 4 muxes each - passthrough, A1, A2, A3

2. Select what happens in this cycle, default being passthrough, highest priority first:
* u = 0 : all passthrough
* u is even : u1, if A is even A1 else A2
* v is even : v1, if C is even C1 else C2
* u >= v : u2, A3
* else : v2, C3

What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

j2002ba2

full member

Activity: 206

Merit: 450

Most probably I made a mistake in binary extended gcd. It should be slower than mul-mod-p.

I'll post a detailed algorithm -> logic tomorrow.

hakabit

jr. member

Activity: 42

Merit: 18

Quote from: gmaxwell on December 04, 2020, 11:35:04 AM

Quote from: hakabit on December 04, 2020, 09:12:06 AM

However, j2002ba2 provided such great algorithm that there's ~~almost~~ no advantages in using Jacobian coordinates.

I am puzzled by the fact that you wrote this but then pasted code that uses Jacobian addition law as your example. What is your point?

Aside, the extended gcd is not guaranteed to complete in 256 steps (in fact, I expect it seldom completes in 256 steps)-- it can take much more. The constant time LSB euclidean we have for libsecp256k1 is only guaranteed to complete in 735 steps, and although non-LSB versions have somewhat smaller bounds they're still a lot larger than 256.

It's a little annoying to prove an upper bound on the operations of these GCD algorithms.

I'm surprised to hear the claim that when working with basic logic the inversion could be cheaper than a field multiply. In fact, I don't believe it: I think this is an artifact of radically over-costing the modular multiply and underestimating the number of iterations needed.

Even if inversion is slow, depending on what you're doing it still makes sense to use affine logic because inversions can be batched (though batching only makes sense if inversion is at least three times the cost of a modular multiplication).

gmaxwell,
I have a very base knowledge in cryptography, some knowledge in programming and in electronic circuits, and no knowledge in high grade mathematica.
I am doing my research from scratch. I am sorry in advance if I am sayingh something really wrong.

My point is that
0) (my first version using Euler's theorem algorithm uses ~511mul for each round. no way to discuss.)

1) Jacobian algorithm uses 17mod-multiplications(mul) and 6mod-substractions(sub) for each round.
Preliminary 1mul ~= 511mod-additions(add), and 1sub ~= 1add.
So in total 17mul + 6sub = 17*511add + 6add = 8693add.

2) And binarian Euclidean algorithm uses 256*(1add+3sub+2mux) for each round inverse and 3mul + 6sub for additional calculations.
Preliminary 1mux ~= 1add.
So in total 256*(1add+3sub+2mux) +(3mul+6sub) = 256*(6add) + 3*511add + 6add = 1536 + 1533 + 6 = 3075add

So 3075add < 8693add.

Now, You are saying that 256 iterations not enough, and we need 735. (let's round up to 768)

Ok. (768/256 - 256/256) * 1536add = 2 * 1536add = 3072add need to be added.

3075add + 3072add = 6147add.

Still 6147add < 8693add.

What about modular multiplication optimization.
You see, Jacobian has 17mul and Euclidean 3mul. So any mod-mul optimizations will be useful for both algorithms. However, of course in relative numbers it is more helpful for Jacobian, but not the fact that in absolute numbers Jacobian overbeat.

For optimizing classical(non-modular) multiplications I've heard about Karatsuba algorithm.
For optimizing modular multiplications usually mentioned Montgomery algorithm.
At this moment I absolutely do not understand how do they work both of them. :-)

Also, I am not sure,
gmaxwell, so do You feel affine or Jacobian algo should use smaller amount of logical operations ?!

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: hakabit on December 04, 2020, 09:12:06 AM

However, j2002ba2 provided such great algorithm that there's ~~almost~~ no advantages in using Jacobian coordinates.

I am puzzled by the fact that you wrote this but then pasted code that uses Jacobian addition law as your example. What is your point?

Aside, the extended gcd is not guaranteed to complete in 256 steps (in fact, I expect it seldom completes in 256 steps)-- it can take much more. The constant time LSB euclidean we have for libsecp256k1 is only guaranteed to complete in 735 steps, and although non-LSB versions have somewhat smaller bounds they're still a lot larger than 256.

It's a little annoying to prove an upper bound on the operations of these GCD algorithms.

I'm surprised to hear the claim that when working with basic logic the inversion could be cheaper than a field multiply. In fact, I don't believe it: I think this is an artifact of radically over-costing the modular multiply and underestimating the number of iterations needed.

Even if inversion is slow, depending on what you're doing it still makes sense to use affine logic because inversions can be batched (though batching only makes sense if inversion is at least three times the cost of a modular multiplication).

hakabit

jr. member

Activity: 42

Merit: 18

Today I was trying to find more effective ways to ECC point additions.
I paid attention to Jacobian coordinate system.

I found some Python code on internet, then mixed them to my test-code.(see at the end of post)

On my PC with Python2 it gives answer "(36422191471907241029883925342251831624200921388586025344128047678873736520530L,
20277110887056303803699431755396003735040374760118964734768299847012543114150L)"
and that is correct answer.

We need to pay attention that it still needs modular inversion, but that is done only for last point addition from all 256 in secp256k1.

However, j2002ba2 provided such great algorithm that there's ~~almost~~ no advantages in using Jacobian coordinates.

Here is list of operations:
   U1 = (Xp * Zq ** 2) % P
   U2 = (Xq * Zp ** 2) % P
   S1 = (Yp * Zq ** 3) % P
   S2 = (Yq * Zp ** 3) % P
   H = U2 - U1
   R = S2 - S1
   H2 = (H * H) % P
   H3 = (H * H2) % P
   U1H2 = (U1 * H2) % P
   nx = (R ** 2 - H3 - 2 * U1H2) % P
   ny = (R * (U1H2 - nx) - S1 * H3) % P
   nz = (H * Zp * Zq) % P

In fact, 17mod-multiplications and 6mod-substractions for each round.

Code:

def inv(a, m):
    if a < 0 or m <= a: a = a % m
    c, d = a, m
    uc, vc, ud, vd = 1, 0, 0, 1
    while c != 0:
        q, c, d = divmod(d, c) + (c,)
        uc, vc, ud, vd = ud - q*uc, vd - q*vc, uc, vc
    if ud > 0: return ud
    else: return ud + m

def to_jacobian((Xp, Yp)):
    """
    Convert point to Jacobian coordinates

    :param (Xp,Yp,Zp): First Point you want to add
    :return: Point in Jacobian coordinates
    """
    return (Xp, Yp, 1)

def from_jacobian((Xp, Yp, Zp), P):
    """
    Convert point back from Jacobian coordinates

    :param (Xp,Yp,Zp): First Point you want to add
    :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    :return: Point in default coordinates
    """
    z = inv(Zp, P)
    return ((Xp * z**2) % P, (Yp * z**3) % P)

def jacobian_add((Xp, Yp, Zp), (Xq, Yq, Zq), A, P):
    """
    Add two points in elliptic curves

    :param (Xp,Yp,Zp): First Point you want to add
    :param (Xq,Yq,Zq): Second Point you want to add
    :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    :param A: Coefficient of the first-order term of the equation Y^2 = X^3 + A*X + B (mod p)
    :return: Point that represents the sum of First and Second Point
    """
    if not Yp:
        return (Xq, Yq, Zq)
    if not Yq:
        return (Xp, Yp, Zp)
    U1 = (Xp * Zq ** 2) % P
    U2 = (Xq * Zp ** 2) % P
    S1 = (Yp * Zq ** 3) % P
    S2 = (Yq * Zp ** 3) % P
    if U1 == U2:
        if S1 != S2:
            return (0, 0, 1)
        return jacobian_double((Xp, Yp, Zp), A, P)
    H = U2 - U1
    R = S2 - S1
    H2 = (H * H) % P
    H3 = (H * H2) % P
    U1H2 = (U1 * H2) % P
    nx = (R ** 2 - H3 - 2 * U1H2) % P
    ny = (R * (U1H2 - nx) - S1 * H3) % P
    nz = (H * Zp * Zq) % P
    return (nx, ny, nz)

def fast_add(a,b, A, P):
    return from_jacobian(jacobian_add(to_jacobian(a), to_jacobian(b), A, P), P)

P = 2**256 - 2**32 - 977
A=0

a=39006303077722201472019215118849942292522572644014512222183524570138876820125,67456145645462716033455438706152384917972612258004579542185540136844047420439

b=4032983015753143990395647783770666587927265353624430905763286836981504199392,44353125519324157186344456159742269880631179110473143840214086765587351124293

g2=fast_add(a,b,A,P)

print g2

hakabit

jr. member

Activity: 42

Merit: 18

I am sorry, after some thinking, I think I've made mistake saying that j2002ba2 is ignoring recursion.

He was writing "inversion: 0x100*add + 0x300*sub + 0x200*multiplex" (but that's in HEX)
In DEC it's "inversion: 256*add + 768*sub + 512*multiplex"

So it coincide with 256iterations.

...and still trying to understand how he got optimizing from 1,598,167,296 gates to 182,767,728 gates...

hakabit

jr. member

Activity: 42

Merit: 18

Quote from: gmaxwell on December 03, 2020, 08:40:13 PM

Quote from: hakabit on December 02, 2020, 08:10:24 PM

or smth similiar to memory relays used.

Not using memory will brutally hurt your performance.

Quote

This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.

Well 4 trillion sounds wrong but a ryzen 9 cpu takes a great many clock cycles to complete one operation. It wouldn't be shocking to me if an unrolled implementation had a gate count comparable to a big cpu.

... though also most of that cpu's gates are in sram, and you've excluded memory sooo...

but my question is theoretical. Roll Eyes

so performance is not important.
you see, previously someone had interest to sha256, now I have interest to secp256k1.
I think for many bitcointalkers it's very interesting what the hell is amount of logical operations needed to calculate secp256k1.

yeah, 4trillion transistors and for me sounds absurdically large. so that's why I am looking for help on forum.
Now I almost for sure that I've made mistake saying that 256bits modular adder consist of 2000 NOT, 3000 AND, 7000 XOR logic gates.
It's more correct to say that 200 NOT, 300 AND, 700 XOR logic gates.
So in fact we will have not 4trillion transistors, but 400billion transistors. But that's still is unacceptably large amount.

j2002ba2 provided result total: 0x5f421900= 1,598,167,296 gates.
But I feel he just skipped that fact that Euclidean algorithm is recursive. His final result can be even larger than mine.

hakabit

jr. member

Activity: 42

Merit: 18

Quote from: j2002ba2 on December 03, 2020, 07:19:22 AM

Quote from: hakabit on December 02, 2020, 08:10:24 PM

So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?

With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.

A minimal example: there might be mistakes

256-bit adder: 255*(2*XOR + 2*AND +1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT

add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex
   = 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex
   = 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT

Looks like inversion is less costly than mul-mod-p. (There might be more efficient mul-mod-p method though.)
We'll use affine coordinates then.
In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT

256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT

some missing logic for the leading zero bits: less than 4K
missing inversion logic: less than 1M

total: 0x5f421900= 1,598,167,296 gates.

I see we are getting almost identical results for 1round, but using slightly different terminology.

My version is "So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM".
Your version is "In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p"

SUBM = sub-mod-p
MULM = mul-mod-p

My POWM means that I am using POWM as of Euler's theorem for modular inversion calculations.
Your "inversion" leads to "Euclidean Binary extended gcd algorithm".

Now what about differences:

1) As I see for SUMM(add-mod-p) I provided: 2000 NOT, 3000 AND, 7000 XOR logic gates.
You provide: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT (in HEX)
Or 1024*XOR + 1536*AND + 769*OR + 1*NOT(in DEC)

May be I have memory faults and there was 200 NOT, 300 AND, 700 XOR logic gates.
In any case I need to make fresh research to count how many gates need to be used for 256bits modular full-adder.

2) As I understand I forgot to include multiplexors as switchers to pre-computated points.

3) In "14.61 Algorithm Binary extended gcd algorithm" step 6. I see "If u ≥ v then". This means we need to use 256bits comparator. As I understand You forgot to include it.

By the way, 256bits comparator can consume even more logical gates than 256bits full adder. (at this moment I am doing some research).
That's because adder uses only 1bit(1wire) for carry-out and carry-in,
while comparator uses either aggregated 2bit(2wires) or segregated 3bits(3wires) quasi-carry for carry-in and carry-out.

4) In "14.61 Algorithm Binary extended gcd algorithm" step 7. I see " otherwise, go to step 4.".
This means that algorithm is recursive. How many maximum recursions there can be ?! I suppose maximum 256.
This means we need to make dummy cycle with 256 iterations,
each iteration should contain all(!) "14.61 Algorithm Binary extended gcd algorithm" calculations except step 2. and step 5. (and also exclude step 1. 3. oops)

I feel after including 256iterations the result will be that Euclidean algorithm will consume almost the same amount of logic gates as for algorithm using Euler's theorem.

It's interesting note, that v=y=1. Yes, v=1, actually. What about y=1 it's not obvious for me from first sight, but I feel You are correct.

And here are some more comments:

1) "14.61 Algorithm Binary extended gcd algorithm" on input uses unsigned integers, but all futher operations are with signed integers. Operations with signed integers consumes 2-3 times more logic gates than operations with unsigned integers.

2) On step 4.2 I see "If A ≡ B ≡ 0 (mod 2) then". If A>0 and B>0 then it is very light operation which equal to "if A and B are both even" - we just need to verify if last bit=0.
But... there can be situations when A<0 and/or B<0. Is it enough to verify that last bit=1 for negative integers ?! At this moment I am not sure. I need some time to think.

3) You are saying "If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers."

Do You mean that privkey not always looks like

1101111101111011 {skipped some bits} 111110111110111

But also can looks like
1001111001110011 {skipped some bits} 111110011100111 (reason for grouping powers of G by 2)

Or like
1000110001100011 {skipped some bits} 111100011000111 (reason for grouping powers of G by 3)

If I understand Your idea correctly, then my opinion is that such grouping has many pros and cons.
I don't know if it less costly. I think it's needed to make many-many test-iterations to measure pros vs. cons.

4) I don't understand Your idea about "Combining several bits" and table provided with "0: 26 PA + 1024*26 mux = 182,767,728".

Do You mean that there will be stand-alone 10bits full-adder, with 10bits+10bits input and 10bits output, so we can seqeuntially passing to it A and B, and getting back result C=A+B ?

If yes, then it is against limitation "only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.".
You see, such trick will lower number of gates, but not lower number of logic operations, also such trick will make estimation of logic operations much harder.

Or do you mean logic gates with not 2input but with 10inputs ? Then it is against limitation "only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used."
Using not 2input gates make estimation of logic operations a little bit harder.

Or do You mean smth different ?!

Waiting Your answers and comments to proceed to final estimations.

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: hakabit on December 02, 2020, 08:10:24 PM

or smth similiar to memory relays used.

Not using memory will brutally hurt your performance.

Quote

This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.

Well 4 trillion sounds wrong but a ryzen 9 cpu takes a great many clock cycles to complete one operation. It wouldn't be shocking to me if an unrolled implementation had a gate count comparable to a big cpu.

... though also most of that cpu's gates are in sram, and you've excluded memory sooo...

hakabit

jr. member

Activity: 42

Merit: 18

So I had estimations draft, but was too shy to include it in my first post.

Now I include it:

Here is my ruby code to calculate privkey->pubkey ECC(secp256k1).(see at end of post)

As we see calculations consists of 256round.

Each round(if points are pre-computated) consists of:
256bit modular summarize(SUMM): 0 operations
256bit modular substraction(SUBM): 7 operations
256bit modular multiply(MULM): 3 operations
256bit modular exponentional power(POWM): 1 operation

So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM

It can be said that SUBM is almost identical in using logic gates as SUMM,
so to simplify our estimations let's agree that SUBM ~= SUMM.
By using binarian algorithm MULM can be represented as 511*SUMM, and POWM as 511*MULM.

So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM
= 7SUMM + 3*MULM + 1POWM
= 7SUMM + 3*511SUMM + 1*511MULM
= 7SUMM + 3*511SUMM + 1*511*511SUMM
= 262661SUMM

256rounds = 256*262661SUMM = 67241216SUMM

Some years ago I was investigating electrical circuits.
As I remember 256bit modular summarizator preliminary consist of:
2000 NOT, 3000 AND, 7000 XOR logic gates.

However there are a lot of other ways how such summarizator can be represented using other combinations of logic gates. I feel(I hope?!) it can be optimized much.

1round = 262661SUMM * (2000 NOT, 3000 AND, 7000 XOR) = 525322000 NOT, 787983000 AND, 1838627000 XOR
256rounds = 256 * 262661SUMM * (2000 NOT, 3000 AND, 7000 XOR) = 134482432000 NOT, 201723648000 AND, 470688512000 XOR

Hm... 134 billions NOT gates, 201 billions AND gates, 470billions XOR gates.. or equal to almost 1trillion NAND gates ?!...

***

Code:

Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

Gy= 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

P = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
  
#N=2**256 - 2**32 - 977
 

def powm(n, e, mod)
    fail ArgumentError, 'negative exponent' if e < 0
    prod = 1
    base = n % mod
    until e.zero?
      prod = (prod * base) % mod if e.odd? #e.odd can give error on newest ruby versions
      e >>= 1
      base = (base * base) % mod
    end
    prod
  end
   
    def add(x1,y1,x2,y2)   
        px, py = x1,y1
        qx, qy = x2,y2
       
   if px == qx && py==qy then       
        lam = (3 * px * px) * powm(2 * py, P - 2, P)     #can be ignored in estimations
   else
        lam = (qy - py) * powm(qx - px, P - 2, P)
   end
   
        rx = lam*lam - px - qx
        ry = lam * (px - rx) - py

      [rx % P, ry % P]
 end
   
        
privkey=0x18E14A7B6A307F426A94F8114701E7C8E774E7F9A47E2C2035DB29A206321725

          tx=Gx     
          ty=Gy
    
       rx=0
       ry=0    
    
for i in 0..255

             if privkey[i]==1 then
          
               if rx == 0 && ry==0  then                             
                  rx,ry=tx,ty
               else
                   rx,ry = add(rx, ry,tx,ty)
               end         
                           
             end
             
            tx,ty = add(tx, ty,tx,ty)        
      
   end
    
    
puts rx,ry

hakabit

jr. member

Activity: 42

Merit: 18

pooya87, no way !
any math formulae can be represented using combination of AND,OR,NOT gates.
then, using Morgan's law, can be converted to AIGER(and-inverted graph, only AND and NOT gates) or to OIGER(or-inverted graph, only OR and NOT gates).
then can be converted to only-NAND or to only-NOR gates schematics. however such convertations leads to growing amount of transistors, so to minimize transistors count usually it is worth to use mix of AND,OR,NOT gates, also worth adding XOR gates.

NotATether, You are speaking about slightly different thing. Yous speak about signing messages using ECC, I speak about privkey-to-pubkey calculations using ECC.

j2002ba2, Your answer is brilliant ! I see You have excellent knowledgebase and practicebase in ECC. Bravissimo !
It took me hours to carefully read and analyse Your message. I was analysing my calculations against Your calculations, also analysing Binary extended gcd algorithm and so on...

I will write more in next my messages...

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: j2002ba2 on December 03, 2020, 07:19:22 AM

From your gates construct add-mod-p, sub-mod-p, mul-mod-p, binary-extended-gcd-algorithm. Mix them appropriately.

add-mod-p would be 256+256-to-257 bit adder, 257-bit adder, 256-bit multiplexer
sub-mod-p would be 256 bit not, 256+256+1-to-257 bit adder, 257-bit adder, 256-bit multiplexer
mul-mod-p would use add-mod-p + multiplexor (+ sub-mod-p for doubling) up to 256 times, depending on the number we multiply to
binary-extended-gcd can be found in chapter 14 of Handbook of Applied Cryptography, available online at http://cacr.uwaterloo.ca/hac/
Inversion is done only once via b-e-gcd: since v=y=1 skip steps 2 and 5.

Since G is fixed, you should precompute 2^0-255G, and use them via multiplexer.

If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers.

This all counts the gates for just the operations after the point multiplication step right?

For the point multiplication there's an algorithm here (https://eprint.iacr.org/2014/427.pdf, section 3, Algorithm 1) that uses Montgomery ladders to implement the point multiplication for Koblitz curves. We need about n-1 add-mod-p gates, and n is a random int between 1 and group order-1 and likely extremely large. I'm not sure how you'd fit all those gates, unless you construct some kind of loop for the add-mod-p gate (is this possible with multiplexers?)

j2002ba2

full member

Activity: 206

Merit: 450

Quote from: pooya87 on December 02, 2020, 11:52:05 PM

But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.

This is a ridiculous statement. All math on computer is done via logical elements. Always has been.

Quote from: hakabit on December 02, 2020, 08:10:24 PM

So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?

With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.

From your gates construct add-mod-p, sub-mod-p, mul-mod-p, binary-extended-gcd-algorithm. Mix them appropriately.

add-mod-p would be 256+256-to-257 bit adder, 257-bit adder, 256-bit multiplexer
sub-mod-p would be 256 bit not, 256+256+1-to-257 bit adder, 257-bit adder, 256-bit multiplexer
mul-mod-p would use add-mod-p + multiplexor (+ sub-mod-p for doubling) up to 256 times, depending on the number we multiply to
binary-extended-gcd can be found in chapter 14 of Handbook of Applied Cryptography, available online at http://cacr.uwaterloo.ca/hac/
Inversion is done only once via b-e-gcd: since v=y=1 skip steps 2 and 5.

Since G is fixed, you should precompute 2^0-255G, and use them via multiplexer.

If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers.

A minimal example: there might be mistakes

256-bit adder: 255*(2*XOR + 2*AND +1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT

add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex
   = 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex
   = 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT

Looks like inversion is less costly than mul-mod-p. (There might be more efficient mul-mod-p method though.)
We'll use affine coordinates then.
In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT

256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT

some missing logic for the leading zero bits: less than 4K
missing inversion logic: less than 1M

total: 0x5f421900= 1,598,167,296 gates.

Combining several bits:
2: 128 PA + 4*128 mux = 799,378,944 4: 64 PA + 16*64 mux = 400,280,064 7: 37 PA + 128*37 mux = 234,598,648 8: 32 PA + 256*32 mux = 206,045,952 9: 29 PA + 512*29 mux = 192,438,200 10: 26 PA + 1024*26 mux = 182,767,728 11: 24 PA + 2048*24 mux = 187,607,616 16: 16 PA + 65536*16 mux = 906,228,096

Looks like the best combination is to do additions in groups of 10 bits. 182,767,728 gates (plus maybe a million more).

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: pooya87 on December 02, 2020, 11:52:05 PM

But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.

We could make a rough estimate if we look at an ECDSA implementation that uses 256-bit wide keys (like the secp256k1 curve used in bitcoin). As the key size changes, the size of the base point and order also changes which I think will increase the number of logic gates correspondingly.

So according to this https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm#Signature_generation_algorithm, we need to find the number of logic gates needed to:

- make a sha256 hash
- do a right shift of N bits (to get leftmost Ln bits)
- generate a random number
- do a curve multiplication
- do a modulus
- do a regular multiplication, an addition, the modular inverse of said random number, and then a second modulus. All in 256-bits.

So I think a transistor is needed for each of 256-bit addition/multiplication/modinverse/modulus and the number of logic gates in those can be found independently. The number of logic gates for right-shift should also be simple to get. And OP already has a rough idea for SHA256. Curve multiplication still needs to be accounted for, as does random number generation.

pooya87

legendary

Activity: 3472

Merit: 10611

But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.

hakabit

jr. member

Activity: 42

Merit: 18

Greetings to bitcointalkers !

On bitcointalk forum I've found theme
"Theoretical minimum # of logic operations to perform double iterated SHA256?"
https://bitcointalksearch.org/topic/theoretical-minimum-of-logic-operations-to-perform-double-iterated-sha256-1029536

I have the same question, but related not to sha256, but to secp256k1.

So, What is the theoretical minimum number of logical operations an ASIC needs
to perform to compute privkey->pubkey secp256k1 ?

With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.

I was trying to make some estimations myself, but getting absurdically large results.
Smth about 1trillion(not billion!) NAND gates. This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.

Thanks in advance for real advices and help. 8-)

Topic: Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1) - page 3. (Read 1291 times)