
Topic: Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1) - page 2. (Read 1259 times)

jr. member
Activity: 42
Merit: 18

In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p
  = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769


Some more questions.
1) What is your formula for affine point addition with only two multiplications?
The most optimized version I found needs multiplication + multiplication + squaring = 3 multiplications.
2) What is your formula for Jacobian point addition with only 11 multiplications?
The most optimized version I found needs 17 multiplications.

Of course, there are Jacobian formulas with fewer multiplications, but they require Z1 = 1, so they can only be used to add pre-computed or re-converted affine<->Jacobian points.

Also I would like to note that in test runs for affine we see roughly +-384 iterations, not +-760.
jr. member
Activity: 42
Merit: 18
Point addition


Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit  PA   256-bit-muxes

1: 255   2*256 + 256           1,676,343,279

10:  25   2*(1024*25 + 64) + 26   219,403,033




Using mixed Jacobian-affine coordinates,
adding in groups of 10 bits (25 1024-point memories plus 64-point memory),
we get 219,403,033 gates.



Thanks for the fresh info!

I still can't understand how you optimize the gate count by combining bits.
In your initial post you were using this technique for additions; now I see you are using it for any operation.

Please explain it to me in very simple steps.

For example, we have two 32-bit numbers to add.

00111111111111111111111111111111
00000000000000000000000000000001

And the same numbers split into 10-bit limbs.

1111111111 1111111111 1111111111
0000000000 0000000000 0000000001

Let's say we are using a classical cascaded full-adder (a ripple-carry adder).
How can the number of gates be lowered by this splitting?!


Or... do you mean using not 2-input but 10-input gates, and adding in parallel not two but ten numbers? But that is against my initial limitations.

Or is this technique somehow similar to the Karatsuba algorithm?!
full member
Activity: 206
Merit: 447
Point addition

In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p
  = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769

Affine coordinates:
bit  PA   256-bit-muxes
 1: 255   2*256             4,421,791,553
 2: 127   2*4*128           2,202,816,961
 4:  63   2*16*64           1,093,920,257
 8:  31   2*256*32            550,102,561
16:  15   2*65536*16        1,872,792,113
 9:  28   2*(512*28 + 16)     507,560,196
10:  25   2*(1024*25 + 64)    472,941,607
11:  23   2*(2048*23 + 8 )    471,251,001
12:  21   2*(4096*21 + 16)    496,432,331


Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit  PA   256-bit-muxes
 1: 255   2*256 + 256           1,676,343,279
 2: 127   2*4*128 + 128           844,170,607
 4:  63   2*16*64 + 64            428,674,863
 8:  31   2*256*32 + 32           231,557,647
16:  15   2*65536*16 + 16       1,727,597,439
 9:  28   2*(512*28 + 16) + 29    221,518,452
10:  25   2*(1024*25 + 64) + 26   219,403,033
11:  23   2*(2048*23 + 8 ) + 24   239,381,207



Using mixed Jacobian-affine coordinates,
adding in groups of 10 bits (25 1024-point memories plus 64-point memory),
we get 219,403,033 gates.
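
As a cross-check of the arithmetic (my own tally, using the per-operation costs quoted above and my reading of the mux column), the 10-bit mixed Jacobian-affine row can be reproduced like this:

Code:
#re-deriving the 10-bit mixed Jacobian-affine row from the quoted costs
PA_COST  = 6503656                 # Jacobian plus affine point addition, gates
FINAL    = 16141271 + 2 * 589568   # final inversion + 2 mul-mod-p, gates
MUX_COST = 769                     # one 256-bit multiplexer, gates

point_adds = 25                    # 26 groups of bits -> 25 additions
muxes = 2 * (1024 * 25 + 64) + 26  # mux count as given in the table row

print(point_adds * PA_COST + FINAL + muxes * MUX_COST)   # 219403033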

full member
Activity: 206
Merit: 447
Multiplication is cheaper since p is fixed.

1-bit full-add: 2*XOR + 2*AND + 1*OR
1-bit half-add: 1*XOR + 1*AND
1-bit full-add-zero: 1*XOR + 1*AND
1-bit full-add-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit half-add-one: 1*NOT + wire

256-bit-add: 255*full-add + 1*half-add
  = 511*XOR + 511*AND + 1*OR
256-bit sub: 255*full-subtract + 1*half-subtract
  = 511*XOR + 511*AND + 255*OR + 511*NOT
256-bit-add-p: 249*full-add-one + 1*half-add-one + 6*full-add-zero
  = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit-add-minus-p: 249*full-add-zero + 1*half-add-one + 6*full-add-one
  = 249*XOR + 249*AND + 1*NOT + 6*XNOR + 6*OR
256-bit mux: 768*NAND + 1*NOT

add-mod-p: 256-bit-add + 256-bit-add-minus-p + 256-bit-mux
 = 6*XNOR + 760*XOR + 760*AND + 7*OR + 768*NAND + 2*NOT
sub-mod-p: 256-bit-sub + 256-bit-add-p + 256-bit-mux
 = 249*XNOR + 517*XOR + 517*AND + 504*OR + 768*NAND + 513*NOT
mul-mod-p: 256*add-mod-p
 = 0x600*XNOR + 0x2f800*XOR + 0x2f800*AND + 0x700*OR + 0x30000*NAND + 0x200*NOT

mul-mod-p is 589,568 gates; this makes it ~27 times cheaper than inversion
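
A quick tally (my own arithmetic, just re-adding the per-gate counts above) reproducing the mul-mod-p total and the ratio to the inversion circuit:

Code:
#re-adding the add-mod-p counts; mul-mod-p = 256 * add-mod-p
add_mod_p = {'XNOR': 6, 'XOR': 760, 'AND': 760, 'OR': 7, 'NAND': 768, 'NOT': 2}
mul_mod_p = {gate: 256 * n for gate, n in add_mod_p.items()}

total = sum(mul_mod_p.values())
print(total)                 # 589568
print(16141271 / total)      # ~27.4, i.e. roughly 27 times cheaper than inversion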

jr. member
Activity: 42
Merit: 18
Here we go!
One more piece of test code. There is no over-256-bits bug, and it also returns the symmetric-by-modulus result. The condition-checking logic is a bit fuzzy; I would call it quasi-recursive. It is binary in its nature, just implemented in a somewhat strange way.

Part of results:

test# calls# in [out, mod symm out]
0 364 75316636100529471829340583998430204208230490280793297630878645139432980850156 [-21481456315583653538655726848489351884800251925024345069131289520822185938657, 33025674478099610959444009195219596039238050737410647384245105885135854051132]
1 385 39888426508246177183753123701414070167668113673592482698911298409940354217536 [-29960760575314177163053332470999754258247890690156906035102057524026276070769, 86973073792154943490949046208733472044349727771710623997020383424312036814343]
2 360 14620752759159978489146343667071343355650196237127974340021988638599091328303 [12174829735737702474779493260773244294171916983485771075179095797017886235155, -96421093662667652633700306227539987869225167103772496913502848083575395070788]
3 383 102729399071985721675035906741175076856965109198477353566672326776001633036975 [70580439539010164434200376999720640409016804834521829709981617744376762996302, -79555186999421925612988503851178736579078677622324526318628793408380057609471]
4 370 46339676628294076970974548809934900743231066044015694600462844368554430445143 [2539235054366037579706180091206398081627721978267432321188130108233551671292, -6344958648894604413478372293328991555118887139277093896810498521216296850165]
5 368 26660698059602402176427980216008131458879504029132938908230058389025515567933 [16540677529517895161096648910563987919480254183872798943295671897534551358308, -71839064538438794363947864942753805375404185837480918551631678838489697347191]
6 354 594851748071041868996674850995031382743985475622021247651263573480083167890 [-64638328581828718437634047594178337042202536377421456333765449858808997403, 12582306659716172481129154954429736186019811569145338237994747628567568503971]
7 359 80681079014327548209785902873606497791641955222184391014151038138305717861451 [44686311015859290531071619784617096434102136509364594238047222505744964738512, -64133020728639186296951506401894836855168854511276042070060900652427696988205]
8 372 82669520055979716581712547816646165489596227017635063706256738823534854476311 [45051940896140100306048968032786700593661378537963560664532177163878090793890, -63102560133743198571360833063775568154007385207992635050991710591879861205579]
9 360 55943743540139384302837049042449505019549115662377158567347079861108763368228 [-21482989564914934180757589310789560187172090496876602635626422561634142769637, 44465387680038776841839359752521363942810960966395512063353292864302945775019]
10 349 66724607627136654255694803441709862788832479174235937080362394192249729343898 [29069806528358354934610693165080779089541395635574791746246411666180327081765, -50446960294663850907746139715414905993368124572672804919153802516270964965753]
#
#
#

As in the previous code, the iteration count is close to 384.
But it's not a good iteration indicator. I have an idea how to count the real, fully unrolled iterations, but I need some time to implement it.


Code:
#code by hakabit supraspecially for gmaxwell
#algorithm source: https://dspace.cvut.cz/bitstream/handle/10467/77227/F3-DP-2018-Zhumaniezov-Alisher-priloha-Testing.py

import random

def bin_ext_gcd(a, b):

    global cc
    cc = cc + 1

    if a == 0:
        #print(a, b, 0, 1)
        return [0, 1]

    if b == 0:
        #print(a, b, 1, 0)
        return [1, 0]

    if a == b:
        #print(a, b, 1, 0)
        return [1, 0]

    if (a % 2 == 0) & (b % 2 == 0):
        x, y = bin_ext_gcd(a // 2, b // 2)
        #print(a, b, x, y)
        return [x, y]

    if a % 2 == 0:
        x, y = bin_ext_gcd(a // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y
        else:
            x, y = (x + b) // 2, y - a // 2
        #print(a, b, x, y)
        return [x, y]

    if b % 2 == 0:
        x, y = bin_ext_gcd(a, b // 2)
        if y % 2 == 0:
            x, y = x, y // 2
        else:
            x, y = x - b // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

    if a > b:
        x, y = bin_ext_gcd((a - b) // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y - x // 2
        else:
            x, y = (x + b) // 2, y - (a - b) // 2 - (x + b) // 2
        #print(a, b, x, y)
        return [x, y]

    if a < b:
        x, y = bin_ext_gcd(a, (b - a) // 2)
        if y % 2 == 0:
            x, y = x - y // 2, y // 2
        else:
            x, y = x - (b - a) // 2 - (y + a) // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

cc = 0

p = 2**256 - 2**32 - 977


#cc=0
#a=12
#res=bin_ext_gcd(p,a)
#print ("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print (res,cc)



print ("test# calls# in [out, mod symm out]")
for i in range(100):

    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p

    if (ii < 2):
        ii = ii + 2

    cc = 0
    res = bin_ext_gcd(p, ii)
    print(i,cc,ii,res)

#end
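
As a quick sanity check (my addition, not part of the script above): the returned pair (x, y) should satisfy x*p + y*ii == gcd(p, ii) == 1, so y taken mod p is the modular inverse of ii. For ii = 12 this reproduces the expected value from the commented-out initial test.

Code:
#sanity check for the script above (assumes bin_ext_gcd, p and cc are defined)
cc = 0
ii = 12
x, y = bin_ext_gcd(p, ii)
assert x * p + y * ii == 1        # Bezout identity; gcd(p, ii) = 1 since p is prime
inv = y % p
assert (inv * ii) % p == 1
print(inv)   # 48246703848881748093154577086953294938862493610683568349773993336628681113193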
jr. member
Activity: 42
Merit: 18
So here is my Python code to test the iterations (see it at the end).

And some of the test results:

test# in (out, iterx, iter1, iter2, iter3, itersum):
0 51387622388232169033260144332317924780174711222857253999722512827559407809221 (42194227499242744317322493862270938245042350176722270391089169414969189792505, 173, 0, 177, 193, 370)
1 49303630616562965676786885508349510956169277855033004449417649058314657299485 (47016891994125063651541437063141305860063754426223854778073828393151612275997, 178, 0, 192, 172, 364)
2 60564124354985471474884769266985666656540021107491550318511011936878151315223 (96860706416467978990701996056838683430055307461271246547818608610050676002308, 181, 0, 175, 187, 362)
3 26876395035512405675280385359306825136587569085110256960487392157911296412919 (68790574322733353789092487174594866418670893295029170296485893186408773988152, 182, 0, 176, 183, 359)
4 11023636690140810652109138350092873223350285162460178239756198575491250070806 (79574857778441618190374605398778396331605356310850458929090676313728623789452, 186, 0, 163, 191, 354)
5 48673899963758261337290584766356417468862845552291965980487962550310458587722 (43493957935191599280071520123333639912283544665443371757241332743464486359707, 183, 0, 177, 184, 361)
6 13375659417630302272707017036247711675902473928798522670864387374916032135255 (91371322322092500654928225171492304811395845400276376022117681208712975765739, 174, 0, 178, 179, 357)
7 18933447562660498799151803539411934730546092368277561544957971138167446797703 (58387591255298048564603862915862446716810659550968633559759220573583656442311, 185, 0, 195, 167, 362)
#
#
#

Test runs show that:
1) Step 5 cannot be skipped as j2002ba2 suggested, because it will usually lead to an infinite loop, especially for small numbers. Try sending 12 as the input.
2) The algorithm has an internal bug leading to output results bigger than 256 bits. Try sending 10321180907436651991825056156721508500075068309065038805868074283166084193390 as the input. I've placed a fix: the result is reduced modulo p before being returned.
3) itersum shows that it's practically impossible to get more than 512 iterations; I feel 384 is the largest possible.
Keep in mind that the iter2 and iter3 loops are not stand-alone; they run inside the iterx loop.

In fact I am not sure it's a good idea to use itersum = iter2 + iter3 as the iteration indicator. Possibly it's more correct to pay attention only to iterx, or to the largest individual counter.


Code:
#code by hakabit supraspecially for gmaxwell
#algorithm source: http://cacr.uwaterloo.ca/hac/about/chap14.pdf, page 608, section "14.61 Algorithm Binary extended gcd algorithm"

import random
#print(random.getrandbits(256))

def ext_binary_gcd(x, y):

    g = 1

    iterx = 0
    iter1 = 0
    iter2 = 0
    iter3 = 0

    while (x % 2 == 0) and (y % 2 == 0) :    # yeah, funny cycle when one number is prime :-)

        iter1 = iter1 + 1

        x = x // 2
        y = y // 2
        g = 2 * g

    u = x
    v = y
    A = 1
    B = 0
    C = 0
    D = 1

    while (u != 0) :

        iterx = iterx + 1

        #step4
        while (u % 2 == 0) :

            iter2 = iter2 + 1

            u = u // 2

            if (A % 2 == 0) and (B % 2 == 0) :

                A = A // 2
                B = B // 2

            else :

                A = (A + y) // 2
                B = (B - x) // 2

        while (v % 2 == 0) :

            iter3 = iter3 + 1

            v = v // 2

            if (C % 2 == 0) and (D % 2 == 0) :

                C = C // 2
                D = D // 2

            else :

                C = (C + y) // 2
                D = (D - x) // 2

        if (u >= v) :

            u = u - v
            A = A - C
            B = B - D

        else :

            v = v - u
            C = C - A
            D = D - B

    #a = C
    #b = D

    pp = 2 ** 256 - 2 ** 32 - 977
    b = D % pp   #BLM-rig

    #return (a, b, g*v)
    return (b, iterx, iter1, iter2, iter3, iter2 + iter3)

p = 2**256 - 2**32 - 977

#a=12
#res=ext_binary_gcd(p,a)
#print ("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print ("in (out, iterbig, iter1, iter2, iter3): ",a,res)



print("test# in (out, iterx, iter1, iter2, iter3, itersum): ")
for i in range(1000):

    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p

    if (ii < 2):
        ii = ii + 2

    res = ext_binary_gcd(p, ii)
    print(i, ii, res)

#end
staff
Activity: 4284
Merit: 8808
760 cycle: 16,035,806 gates
765 cycle: 16,141,271 gates

this is close to 8.5 times more gates than the wrong estimate
This sounds more reasonable.
staff
Activity: 4284
Merit: 8808
In any case I found proofs of how to calculate bound limits for both the non-binary and the binary GCD.
At the same time I see that you wrote that "AFAIK no one knows how to compute a tight bound for this kind of algorithm, but useful upper limits exist which is all you need."
(BTW, if you are interested in the proof details, I will try to find Ishmukhametov's original articles and post them here. Just let me know.)

Calculations for 256-bit numbers show that "the worst case cannot be bigger than 350 iterations for non-binary GCD"
and that "the worst case should be no smaller than 512 iterations for binary GCD."
As a result it means that "it is guaranteed to get the result in no more than 512 iterations in the worst case (both for binary and for non-binary GCD)".
By a "tight bound" I mean the exact number of iterations which is enough for the worst input, but no longer.  Instead, what we can compute is a loose bound: a number which is large enough for any input but might be too large.

"worst-scenario should be no smaller than 512 iterations for binary GCD." might well be true, but "no smaller than" is not useful to you. For determining the circuit size you need to know that the worst case is no larger than some size.

As an aside, I found some random binary GCD from an internet search and it typically needed a bit over 1000 iterations for random inputs and the secp256k1 field prime. (I would have tried the one from j2002ba2's earlier post but it gave wrong results when I tested it, though perhaps I made an error while transliterating it.)

Quote
You are speaking about a GCD algorithm with 735 iterations. Why do we need 735 if 512 are enough? My answer: a lot of them are dummy (fake).
They deliberately (willfully) added dummy iterations generated by a quasi-random algorithm as a technique to fight side-channel attacks.
Nope that is just incorrect.

There aren't any dummy iterations except in the sense that on some numbers it takes fewer than the maximum so there will be some iterations that don't do anything useful-- but those iterations will still be there in the logic implementation. ... also in the sense that the known maximum bound isn't tight, even on the worst number it probably takes a few less.

The algorithm of the safegcd paper takes more iterations than some other constant time binary egcds because it is a least-significant bit algorithm:  each iteration only looks at and updates the least significant bits of their numbers.  In software with 32/64 bit numbers this ends up being a LOT faster than versions that have to update the whole 256 bit numbers at each step.

Implemented as raw logic I suspect a MSB binary egcd will be more efficient because you can shrink the numbers that it works on a bit at a time as their magnitude is guaranteed to be decreased by the algorithm.
full member
Activity: 206
Merit: 447
For the binary Euclidean gcd the maximum number of iterations is <= 760.

To achieve it, let u be all zeros except msb = 1. Then 6 runs for 255 cycles (the number of zero bits), and we end with u = 1. Next, 21 and 12 alternate, except for the few zero bits in p. 12 is executed 255 times, and 21 is executed 249 times. Finally u and v are both 1, and 18 is executed.

Another way to look at it: both 6 and 12 have to be executed at most 255 times each. This limits 18 or 21 to at most 255 executions. This gives <= 765 cycles.

An even A tends towards zero, an odd A tends towards p. After every execution of 18, u is even and 6 is executed at least once. The same holds for 21 and 12. Then A and C are between -2p and 2p (after 18/21): 256+1 bits plus 1 sign bit, for a total of 258 bits.


1-bit full-adder: 2*XOR + 2*AND + 1*OR
1-bit half-adder: 1*XOR + 1*AND
1-bit full-add-to-zero: 1*XOR + 1*AND
1-bit full-add-to-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-zero: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-one: 1*XNOR + 1*OR + 1*NOT
1-bit half-add-to-one: 1*NOT + wire
1-bit half-add-to-one-drop-lsb: wire

258-bit add to p drop lsb (A+p)/2: 249*full-add-to-one + 1*half-add-to-one-drop-lsb + 8*full-add-to-zero
  = 249*XNOR + 249*OR + 8*XOR + 8*AND
256-bit sub: 255*full-subtract + 1*half-subtract
  = 511*XOR + 511*AND + 255*OR + 511*NOT
258-bit sub: 258*full-subtract
  = 516*XOR + 516*AND + 258*OR + 516*NOT
256-bit mux: 768*NAND + 1*NOT
258-bit mux: 774*NAND + 1*NOT
256-bit compare ge: 256*XNOR + 256*AND + 256*NOT +   255*AND +   255*AND + 255*NOT + 255*AND
  = 256*XNOR + 1021*AND + 511*NOT

We have the following possible operations in the cycle:
1. 256-bit check for zero: 255*OR (+ 1*NOT; the NOT is omitted in the totals below)
2. 256-bit >=: 256*XNOR + 1021*AND + 511*NOT
3. 256-bit subtraction: 511*XOR + 511*AND + 255*OR + 511*NOT
4. 258-bit add p, div 2: 249*XNOR + 249*OR + 8*XOR + 8*AND
5. 258-bit subtraction: 516*XOR + 516*AND + 258*OR + 516*NOT
3,4,5 are done twice

Control:
u1 = not 1. & not u[0]              - 1*AND + 1*NOT
v1 = not 1. & u[0] & not v[0]       - 2*AND + 1*NOT
u2 = not 1. & u[0] & v[0] & 2.      - 3*AND
v2 = not 1. & u[0] & v[0] & not 2.  - 1*AND   (repeat from up)
u = not (u1 | u2)                   - 1*OR + 1*NOT
v = not (v1 | v2)                   - 1*OR + 1*NOT
A1 = u1 & not A[0]                  - 1*AND + 1*NOT
A2 = u1 & A[0]                      - 1*AND
A3 = u2
A = u
C1 = v1 & not C[0]                  - 1*AND + 1*NOT
C2 = v1 & C[0]                      - 1*AND
C3 = v2
C = v
= 11*AND + 2*OR + 6*NOT

One cycle:
1. 255*OR
2. 256*XNOR + 1021*AND + 511*NOT
3. 1022*XOR + 1022*AND + 510*OR + 1022*NOT
4. 498*XNOR + 498*OR + 16*XOR + 16*AND
5. 1032*XOR + 1032*AND + 516*OR + 1032*NOT
ctrl. 11*AND + 2*OR + 6*NOT
mux. 10800*NAND + 14*NOT
= 754*XNOR + 2070*XOR + 3102*AND + 1782*OR + 10800*NAND + 2585*NOT

760 cycles: 573040*XNOR + 1573200*XOR + 2357520*AND + 1354320*OR + 8208000*NAND + 1964600*NOT
765 cycles: 576810*XNOR + 1583550*XOR + 2373030*AND + 1363230*OR + 8262000*NAND + 1977525*NOT

---

C mod p:
-2p <= C <= 2p, hence two times add to p, and two times add to -p
256-bit add to p: 249*full-add-to-one + 1*half-add-to-one + 6*full-add-to-zero
  = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit add to -p: 248*full-add-to-zero + 1*half-add-to-one + 7*full-add-to-one:
  = 248*XOR + 248*AND + 1*NOT + 7*XNOR + 7*OR

selecting the output:
257-bit mux: 771*NAND + 1*NOT
256-bit mux: 768*NAND + 1*NOT

control the output:
omitted, <16 elements

= 512*XNOR + 508*XOR + 508*AND + 512*OR + 3078*NAND + 8*NOT

---

total:
760 cycle: 573552*XNOR + 1573708*XOR + 2358028*AND + 1354832*OR + 8211078*NAND + 1964608*NOT
765 cycle: 577322*XNOR + 1584058*XOR + 2373538*AND + 1363742*OR + 8265078*NAND + 1977533*NOT

760 cycle: 16,035,806 gates
765 cycle: 16,141,271 gates
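
Re-adding those per-cycle and output-stage totals (my own arithmetic check) gives the same two figures:

Code:
#re-adding the gate counts quoted above
cycle  = {'XNOR': 754, 'XOR': 2070, 'AND': 3102, 'OR': 1782, 'NAND': 10800, 'NOT': 2585}
output = {'XNOR': 512, 'XOR': 508,  'AND': 508,  'OR': 512,  'NAND': 3078,  'NOT': 8}

for cycles in (760, 765):
    total = sum(cycles * cycle[g] + output[g] for g in cycle)
    print(cycles, total)    # 760 -> 16035806, 765 -> 16141271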

this is close to 8.5 times more gates than the wrong estimate
jr. member
Activity: 42
Merit: 18
In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course a constant-time GCD algorithm can contain 735 or even more iterations; most of them are dummy (fake, useless).

In the case of an unrolled logic implementation, *all* candidates will be constant time.  Algorithms like binary gcds are made constant time by unrolling to their maximum number of operations.

If you allow memory and looping then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for the memory for a point, the input, a couple hundred bits of temporaries, and the program itself)

Quote
I mean (and should have written) that the binary version cannot be worse than the non-binary one.
That is how I understood what you wrote. But all of the binary gcd variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical gcd to reason about the iteration counts of some binary gcd.


Now I have to admit that I don't understand GCD algorithms as well as I thought. I need some time to analyse them further.

In any case I found proofs of how to calculate bound limits for both the non-binary and the binary GCD.
At the same time I see that you wrote that "AFAIK no one knows how to compute a tight bound for this kind of algorithm, but useful upper limits exist which is all you need."
(BTW, if you are interested in the proof details, I will try to find Ishmukhametov's original articles and post them here. Just let me know.)

Calculations for 256-bit numbers show that "the worst case cannot be bigger than 350 iterations for non-binary GCD"
and that "the worst case should be no smaller than 512 iterations for binary GCD."
As a result it means that "it is guaranteed to get the result in no more than 512 iterations in the worst case (both for binary and for non-binary GCD)".

You are speaking about a GCD algorithm with 735 iterations. Why do we need 735 if 512 are enough? My answer: a lot of them are dummy (fake).
They deliberately (willfully) added dummy iterations generated by a quasi-random algorithm as a technique to fight side-channel attacks.
Or they were unrolling loops, but unrolled them incorrectly, adding useless iterations without specifically intending to.

And what do you think?




P.S.
Of course, any algorithm can be computed using only one type of logic gate, NAND (or NOR), plus a lot of memory and muxes.

I would like to write a lot about unrolled logic and constant time, but I prefer to wait for your answer regarding the iteration overcounting.
staff
Activity: 4284
Merit: 8808
In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course a constant-time GCD algorithm can contain 735 or even more iterations; most of them are dummy (fake, useless).

In the case of an unrolled logic implementation, *all* candidates will be constant time.  Algorithms like binary gcds are made constant time by unrolling to their maximum number of operations.

If you allow memory and looping then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for the memory for a point, the input, a couple hundred bits of temporaries, and the program itself)

Quote
I mean (and should have written) that the binary version cannot be worse than the non-binary one.
That is how I understood what you wrote. But all of the binary gcd variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical gcd to reason about the iteration counts of some binary gcd.
jr. member
Activity: 42
Merit: 18
I'mmm sorrry.

Calculating 2/ln(k)*ln(A) I made a kid's mistake: instead of dividing 2 by ln(k), I multiplied.

So... 2/(0.7) * 177 = 506 max iterations.  I'm a little bit unobservant. Embarrassed Embarrassed

2/ln(2)*ln(2^256) = (2/0.6931471805599453)*177.445678223346 = 512.00000000000000923324826168937 (in exact arithmetic this is 2*256 = 512, since ln(2^256) = 256*ln(2))

So... 512 max iterations.

 Roll Eyes Roll Eyes Roll Eyes Roll Eyes Roll Eyes

But now I am a little bit puzzled: 390 max iterations for non-binary and 512 max iterations for binary.
I hope the larger iteration count is a trade-off that lowers the number of logic gates (operations).
jr. member
Activity: 42
Merit: 18
However, I was thinking that the worst-case EGCD (or BEGCD) scenario cannot be worse than for GCD.
No, that is entirely untrue, unfortunately.

Binary GCD is faster to use not because it makes fewer iterations but because there isn't a very expensive division in each iteration.


I am sorry, I messed up the abbreviations.
I mean (and should have written) that the binary version cannot be worse than the non-binary one.

I provided a proof that the non-binary version has at most 390 iterations. So I was trying to say that there is no need to process 735 iterations for the binary version.
But 390 was an upper (not lower) bound. Today I provided a proof for 248 max iterations (and as I understand it, that is a lower bound?!).

What about 735... in another post you provided a URL to a PDF on the cr.yp.to website. This PDF is named "Fast constant-time gcd computation and modular inversion".

In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course a constant-time GCD algorithm can contain 735 or even more iterations; most of them are dummy (fake, useless).
staff
Activity: 4284
Merit: 8808
However, I was thinking that the worst-case EGCD (or BEGCD) scenario cannot be worse than for GCD.
No, that is entirely untrue, unfortunately.

Binary GCD is faster to use not because it makes fewer iterations but because there isn't a very expensive division in each iteration.
jr. member
Activity: 42
Merit: 18
The algorithm you are discussing is a binary (extended) gcd, not the "original" Euclid's algorithm.  The original Euclid's has an arbitrary division (well a modular reduction, but same deal) in the inner loop.



Yes, I know the difference. Even for the same input, EGCD (or BEGCD) probably has a different number of iterations compared to GCD.
However, I was thinking that the worst-case EGCD (or BEGCD) scenario cannot be worse than for GCD.

Here
https://tuengr.com/V10/001.pdf
I found a formula (based on formulas proved by Ishmukhametov) for EGCD:
2/ln(k)*ln(A)
It gives 2*0.7*177 = 248 max iterations (for 256-bit numbers).



Also here
http://www.joppebos.com/files/CTInversion.pdf
https://link.springer.com/content/pdf/10.1007%2F3-540-36400-5_6.pdf
https://www.researchgate.net/publication/3044233_The_Montgomery_modular_inverse_-_Revisited
https://www.researchgate.net/publication/3387259_Improved_Montgomery_modular_inverse_algorithm

I found one more algorithm called the "Montgomery Inverse", with an additional phase to convert back from the Montgomery domain.
Proved(?!) formula: log2(a) <= iterations <= 2*log2(a)
It gives min 256 and max 512 iterations for each phase (for 256-bit numbers).

Each iteration contains 5 256-bit additions on signed numbers (a subtraction costs about the same as an addition).

However, on every second iteration the width of the adders can be lowered by 1 bit.

So in total we have 512 * 5 * 1/2 = 1280 additions for one inverse (muxes not included).

I think that's an ideal algorithm for the inversion if we are planning to compute secp256k1 using millions, not trillions, of logic operations.
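
For reference, here is a rough Python sketch (my own, written from my reading of Kaliski's almost-Montgomery-inverse; it is not code from the linked papers) showing the two phases: phase 1 performs between log2(p) and 2*log2(p) iterations and returns a^-1 * 2^k mod p, and phase 2 removes the 2^k factor with k halvings mod p.

Code:
#sketch of a Kaliski-style Montgomery inverse (my own reading, not from the papers)
import random

p = 2**256 - 2**32 - 977

def almost_mont_inverse(a, p):
    #phase 1: returns (r, k) with r = a^-1 * 2^k mod p and log2(p) <= k <= 2*log2(p)
    u, v, r, s, k = p, a, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    if r >= p:
        r -= p
    return p - r, k

def mont_inverse(a, p):
    #phase 2: divide by 2^k mod p to obtain the plain modular inverse
    r, k = almost_mont_inverse(a, p)
    for _ in range(k):
        r = r // 2 if r % 2 == 0 else (r + p) // 2
    return r

a = random.randrange(2, p)
assert (mont_inverse(a, p) * a) % p == 1
print("ok")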
staff
Activity: 4284
Merit: 8808
The algorithm you are discussing is a binary (extended) gcd, not the "original" Euclid's algorithm.  The original Euclid's has an arbitrary division (well a modular reduction, but same deal) in the inner loop.
jr. member
Activity: 42
Merit: 18
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

If p is odd - and it should be, because if it were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite factor gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary. At any rate, less than 1+p.

So I think we can put a ceiling for the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime >2].

The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?

But I found one more idea on https://en.wikipedia.org/wiki/Euclidean_algorithm#Worst-case

"Worst-case
If the Euclidean algorithm requires N steps for a pair of natural numbers a > b > 0, the smallest values of a and b for which this is true are the Fibonacci numbers FN+2 and FN+1, respectively.[95] More precisely, if the Euclidean algorithm requires N steps for the pair a > b, then one has a ≥ FN+2 and b ≥ FN+1. This can be shown by induction.[96] If N = 1, b divides a with no remainder; the smallest natural numbers for which this is true is b = 1 and a = 2, which are F2 and F3, respectively. Now assume that the result holds for all values of N up to M − 1. The first step of the M-step algorithm is a = q0b + r0, and the Euclidean algorithm requires M − 1 steps for the pair b > r0. By induction hypothesis, one has b ≥ FM+1 and r0 ≥ FM. Therefore, a = q0b + r0 ≥ b + r0 ≥ FM+1 + FM = FM+2, which is the desired inequality. This proof, published by Gabriel Lamé in 1844, represents the beginning of computational complexity theory,[97] and also the first practical application of the Fibonacci numbers.[95]

This result suffices to show that the number of steps in Euclid's algorithm can never be more than five times the number of its digits (base 10).[98] For if the algorithm requires N steps, then b is greater than or equal to FN+1 which in turn is greater than or equal to φN−1, where φ is the golden ratio. Since b ≥ φN−1, then N − 1 ≤ logφb. Since log10φ > 1/5, (N − 1)/5 < log10φ logφb = log10b. Thus, N ≤ 5 log10b. Thus, the Euclidean algorithm always needs less than O(h) divisions, where h is the number of digits in the smaller number b."



In secp256k1 we are working with the modulus p =
115792089237316195423570985008687907853269984665640564039457584007908834671663

and the 256-bit bound 2^256 =
115792089237316195423570985008687907853269984665640564039457584007913129639936

In any case these are the largest numbers involved; they both contain 78 digits.

So 78*5 = 390 maximum iterations.
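
As a cross-check (my own sketch), the Fibonacci statement quoted above can be applied directly instead of going through base-10 digits: any run of classical Euclid whose smaller input b is at most p can take at most N steps where F_(N+1) <= p, which gives a slightly tighter bound than 5*78 = 390.

Code:
#upper bound on classical-Euclid steps when the smaller input is below p,
#via Lame's theorem as quoted above (N steps require b >= F_(N+1))
p = 2**256 - 2**32 - 977

a, b = 1, 1     # F_1, F_2
n = 2           # b currently holds F_n
while b <= p:
    a, b = b, a + b
    n += 1
#now F_n > p >= F_(n-1), so the largest possible N satisfies N + 1 = n - 1
print(n - 2)    # 369 for this p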

staff
Activity: 4284
Merit: 8808
The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?
As I mentioned, it's not very simple.

https://gcd.cr.yp.to/safegcd-20190413.pdf  Covers doing this for a somewhat different GCD algorithm (a variant which is designed to work only on the LSBs of values which makes it much faster in software).

The bounds analysis in the paper is extremely complicated.

Pieter came up with a simpler to understand approach that also produces a somewhat tighter bound (for safegcd): The idea comes from the fact that the steps of the algorithm are all affine transformations on the inputs.  This means that if instead of feeding a value into a step of the algorithm you feed in a convex hull that captures all the possible inputs, the result is a convex hull that contains all the possible outputs of that step (and maybe a bit more). You can 'feed in' a hull by just sending its vertices through because it always performs an affine transformation.  You handle the branches at each iteration by executing all of them and taking the union of the resulting hulls. Keep running until the hull shrinks below 1 which means that every input would have finished by then. It's kind of like you're concurrently running the algorithm on all inputs at once. At least for safegcd this actually converges.

We know from testing with small inputs that the result is still somewhat conservative rather than exact.  AFAIK no one knows how to compute a tight bound for this kind of algorithm, but useful upper limits exist which is all you need.
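
To illustrate the "testing with small inputs" part, here is a rough sketch (mine, using my reading of the safegcd divstep definition) that brute-forces the exact worst-case divstep count for small bit sizes; a computed bound such as the hull method's can then be compared against these exact small-case maxima.

Code:
#brute-force worst-case divstep counts for small inputs (a sketch; the divstep
#transition is written from my reading of the safegcd paper)
def divsteps(f, g):
    #iterate divstep until g == 0; f must be odd
    delta, n = 1, 0
    while g != 0:
        if delta > 0 and g & 1:
            delta, f, g = 1 - delta, g, (g - f) // 2
        elif g & 1:
            delta, f, g = 1 + delta, f, (g + f) // 2
        else:
            delta, f, g = 1 + delta, f, g // 2
        n += 1
    return n

for bits in range(2, 10):
    worst = max(divsteps(f, g)
                for f in range(1, 1 << bits, 2)
                for g in range(0, 1 << bits))
    print(bits, worst)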

You could, of course, just use safegcd... though I think a MSB binary GCD will be more efficient implemented in raw logic.  I don't know a good cite for one off  the top of my head, there probably is a cite in the safegcd paper.  IIRC, GMP has a constant time GCD that implements one of the binary MSB algorithms.

Depending on the application it might be acceptable to even make it sometimes be incorrect. Smiley

E.g. if this were being used for mining you could pick a bound that is wrong less than 1:1000000 times based on experimental testing and that would probably be fine. Smiley


Quote
Now, You are saying that 256 iterations not enough, and we need 735.
No, I was saying a different related algorithm needed 735.  I don't know what the bound is on that particular one. I just know it's going to be greater than 256.   (if you look at the general structure though you should see that it shouldn't usually shave off more than one bit per iteration (only when the values are extremely close will it shave off more)-- but it's not guaranteed to even do that, so that should give you an intuition as to why 256 is far too low).

Quote
You see, Jacobian has 17mul

Well, don't use needlessly inefficient algorithms.  Cheesy  One of your inputs is always affine and the group law can be algebraically simplified for secp256k1.

https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h#L389

(for point*scalar you will never get an infinity in the calculation either.)
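
For what it's worth, here is a rough sketch (my own, using one textbook set of mixed Jacobian+affine addition formulas rather than the libsecp256k1 code linked above) that comes to 8 multiplications + 3 squarings = 11 field multiplications when a squaring is counted as a multiplication. With 2*V computed as a single addition, the subtraction count comes to six, which lines up with the 6*sub-mod-p quoted earlier (though I am only guessing at the exact formulas j2002ba2 used).

Code:
#mixed Jacobian + affine point addition, one textbook formula set (a sketch,
#not libsecp256k1's group law); 8 mul + 3 sqr = 11 field multiplications
P = 2**256 - 2**32 - 977

def madd(X1, Y1, Z1, x2, y2):
    #(X1, Y1, Z1) Jacobian, (x2, y2) affine; points must be distinct and finite
    Z1Z1 = Z1 * Z1 % P               # 1S
    U2   = x2 * Z1Z1 % P             # 1M
    S2   = y2 * Z1 % P * Z1Z1 % P    # 2M
    H    = (U2 - X1) % P
    R    = (S2 - Y1) % P
    HH   = H * H % P                 # 1S
    HHH  = H * HH % P                # 1M
    V    = X1 * HH % P               # 1M
    X3   = (R * R - HHH - 2 * V) % P          # 1S (the R*R term)
    Y3   = (R * (V - X3) - Y1 * HHH) % P      # 2M
    Z3   = Z1 * H % P                # 1M
    return X3, Y3, Z3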

For optimizing classical (non-modular) multiplications I've heard about the Karatsuba algorithm.

Sure, and toom-cook. Some useful overview in https://cr.yp.to/lineartime/multapps-20080515.pdf.

(squaring is faster than multiplying for pretty much the same reason that karatsuba is a speedup)

Montgomery helps when you have consecutive squaring or multiplying -- saving you from having to reduce at the expense of doubling the size of your numbers. But you need a conversion to/from Montgomery before performing addition/subtraction -- so I wouldn't expect Montgomery to be a win for point addition.

The secp256k1 field order is somewhat more efficient to modular reduce by than an arbitrary number.  There is a complicated implementation in libsecp256k1 though it's optimized for processors and doesn't make the right tradeoffs for dumb logic.
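
As a rough illustration of why that reduction is cheap (this is just the underlying identity, not the libsecp256k1 implementation): since 2^256 ≡ 2^32 + 977 (mod p), the high half of a 512-bit product can be folded into the low half with a small multiply, and after two such folds only a conditional subtraction remains.

Code:
#sketch: reducing a 512-bit product modulo p = 2^256 - 2^32 - 977
import random

P = 2**256 - 2**32 - 977
C = 2**32 + 977                  # 2^256 mod P

def reduce_512(n):
    hi, lo = n >> 256, n & (2**256 - 1)
    n = lo + hi * C              # now well under 2^290
    hi, lo = n >> 256, n & (2**256 - 1)
    n = lo + hi * C              # now under 2^256 + 2^67, at most one subtraction left
    while n >= P:
        n -= P
    return n

a, b = random.randrange(P), random.randrange(P)
assert reduce_512(a * b) == (a * b) % P
print("ok")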

As far as jacobian vs affine.  If you're able to operate on multiple points in parallel and share work in the inversion then affine will be faster for sure. If you cannot, then I remain sceptical that you're not miscounting something.
jr. member
Activity: 42
Merit: 18
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

If p is odd - and it should be, because if it were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite factor gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary. At any rate, less than 1+p.

So I think we can put a ceiling for the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime >2].

The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?

On https://math.stackexchange.com/questions/1483118/proving-the-number-of-iterations-in-the-euclidean-algorithm
I found a great post:


"In two steps we move from (m,n) to (m′=n,n′=mmodn) and then to (m′′=n′,n′′=m′modn′). Assume n≤2k. Then m′,n′≤2k and n′′≤min{m′,n′,|m′−n′|}. Henc n′′>2k−1 could only happen if both |m′−n′|>2k−1 and min{n′,m′}>2k−1, which implies max{m′,n′}>2k−1+2k−1=2k, contradiction. Hence after two steps n≤2k becomes n′′≤2k−1
Hence if n≤2k initially, then after 2k steps we arrive at a situation with n≤20=1. Since it takes one additional step to reach n=0 and terminate, and since initially we may have n>m, it seems we may need up to 2k+2 rounds.

Indeed, the desired bounds are not fully correct, as for example m=1, n=42 (so k=0) requires 2 instead of just 2⋅0=0 steps: (1,42)↦(42,1)↦(1,0).

Remark: A deeper analysis should exhibit that the behaviour is governed by powers of the golden ration ϕ≈1.618 rather than 2–√≈1.414."



In secp256k1 we are working with a field where k = 256, so the result should be a maximum of 2*256 + 2 = 514 iterations.

But the poster also speaks about the golden ratio, and I don't understand whether we need to multiply 514 by 1.618 (= 832) or divide it by 1.618 (= 316)?!

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.

If p is odd - and it should be, because if it were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite factor gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary. At any rate, less than 1+p.

So I think we can put a ceiling for the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime >2].

The maximum #iterations problem still stands though - maybe there's some research papers on the internet documenting this?