So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?
With some additional limitations:
only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used.
only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.
A minimal example:
there might be mistakes256-bit adder: 255*(2*XOR + 2*AND +1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT
add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex
= 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex
= 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT
Looks like inversion is less costly than mul-mod-p. (There might be more efficient mul-mod-p method though.)
We'll use affine coordinates then.
In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p
= 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT
256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT
some missing logic for the leading zero bits: less than 4K
missing inversion logic: less than 1M
total: 0x5f421900= 1,598,167,296 gates.I see we are getting almost identical results for 1round, but using slightly different terminology.
My version is "So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM".
Your version is "In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p"
SUBM = sub-mod-p
MULM = mul-mod-p
My POWM means that I am using POWM as of Euler's theorem for modular inversion calculations.
Your "inversion" leads to "Euclidean Binary extended gcd algorithm".
Now what about differences:
1) As I see for SUMM(add-mod-p) I provided: 2000 NOT, 3000 AND, 7000 XOR logic gates.
You provide: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT (in HEX)
Or 1024*XOR + 1536*AND + 769*OR + 1*NOT(in DEC)
May be I have memory faults and there was 200 NOT, 300 AND, 700 XOR logic gates.
In any case I need to make fresh research to count how many gates need to be used for 256bits modular full-adder.
2) As I understand I forgot to include multiplexors as switchers to pre-computated points.
3) In "14.61 Algorithm Binary extended gcd algorithm" step 6. I see "If u ≥ v then". This means we need to use 256bits comparator. As I understand You forgot to include it.
By the way, 256bits comparator can consume even more logical gates than 256bits full adder. (at this moment I am doing some research).
That's because adder uses only 1bit(1wire) for carry-out and carry-in,
while comparator uses either aggregated 2bit(2wires) or segregated 3bits(3wires) quasi-carry for carry-in and carry-out.
4) In "14.61 Algorithm Binary extended gcd algorithm" step 7. I see " otherwise, go to step 4.".
This means that algorithm is recursive. How many maximum recursions there can be ?! I suppose maximum 256.
This means we need to make dummy cycle with 256 iterations,
each iteration should contain all(!) "14.61 Algorithm Binary extended gcd algorithm" calculations except step 2. and step 5. (and also exclude step 1. 3. oops)
I feel after including 256iterations the result will be that Euclidean algorithm will consume almost the same amount of logic gates as for algorithm using Euler's theorem.
It's interesting note, that v=y=1. Yes, v=1, actually. What about y=1 it's not obvious for me from first sight, but I feel You are correct.
And here are some more comments:
1) "14.61 Algorithm Binary extended gcd algorithm" on input uses unsigned integers, but all futher operations are with signed integers. Operations with signed integers consumes 2-3 times more logic gates than operations with unsigned integers.
2) On step 4.2 I see "If A ≡ B ≡ 0 (mod 2) then". If A>0 and B>0 then it is very light operation which equal to "if A and B are both even" - we just need to verify if last bit=0.
But... there can be situations when A<0 and/or B<0. Is it enough to verify that last bit=1 for negative integers ?! At this moment I am not sure. I need some time to think.
3) You are saying "If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers."
Do You mean that privkey not always looks like
11
011111
01111
011 {skipped some bits} 11111
011111
0111
But also can looks like
1
001111
00111
0011 {skipped some bits} 11111
00111
00111 (reason for grouping powers of G by 2)
Or like
1
00011
00011
00011 {skipped some bits} 1111
00011
000111 (reason for grouping powers of G by 3)
If I understand Your idea correctly, then my opinion is that such grouping has many pros and cons.
I don't know if it less costly. I think it's needed to make many-many test-iterations to measure pros vs. cons.
4) I don't understand Your idea about "Combining several bits" and table provided with "0: 26 PA + 1024*26 mux = 182,767,728".
Do You mean that there will be stand-alone 10bits full-adder, with 10bits+10bits input and 10bits output, so we can seqeuntially passing to it A and B, and getting back result C=A+B ?
If yes, then it is against limitation "only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.".
You see, such trick will lower number of gates, but not lower number of logic operations, also such trick will make estimation of logic operations much harder.
Or do you mean logic gates with not 2input but with 10inputs ? Then it is against limitation "only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used."
Using not 2input gates make estimation of logic operations a little bit harder.
Or do You mean smth different ?!
Waiting Your answers and comments to proceed to final estimations.