I still can't understand how You optimizing gates count by combining bits.
The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.
1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.
For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.
Big thanks for detailed answer !
Yeah, I understand Your idea. When I was constructing my trillionized scheme I was estimating such grouping.
I got:
group by 1 = 256 pre-computations
group by 2 = about 32768 pre-computations
group by 3 = about 2.7mln pre-computation
after seen agressively exponential grow I understood that it's bad idea.
Now I understand that I've made a big combinatorical logic mistake.
After mistake detected(thanks to You), now I can say, of course such grouping is great idea, and yes, more than likely optimal grouping is by ~10bits.
I found "Algorithm 3.22". I see it's hybrid for affine+Jacobian. But I can't understand one thing.
Ok, let's pre-compute points grouping up to 11bits.
Then let's take privkey 1
11111111111111111 (2+16 leading ones) {skipped some bits} 1
00000000000000001 (16 leeding zeros)
How can be pubkey calculated using this algorithm without re-converts affine<->Jacobian ?!
(from other side, I can agree, that using privkey with more than 11 leading same bits is not good idea, but...)
Now what about iterations. Ouch... my test-runs shows about 384iterations. but... for such multi-conditional multi-loop algorith it's quite complex to count iterations.
My idea is that I need to place special triggers on each loop, then log results to array, then analyse array in detecting consequental and-or parallel triggers switching.
But I am not too good in Python or Ruby, so it will took a time to implement...
As result I suppose that there will be either 384 or 2*384=768 max iterations( if max input number = 2^256).
of, course in secp256k1 module is slightly smaller than 2^256, so there can be sligthly smaller than 768iters...
...but should be 512 !