Looking at your EGCD code did help me to debug the first reference I attempted to use (this one: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers).
Here's the pseudocode from that link:
t := 0; newt := 1
r := n; newr := a
while newr ≠ 0 do
quotient := r div newr
(t, newt) := (newt, t − quotient × newt)
(r, newr) := (newr, r − quotient × newr)
if r > 1 then
return "a is not invertible"
if t < 0 then
t := t + n
return t
There's a pretty subtle issue with that code: with the initializers in that order (0, 1, n, a) you have to be careful not to feed it negative a's (which can come up when calculating slopes during point addition, for example). If you flip the initializers pairwise (1, 0, a, n) then it will work correctly for negative a's, too (saving you a modulo operation before using the function).
What do you have in mind? An assembly version of something like this: https://www-users.cse.umn.edu/~garrett/crypto/Code/FastPow_Python.html?
There's some low-hanging fruit there (which I'm sure you're aware of), like turning that divide into a shift and the evenness test into a bitwise and (which would only have to be applied to the low-order limb), but using private key inversion as an example: that algorithm would take 451 "steps", with about half (196) of the steps comprising 2 comparisons, 1 and, 1 multiplication, 1 modulo and 1 decrement, and the other half (255) of the steps comprising 2 comparisons, 1 square, 1 modulo and 1 shift. The extended Euclidean algorithm would take ~150 steps (on average), with each step comprising 1 comparison, 1 divide, 2 subtractions and 2 multiplies. I'm sure that both algorithms can still be improved upon, but here's the little script I put together to count the steps:
import random
secp256k1_group_order = 2**256 - 0x14551231950b75fc4402da1732fc9bebf
def invert_private_key_algorithm_a(scalar: int) -> Tuple[int, int]:
# reference: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Pseudocode
r0, r1 = scalar, secp256k1_group_order
s0, s1 = 1, 0
steps = 0
while r1 != 0:
quotient = r0 // r1
r0, r1 = r1, r0 - quotient * r1
s0, s1 = s1, s0 - quotient * s1
steps += 1
return s0 if s0 >= 0 else s0 + secp256k1_group_order, steps
def invert_private_key_algorithm_b(scalar: int) -> Tuple[int, int]:
# reference: https://www-users.cse.umn.edu/~garrett/crypto/Code/FastPow_Python.html
x, m, y = scalar, secp256k1_group_order - 2, 1
steps = 0
while m != 0:
if m & 1 != 0:
y = (x * y) % secp256k1_group_order
m -= 1
else:
x = (x * x) % secp256k1_group_order
m >>= 1
steps += 1
return y, steps
def main() -> None:
scalar = random.randrange(1, secp256k1_group_order)
inverted_a, steps_a = invert_private_key_algorithm_a(scalar)
inverted_b, steps_b = invert_private_key_algorithm_b(scalar)
assert inverted_a == inverted_b
print(f'Algorithm A took {steps_a} steps.')
print(f'Algorithm B took {steps_b} steps.')
if __name__ == '__main__':
main()
I think it's going to be tough to get an assembly version of algorithm B to beat an assembly version of algorithm A. I mean, not having to divide is nice, but you still have those modulo reductions to worry about, maybe they can be sped up with carefully-placed conditional subtractions?
I do like algorithm B from a security perspective, though; if I was trying to make a safe (constant-time) modular inverse, a fixed number of steps (with respect to the modulus) seems like a better starting point than the extended Euclidean algorithm.