Author

Topic: [Code] Modular multiplicative inverses (Python 3.8+) (Read 213 times)

hero member
Activity: 510
Merit: 4005
Have you seen this thread by any chance?
Yep. I thought about posting in that thread, but I figured this was interesting/useful enough to deserve its own topic. I was pleasantly surprised to learn that pow(x, -1, m) works as you would expect it to, so I'm guessing that other Python programmers will appreciate the heads-up.

Looking at your EGCD code did help me to debug the first reference I attempted to use (this one: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Modular_integers).

Here's the pseudocode from that link:

Code:
function inverse(a, n)
    t := 0;     newt := 1
    r := n;     newr := a

    while newr ≠ 0 do
        quotient := r div newr
        (t, newt) := (newt, t − quotient × newt)
        (r, newr) := (newr, r − quotient × newr)

    if r > 1 then
        return "a is not invertible"
    if t < 0 then
        t := t + n

    return t

There's a pretty subtle issue with that code: with the initializers in that order (0, 1, n, a) you have to be careful not to feed it negative a's (which can come up when calculating slopes during point addition, for example). If you flip the initializers pairwise (1, 0, a, n) then it will work correctly for negative a's, too (saving you a modulo operation before using the function).

I actually think that with a fast mod and mul functions, you can make a very quick modinverse function in assembly using Fermat's little theorem.
You mean method B? I'm not enough of a math head to know if it's correct to refer to that approach by Fermat's name or Euler's, so I just went with what Wikipedia calls it: https://en.wikipedia.org/wiki/Modular_multiplicative_inverse#Using_Euler's_theorem.

What do you have in mind? An assembly version of something like this: https://www-users.cse.umn.edu/~garrett/crypto/Code/FastPow_Python.html?

There's some low-hanging fruit there (which I'm sure you're aware of), like turning that divide into a shift and the evenness test into a bitwise and (which would only have to be applied to the low-order limb), but using private key inversion as an example: that algorithm would take 451 "steps", with about half (196) of the steps comprising 2 comparisons, 1 and, 1 multiplication, 1 modulo and 1 decrement, and the other half (255) of the steps comprising 2 comparisons, 1 square, 1 modulo and 1 shift. The extended Euclidean algorithm would take ~150 steps (on average), with each step comprising 1 comparison, 1 divide, 2 subtractions and 2 multiplies. I'm sure that both algorithms can still be improved upon, but here's the little script I put together to count the steps:

Code:
from typing import Tuple

import random

secp256k1_group_order = 2**256 - 0x14551231950b75fc4402da1732fc9bebf

def invert_private_key_algorithm_a(scalar: int) -> Tuple[int, int]:

    # reference: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Pseudocode

    r0, r1 = scalar, secp256k1_group_order

    s0, s1 = 1, 0

    steps = 0

    while r1 != 0:

        quotient = r0 // r1

        r0, r1 = r1, r0 - quotient * r1

        s0, s1 = s1, s0 - quotient * s1

        steps += 1

    return s0 if s0 >= 0 else s0 + secp256k1_group_order, steps

def invert_private_key_algorithm_b(scalar: int) -> Tuple[int, int]:

    # reference: https://www-users.cse.umn.edu/~garrett/crypto/Code/FastPow_Python.html

    x, m, y = scalar, secp256k1_group_order - 2, 1

    steps = 0

    while m != 0:

        if m & 1 != 0:

            y = (x * y) % secp256k1_group_order

            m -= 1

        else:

            x = (x * x) % secp256k1_group_order

            m >>= 1

        steps += 1

    return y, steps

def main() -> None:

    scalar = random.randrange(1, secp256k1_group_order)

    inverted_a, steps_a = invert_private_key_algorithm_a(scalar)

    inverted_b, steps_b = invert_private_key_algorithm_b(scalar)

    assert inverted_a == inverted_b

    print(f'Algorithm A took {steps_a} steps.')

    print(f'Algorithm B took {steps_b} steps.')

if __name__ == '__main__':

    main()

I think it's going to be tough to get an assembly version of algorithm B to beat an assembly version of algorithm A. I mean, not having to divide is nice, but you still have those modulo reductions to worry about, maybe they can be sped up with carefully-placed conditional subtractions?

I do like algorithm B from a security perspective, though; if I was trying to make a safe (constant-time) modular inverse, a fixed number of steps (with respect to the modulus) seems like a better starting point than the extended Euclidean algorithm.
legendary
Activity: 2464
Merit: 4419
🔐BitcoinMessage.Tools🔑
powmod from gmpy2 library works faster than in-built pow, tests confirm that:

Code:

from time import perf_counter
import gmpy2

import numpy as np


def modinv1(a, n):
    # reference: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Pseudocode
    r0, r1 = a, n
    s0, s1 = 1, 0
    while r1 != 0:
        quotient = r0 // r1
        r0, r1 = r1, r0 - quotient * r1
        s0, s1 = s1, s0 - quotient * s1
    if r0 != 1:
        raise ValueError('x is not invertible for the given modulus')
    return s0 if s0 >= 0 else s0 + n


def modinv2(a, n):
    return pow(a, -1, n)  # requires Python 3.8+


def modinv3(a, n):
    return gmpy2.powmod(a, -1, n)
    

p = 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1
a = 2*32670510020758816978083085130507043184471273380659243275938904335757337482424
start_time = perf_counter()
for i in range(100000):
    modinv1(a, p)
print('Result: ', modinv1(a, p))
end_time1 = perf_counter() - start_time
print('Euclidian: ', end_time1)

start_time = perf_counter()
for i in range(100000):
    modinv2(a, p)
print('Result: ', modinv2(a, p))
end_time2 = perf_counter() - start_time
print(end_time2)

start_time = perf_counter()
for i in range(100000):
    modinv3(a, p)
print('Result: ', modinv3(a, p))
end_time3 = perf_counter() - start_time
print(end_time3)


print(
    f"""pow 3.8 is {end_time1 / end_time2:.2f} times faster than Euclidian algorithm
    and {end_time2 / end_time3:.2f} times slower than powmod from gmpy2 library
    """
)

Output:

Code:
Result:  83174505189910067536517124096019359197644205712500122884473429251812128958118
Euclidian:  9.370973937000599
Result:  83174505189910067536517124096019359197644205712500122884473429251812128958118
4.19700585999999
Result:  83174505189910067536517124096019359197644205712500122884473429251812128958118
0.39307602500048233
pow 3.8 is 2.23 times faster than Euclidian algorithm
    and 10.68 times slower than powmod from gmpy2 library

As you can see, powmod function is roughly 10 times(!) faster than python in-built pow, which makes it a more reliable way to calculate modulo inverse. However, if you are unconcerned about the speed, pow works just fine.

More information can be found here: https://stackoverflow.com/questions/67664402/python-speed-up-powbase-exp-mod-for-fixed-exp-and-mod-or-with-vectorization
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Have you seen this thread by any chance?

I actually think that with a fast mod and mul functions, you can make a very quick modinverse function in assembly using Fermat's little theorem.
legendary
Activity: 2464
Merit: 4419
🔐BitcoinMessage.Tools🔑
Method C (exponentiation with -1)

Code:
def invert_c(x: int, m: int) -> int:
    return pow(x, -1, m)
What a neat method to calculate modulo inverse, I wasn't aware of it. Thank you for sharing it!
I did some tests and can confirm that it roughly 2.2 times faster than pure python implementation of Euclidean algorithm:
Code:
from time import perf_counter


def modinv1(a, n):
    return pow(a, -1, n)  # requires Python 3.8+


def modinv2(a, n):
    # reference: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Pseudocode
    r0, r1 = a, n
    s0, s1 = 1, 0
    while r1 != 0:
        quotient = r0 // r1
        r0, r1 = r1, r0 - quotient * r1
        s0, s1 = s1, s0 - quotient * s1
    if r0 != 1:
        raise ValueError('x is not invertible for the given modulus')
    return s0 if s0 >= 0 else s0 + n


p = 2**256 - 2**32 - 2**9 - 2**8 - 2**7 - 2**6 - 2**4 - 1
a = 2*32670510020758816978083085130507043184471273380659243275938904335757337482424

start_time = perf_counter()
for i in range(100000):
    modinv2(a, p)
end_time1 = perf_counter() - start_time
print(end_time1)

start_time = perf_counter()
for i in range(100000):
    modinv1(a, p)
end_time2 = perf_counter() - start_time
print(end_time2)
print(
    f"pow 3.8 is {end_time1 / end_time2:.2f} faster than Euclidean algorithm")
My old laptop gives me the following results:

9.056113444999937
4.038740186000268
pow 3.8 is 2.24 faster than Euclidean algorithm

As far as I understand from your post, Extended Euclidean algorithm remains the fastest and most optimal way to calculate modulo inverse, which is why it is used in a new pow function (but as we all aware, implementation in C language runs faster than than in pure Python).
hero member
Activity: 510
Merit: 4005
I've been playing around with generating Bitcoin addresses from scratch in Python (will post about that soon) and ran into something interesting to do with calculating modular multiplicative inverses. Until recently, I'd only bumped into two ways of doing that, either using the extended Euclidean algorithm, or using Euler's theorem:

Method A (extended Euclidean algorithm)

Code:
def invert_a(x: int, m: int) -> int:
    # reference: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm#Pseudocode
    r0, r1 = x, m
    s0, s1 = 1, 0
    while r1 != 0:
        quotient = r0 // r1
        r0, r1 = r1, r0 - quotient * r1
        s0, s1 = s1, s0 - quotient * s1
    if r0 != 1:
        raise ValueError('x is not invertible for the given modulus')
    return s0 if s0 >= 0 else s0 + m

Method B (exponentiation with m-2, Euler's theorem)

Code:
def invert_b(x: int, m: int) -> int:
    # reference: https://en.wikipedia.org/wiki/Modular_multiplicative_inverse#Using_Euler's_theorem
    return pow(x, m-2, m)

Since Python 3.8, you can pass negative exponents into the pow function, leading to two more ways:

Method C (exponentiation with -1)

Code:
def invert_c(x: int, m: int) -> int:
    return pow(x, -1, m)

Method D (exponentiation with -m)

Code:
def invert_d(x: int, m: int) -> int:
    return pow(x, -m, m)

Python is not really meant for performance, but I figured I'd benchmark all four methods out of curiosity. After testing that they all produce identical answers (for x >= 1 and x < m, with the moduli that I care about: p and n from secp256k1), I inverted 100,000 private keys 10 separate times with each method and then took the mean:

+---------------+-------+
| CPython 3.9.2 | Speed |
+----+----------+-------+
     | Method A | 3.85x | |||||||||||||||||||||||
     +----------+-------+
     | Method B | 1.15x | |||||||
     +----------+-------+
     | Method C | 7.80x | |||||||||||||||||||||||||||||||||||||||||||||||
     +----------+-------+
     | Method D | 1.00x | ||||||
     +----------+-------+


I was surprised that method A held up so well, considering that the other methods consist only of a single call to pow and therefore do almost all of their work in native code.

Why is method C the fastest? Peeking at the relevant CPython code (inside the implementation of pow: here), it looks like it's handling negative exponents with an internal implementation of the extended Euclidean algorithm (this one: here). I was puzzled at first by method D being so much slower than method C (I expected it to be slower, just not 7.8x) but then realized that although it does make use of the same internal special-casing that makes method C so fast, it then follows that with an expensive (especially for big values of m) modular exponentiation that isn't special-cased. To be clear, method C (internally) goes the same way (the pow function inverts the base and negates the exponent before proceeding), but with 1 as the exponent the rest of the calculation terminates quickly.

If you're on Python 3.8+, use method C, it's both the fastest and the cleanest (IMHO). If you're stuck on an older version of Python and need more performance, prefer method A over method B. I haven't been able to think of a good use case for method D.

Note that none of the above methods are safe to use with respect to side-channel attacks; they all leak information via small argument-dependent differences in execution time and/or power consumption. That's a very academic threat for most users though, so don't let that stop you from learning/experimenting. If you're writing a professional wallet, then you'll have to step up to more advanced techniques.
Jump to: