In the multiplication function of the libsecp256k1 repository at Github, there is a stranger constant that the carry is multiplied with in various steps. It is referenced as the constant R and has the value of 0x1000003d10.
Everything else is more or less explanatory. But what is this constant supposed to mean? Remainder?
Example:
https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_asm_impl.hIt is pretty simple.
Imagine you want to multiply 2 numbers mod 97 (100 -
3) and you have only 2 digits in memory:
for example 23*49 mod 97 (note that 97 = 100 -
3)
23*49 = 1127
how to compute efficiently now 1127 mod 97 ?
the idea is: 1127 = 11*100 + 27
a) 1127 mod 100 = 27 (the lowest 2 digits of the multiplication, remember, you don't have space for 4 digits)
b) then multiply the highest digit 11 *
3 = 33
c) result = 33 + 27 = 60
explanation --> 1127 = 11*100 + 27 = 11*(97 + 3) + 27 = 11*97 + 11*3 + 27 = 11*3 + 27 mod 97
Same thing with a*b mod p
p = 2*256 -
Ra*b = (highest 256 bits of a*b) * 2^256 + lowest 256 bits of a*b
a) lowest 256 bits = (a*b) mod 2^256
b) then compute (highest 256 bits)*
Rc) result = (highest 256 bits)*
R + (a*b) mod 2^256
explanation --> result = a*b = highest * 2^256 + lowest = highest * (2^256 - R) + highest *
R + lowest = (highest *
R + lowest) mod p
EDIT :
R is used for fast reduction of the result of the multiplication, from 512 bits to 256 bits, because
R is small compared to 2^256
(2^256) mod p << 4 is equal to (2^256-p) * 16 (or 2^4, or 4^2). Maybe it could be used for some kind of reduction, I don't know.
Sorry to interrupt, but I found an article kind of explaining the
Mathematics behind secp256k1_fe_mul_inner Maybe it will be useful in terms of clarifying some unclear points (I'm not sure, because to me these formulas look like ancient hieroglyphs).
Those were excellent, thank you very much.
So let's put all this together, converting to 5x52 legs for brevity:
a = a0 + a1 × 2^52 + a2 × 2^104 + a3 ^ 2^156 + a4 × 2^208
b = b0 + b1 × 2^52 + b2 × 2^104 + a3 ^ 2^156 + b4 × 2^208
And the terms we need to sum are:
t0 = a0*b0
t1 = a0*b1 + a1*b0 + t0's carry
t2 = a0*b2 + a1*b1 + a2*b0 + t1's carry
t3 = a0*b3 + a1*b2 + a2*b1 + a3*b0 + t2's carry
t4 = a0*b4 + a1*b3 + a2*b2 + a3*b1 + a0*b4 + t3's carry
t5 = a1*b4 + a2*b3 + a3*b2 + a4*b1 + t4's carry
t6 = a2*b4 + a3*b3 + a4*b2 + t5's carry
t7 = a3*b4 + a3*b4 + t6's carry
t8 = a4*b4 + t7's carry
t9 = t8's carry
So the multiplication is:
a*b = t0 + t1 * 2^52 + t2 * 2^104 + t3 * 2^156 + t4 * 2^208 + t5 * 2^260 (pay attention to this term, it will morph into a bitshift and appear at the end of this post) + t6 * 2^312 + t7 * 2^364 + t8 * 2^416 + t9 * 2^468
And since 2^256 ≡ 0x1000003D1 (mod p):
a*b = t0 + t1 * 2^52 + t2 * 2^104 + t3 * 2^156 + t4 * 2^208 + t5 * 2^4 * 0x1000003D1 + t6 * 2^56 * 0x1000003D1 + t7 * 2^108 * 0x1000003D1 + t8 * 2^160 * 0x1000003D1 + t9 * 2^212 * 0x1000003D1
Note that this is equal to:
a*b = t0 + t1 * 2^52 + t2 * 2^104 + t3 * 2^156 + t4 * 2^208 + t5 * 2^4 * 0x1000003D1 + t6 * 2^52 * 2^4 * 0x1000003D1 + t7 * 2^104 * 2^4 * 0x1000003D1 + t8 * 2^156 * 2^4 * 0x1000003D1 + t9 * 2^208 * 2^4 * 0x1000003D1
Or in other words:
a*b = t0 + t1 * 2^52 + t2 * 2^104 + t3 * 2^156 + t4 * 2^208 + t5 * (0x1000003D1 << 4) + t6 * 2^52 * (0x1000003D1 << 4) + t7 * 2^104 * (0x1000003D1 << 4) + t8 * 2^156 * (0x1000003D1 << 4) + t9 * 2^208 * (0x1000003D1 << 4)
(0x1000003D1 << 4) is a simplification of the expression where we eliminate a 2^4 from the higher terms. It is cyclical and now I see why the devs chose to use 52-bit and 26-bit legs.
But the code only references t0 to t4 (t3 and t4 to be specific) so what happen to those higher terms? Well, lets simplify some more, taking advantage of the fact that we have just converted their bases from higher (and non-accessible) limbs to lower limbs:
a*b = (t0 + t5 * (0x1000003D1 << 4)) + (t1 + t6 * (0x1000003D1 << 4)) * 2^52 + (t2 + t7 * (0x1000003D1 << 4)) * 2^104 + (t3 + t8 * (0x1000003D1 << 4)) * 2^156 + (t4 + t9 * (0x1000003D1 << 4)) * 2^208
So it looks like they only coded t3 and t4 to avoid repeated multiply-adds
The t's can now be expanded:
a*b =
(
a0*b0 + a1*b4 + a2*b3 + a3*b2 + a4*b1 + t4's carry) * (0x1000003D1 << 4))
+ ((
a0*b1 + a1*b0 + t0's carry + a2*b4 + a3*b3 + a4*b2 + t5's carry) * (0x1000003D1 << 4)) * 2^52
+ ((
a0*b2 + a1*b1 + a2*b0 + t1's carry + a3*b4 + a3*b4 + t6's carry) * (0x1000003D1 << 4)) * 2^104
+ ((
a0*b3 + a1*b2 + a2*b1 + a3*b0 + t2's carry + a4*b4 + t7's carry) * (0x1000003D1 << 4)) * 2^156
+ ((
a0*b4 + a1*b3 + a2*b2 + a3*b1 + a0*b4 + t3's carry + t8's carry) * (0x1000003D1 << 4)) * 2^208
I use bold to differentiate between the first and second t's in each limb.
Now this should clear up two things:
1) Why R is set to 0x1000003D1 << 4 or equivalently 0x1000003D10 (which I will just call R4) and why it's multiplied so much in the calculations
2) By the same extension, why this method returns all limbs multiplied by R. As for the last limb, I suspect that the multiplication by R is entirely redundant and lost to the modulus, otherwise they would've performed it.
As for why Python & GMP returns an answer not multiplied by R, I can now explain it simply: the answer multiplied by R4 is only valid in modular arithmetic, but not in normal arithmetic.
I suspect (and this is merely speculation, because I haven't tried it yet - but who knows
) that the result with 0xa*R4, 0x8*R4, 0x6*R4, 0x4*R4, 0x2*R4, when modulo'd by the curve characteristic, will give the corresponding number without the multiplied Rs. Lets see:
0xa*R4, 0x8*R4, 0x6*R4, 0x4*R, 0x2*R4
= 0xa*(2^256-p)^4, 0x8*(2^256-p)^4, 0x6*(2^256-p)^4, 0x4*(2^256-p)^4, 0x2*(2^256-p)^4
= 0xa*2^260 - 0xa*p^4, 0x8*2^260 - 0x8*p^4, 0x6*2^260 - 0x6*p^4, 0x4*2^260 - 0x4*p^4, 0x2*2^260 - 0x2*p^4
= 0xa*2^260, 0x8*2^260, 0x6*2^260, 0x4*2^260, 0x2*2^260 (eliminate modulus of all terms multiplied by p)
So we are left with just 0xa*2^260, 0x8*2^260, 0x6*2^260, 0x4*2^260, 0x2*2^260 which has got 256 + 4 bits shifted to the left. So we're stuck here, right? Wrong. I can explain this by noting that 2^52 (and 2^26) are factors of 2^260, so those powers will be replaced with a multiplication by 1. (Specifically, 2^260 is greater than the place value of each leg. If it were merely multiplied by 2^52 for example, it would have to be multiplied with the second-lowest leg.)
So we have just proved that the
52/26-bit leg, secp256k1 modular multiplication* is equivalent to the original GMP and Python multiplication. But you absolutely have to reduce the product to the smaller form, otherwise, all the libraries doing non-52/26 leg modular arithmetic will be confused.
*This only applies to 52 and 26 bit legs (and 104, 208 and so on as well as 13, even) and no others.
One drawback I discovered in the process of making these posts is that these hand-tuned libsecp256k1 multiplications and squares will only work with public keys, since the modulo p (characteristic) is explicit. In particular,
they will give the wrong answer for private key multiplication. So I guess Bitcoin Core is not like BitCrack and never has to multiply private keys together. I wasted a lot of time a few weeks ago trying to implement key shifting for a friend using libsecp256k1 without realizing this, and then I wondered why the multiplication was all wrong.
The good news, the constant R and R4 can be replaced by their counterparts for 2^256 - n (curve order), thanks to that insight by arulbero.
I might do a follow-up post mapping each part of the code to the respective part of the equation, it is really fascinating.
Thanks guys for making this discovery possible.