A faster modular multiplication implementation for the P-192 curve.
Granted this is the wrong curve, but it has the same general formula y2 = x3+ax+b, so all that theoretically has to be done is to extend the summations to a few extra terms (they'll just be 32 words instead of 24, which means a 32x32 word multiplication size instead of 24x24) and accommodate for the missing ax.
E.g. where it defines range shifted representation (RSR) of a large number, which it uses in the algorithm:
X = ∑ i=0,i=23 xiWi−12 = x0W−12 + x1W−11 + ⋯ + x23W11,
Y = ∑i=0,i=23 yiWi−12 = y0W−12 + y1W−11 + ⋯ + y23W11,
Z = X ⋅ Y = ∑i=0,i=47 ziWi−24 = z0W−24 + z1W−23 + ⋯ + z47W23
We can just replace it with this:
X = ∑ i=0,i=31 xiWi−16 = x0W−16 + x1W−15 + ⋯ + x31W15,
Y = ∑i=0,i=31 yiWi−16 = y0W−16 + y1W−15 + ⋯ + y31W15,
Z = X ⋅ Y = ∑i=0,i=63 ziWi−32 = z0W−32 + z1W−31 + ⋯ + z63W31
(x, y, z summation terms are all between 0 and 255 inclusive)
Then the rest of the modular multiplication follows suit.
They say it's 26% faster than the usual ECC multiplication algorithm. I wonder if this representation also handles modular addition well. If it does, then we got ourselves a faster way to do modular arithmetic.
The study was carried out on some 8-bit embedded machine. Modern pipelines stuff 64 bits in a register so we may even be able to reduce the number of RSR words by a factor of 8 i.e. only 2 x and y words and 4 z words (and ovbiously the maximum range of each word increases proportionally).