#120 was pretty easy, just 20 pcs of 4090s for six weeks.
Let's say you needed 1.36*sqrt(n) ops; that would mean
1.36*sqrt(2**119)/(14*86400)/20
op/s per GPU, which is close to 46 Gk/s.
Now, a 4090 can do at most 82 Tflop/s of fp32.
That's a budget of about 1782 float32 ops per jump.
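Spelled out as a quick sanity check in plain C (same numbers as above; the two-week figure comes straight from the formula, the six-week correction is in the LE further down):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Expected kangaroo work: ~1.36 * sqrt(2^119) group operations. */
    double total_ops = 1.36 * sqrt(ldexp(1.0, 119));
    double seconds   = 14.0 * 86400.0;              /* two weeks (see the LE) */
    double per_gpu   = total_ops / seconds / 20.0;  /* jumps/s per 4090       */
    double peak_fp32 = 82e12;                       /* 4090 peak fp32 rate    */

    /* Prints ~45.8 Gk/s and ~1790 ops; rounding the rate to 46 Gk/s
       gives the ~1782 quoted above. */
    printf("per-GPU rate: %.1f Gk/s\n", per_gpu / 1e9);
    printf("budget per jump: %.0f float32 ops\n", peak_fp32 / per_gpu);
    return 0;
}
```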
For a jump you'd need around 6 256-bit muls and 6 additions (assuming the inversions are batched and amortized down to negligible; quick sketch of that trick below).
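That amortization is the usual Montgomery batch-inversion trick: one real field inversion shared across a whole batch of kangaroos, at the price of roughly 3 extra multiplications per element. A toy sketch over a 31-bit prime (a stand-in for the real 256-bit field, just to show the shape and the op count):

```c
#include <stdint.h>
#include <stdio.h>

#define P 2147483647u   /* toy 31-bit prime standing in for the real field */
#define BATCH 4

static uint32_t mulmod(uint32_t a, uint32_t b) { return (uint32_t)((uint64_t)a * b % P); }

/* One real inversion via Fermat: a^(P-2) mod P. */
static uint32_t invmod(uint32_t a) {
    uint32_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = mulmod(r, a);
        a = mulmod(a, a);
        e >>= 1;
    }
    return r;
}

/* Montgomery batch inversion: n inverses for 3*(n-1) multiplications plus a
   single real inversion, so per element the inversion cost shrinks to ~3
   extra muls once the batch is large. Assumes n <= BATCH. */
static void batch_inverse(const uint32_t *a, uint32_t *inv, int n) {
    uint32_t prefix[BATCH];
    prefix[0] = a[0];
    for (int i = 1; i < n; i++) prefix[i] = mulmod(prefix[i - 1], a[i]);

    uint32_t acc = invmod(prefix[n - 1]);      /* the one shared inversion */
    for (int i = n - 1; i > 0; i--) {
        inv[i] = mulmod(acc, prefix[i - 1]);   /* = 1/a[i]                 */
        acc    = mulmod(acc, a[i]);            /* drop a[i] from the accumulator */
    }
    inv[0] = acc;
}

int main(void) {
    uint32_t a[BATCH] = {12345, 67890, 424242, 1000003}, inv[BATCH];
    batch_inverse(a, inv, BATCH);
    for (int i = 0; i < BATCH; i++)
        printf("a=%u  a*inv(a) mod P = %u\n", a[i], mulmod(a[i], inv[i])); /* all 1 */
    return 0;
}
```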
To do a 256-bit mod mul you'd need around 8 int32 * 8 int32 = 64 32-bit multiplications and ~56 32-bit additions = ~120 operations just for the first step, and around another 18 muls plus a few adds to reduce mod N.
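For reference, that first step in plain C (the same work a CUDA thread would do): an operand-scanning schoolbook multiply over 8 x 32-bit limbs, which is exactly 64 32x32->64 multiplications; how many of the ~56 extra additions survive as separate instructions depends on how the carry chain is scheduled (the madc-style multiply-add-with-carry instructions on NVIDIA GPUs absorb most of them).

```c
#include <stdint.h>
#include <stdio.h>

/* 256-bit value as 8 little-endian 32-bit limbs. */
typedef uint32_t u256[8];

static int n_mul32; /* counts the 32x32->64 multiplications */

/* Schoolbook 256x256 -> 512-bit multiply: 8*8 = 64 limb products.
   The 32-bit additions ride along in the carry chain here. */
static void mul256(const u256 a, const u256 b, uint32_t r[16]) {
    for (int k = 0; k < 16; k++) r[k] = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 8; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
            n_mul32++;
        }
        r[i + 8] = (uint32_t)carry;
    }
}

int main(void) {
    /* Sanity check: 2^255 * 3 = 2^256 + 2^255. */
    u256 a = {0, 0, 0, 0, 0, 0, 0, 0x80000000u};   /* 2^255 */
    u256 b = {3};
    uint32_t r[16];
    mul256(a, b, r);
    printf("32-bit muls: %d\n", n_mul32);           /* 64                 */
    printf("r[7]=%08x r[8]=%08x\n", r[7], r[8]);    /* 80000000 00000001  */
    return 0;
}
```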
For a single 256-bit addition you'd need around 8 operations (16 if you also reduce mod N).
That adds up to around 170 ops per 256-bit mod mul and 16 per mod add.
Estimated cost per jump: 6 * 170 + 6 * 16 = 1116 ops.
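And the cheap part, for completeness: a 256-bit add is just 8 add-with-carry steps (roughly double that for the conditional subtract that makes it modular), which together with the counts above gives the 1116 figure:

```c
#include <stdint.h>
#include <stdio.h>

/* 256-bit add across 8 limbs: 8 add-with-carry steps; a modular add then
   needs a compare/conditional-subtract of N, roughly doubling it to ~16. */
static uint32_t add256(const uint32_t a[8], const uint32_t b[8], uint32_t r[8]) {
    uint64_t carry = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)t;
        carry = t >> 32;
    }
    return (uint32_t)carry;     /* carry out of bit 255 */
}

int main(void) {
    /* Per-jump tally with the rough per-op costs from above. */
    int cost_mulmod = 170, cost_addmod = 16;
    int per_jump = 6 * cost_mulmod + 6 * cost_addmod;
    printf("per-jump estimate: %d 32-bit ops (budget ~1782)\n", per_jump); /* 1116 */

    /* Quick carry-propagation check: 0xffffffff + 1. */
    uint32_t a[8] = {0xffffffffu}, b[8] = {1}, r[8];
    uint32_t c = add256(a, b, r);
    printf("add256 check: r[0]=%u r[1]=%u carry=%u\n", r[0], r[1], c); /* 0 1 0 */
    return 0;
}
```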
Theoretically feasible, since 1116 < 1782, but are you some kind of genius or WTF? And I didn't even count the rest of the overhead work required.
LE: ok, I f**ed up. I just noticed you said six weeks, not the two I plugged into the formula above. Still, that ends up at ~15 Gk/s per GPU, which is crazy fast.
---
LLE: Damn, you made me go back to the drawing board, just trying to think through what you might have used to get to such speeds. For some of the explanations I came up with there are literally just a couple of results on the entire web, but if those work as advertised, I may get a 3x speedup as well, just by freeing up a lot of registers that currently hold useless information. There's no other way I can see, since I'm already at 100% compute throughput in the CUDA profiler and I'm nowhere near your speed. Great, now I also have a headache.