I never got further than 2**50 steps. Somewhere around 2**45 it starts to slow down.
Each depth takes more than double the time of the previous one, duh.
Try to make SECPK1 6x (six times) faster first.
I think it's more like "stop using a secp256k1 batch group addition that runs 6x slower than what the hardware can accomplish".
Challenge accepted. Let's do this, shall we?
1. CPU: arithmetic is faster with 5 x 52-bit limbs instead of 4 x 64-bit limbs, because the partial products can be computed independently and in parallel (SIMD), instead of each limb depending on the previous one's result (carry-flag propagation, a performance killer on any CPU).
(this alone gets you a 50% speedup at least)
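A minimal sketch of the 5x52 idea, assuming GCC/Clang's unsigned __int128 (the reduction mod p is omitted, so this is not a full field multiplication):

```
/* 5x52 sketch: field elements live in 5 limbs of 52 bits inside 64-bit
   words, so every limb has 12 spare bits. Partial products can then be
   summed into 128-bit column accumulators without touching the carry
   flag after every add -- the compiler/CPU is free to reorder and
   vectorize them. With 4x64 limbs, every add feeds the next through
   the carry flag and everything serializes. */

#include <stdint.h>

typedef unsigned __int128 uint128_t;

typedef struct { uint64_t n[5]; } fe52;   /* 5 x 52-bit limbs */

#define M52 0xFFFFFFFFFFFFFULL            /* 52-bit mask */

/* Multiply a*b into 9 wide column sums; no carry chain between columns. */
static void fe52_mul_columns(const fe52 *a, const fe52 *b, uint128_t c[9]) {
    for (int k = 0; k < 9; k++) c[k] = 0;
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            /* each product is at most 104 bits; up to 5 of them fit in
               a 128-bit column sum with room to spare */
            c[i + j] += (uint128_t)a->n[i] * b->n[j];
    /* a real implementation now folds the columns mod p using the special
       form of the secp256k1 prime, and only then normalizes the carries */
}
```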
NOW... how many tools that blatantly copied each other's "fast" code assumed "this is the fastest, I won't bother"?
2. There's a trick to also compute the batched inverse faster (same idea: get rid of the inter-dependency between partial products).
(speedup is around 5%)
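For reference, the baseline here is Montgomery's batch-inversion trick; fe_mul and fe_inv below are assumed field primitives (hypothetical names, not from any particular library). The extra few percent come from breaking the serial prefix-product chain, as noted in the comments:

```
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t n[5]; } fe;     /* field element, 5x52 limbs */

/* assumed primitives -- hypothetical names */
void fe_mul(fe *r, const fe *a, const fe *b);
void fe_inv(fe *r, const fe *a);          /* the one expensive inversion */

/* Invert x[0..n-1] in place with a single fe_inv call. */
void fe_batch_inv(fe *x, fe *scratch, size_t n) {
    if (n == 0) return;

    /* forward pass: scratch[i] = x[0]*x[1]*...*x[i] -- note this chain is
       fully serial; to break the dependency, run two (or more) independent
       chains over halves of the batch, multiply their running products
       together at the end, and still pay for only one fe_inv */
    scratch[0] = x[0];
    for (size_t i = 1; i < n; i++) fe_mul(&scratch[i], &scratch[i-1], &x[i]);

    fe acc; fe_inv(&acc, &scratch[n-1]);  /* acc = (x0*...*x_{n-1})^-1 */

    /* backward pass: peel off one factor at a time */
    for (size_t i = n - 1; i > 0; i--) {
        fe tmp; fe_mul(&tmp, &acc, &scratch[i-1]);  /* = x[i]^-1       */
        fe_mul(&acc, &acc, &x[i]);                  /* drop x[i] from acc */
        x[i] = tmp;
    }
    x[0] = acc;
}
```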
There's some secp256k1 ASM code out there that uses 4 x 64-bit limbs; it's slower than using no ASM at all with 5 x 52-bit limbs and letting the compiler handle the SIMD instruction packing.
Refs for the above: Bernstein (some 2009 PDF), and "Modern Computer Arithmetic".
3. GPU: this is way more problematic. Let's start with some facts:
- GPU global and local (stack) memory is very, very slow to access; the fastest code does as little loading and storing as possible;
- all threads in a warp execute the exact same instruction at every cycle (divergent conditional branches become masked no-ops);
- all instructions are actually executed by 32-bit arithmetic units (64-bit integer ops are emulated; see the snippet after this list);
- you have a small amount (48 KB) of fast on-chip shared memory, accessible to all threads in a block / SM;
- you have a very small amount (about 1 KB per thread) of registers, the fastest storage there is.
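You can check the 64-bit fact yourself: compile the snippet below with nvcc, dump the SASS with cuobjdump -sass, and look at what the single 64-bit multiply turns into.

```
// The one 64-bit multiply below compiles to a sequence of 32-bit
// multiply-add instructions in the generated SASS -- there is no
// 64-bit integer multiplier in the hardware.
__global__ void mul64(unsigned long long *r,
                      const unsigned long long *a,
                      const unsigned long long *b) {
    unsigned int i = threadIdx.x;
    r[i] = a[i] * b[i];   // one C statement, several 32-bit mads in SASS
}
```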
Now, if we do the math based on the GPU specs (clock frequency, how many cycles a fused multiply-add takes, and so on), we arrive at some very impressive numbers, for example how many secp256k1 field multiplications per second (modular reduction included and all) the GPU is capable of; see the back-of-envelope below. Those numbers tell us: "hey, why the hell can we do billions of modular mul/s, but when we run JLP's Kangaroo we get 10% of the theoretical peak?"
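Here's that back-of-envelope, with made-up but plausible specs; substitute your own card's numbers:

```
#include <stdio.h>

int main(void) {
    /* hypothetical specs: 80 SMs, 64 32-bit integer multiply-add
       units per SM, 1.5 GHz clock */
    double mads_per_sec = 80.0 * 64.0 * 1.5e9;        /* ~7.7e12 mad/s */
    /* a 256x256-bit schoolbook multiply is 8x8 = 64 32-bit mads;
       the reduction mod p roughly doubles that: ~128 mads per field mul */
    double field_muls_per_sec = mads_per_sec / 128.0; /* ~6.0e10 */
    printf("theoretical peak: %.1e field mul/s\n", field_muls_per_sec);
    return 0;
}
```

Tens of billions of modular muls per second; a kernel delivering a tenth of that is leaving most of the hardware idle.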
Now... looking at JLP's Kangaroo, which you call "the fastest", what do we see?
- kangaroos get copied between global memory and local memory (stack); but when the stack gets too large (which it does, because the code wants a large "GPU group size" to, in theory, speed up the batched inversion), this just amounts to copying from global memory to global memory, doubling both the loads/stores and the required memory size, and slowing down the runtime overall.
Someone might say: this is done for coalescing reasons, etc. But why the hell is that the GPU kernel's concern, and not the host's?
- on-chip shared memory is completely ignored and left to act as L1 cache (but what is it caching? one-time memory reads? oh no...);
- do any of the PTX-based routines take notice that the GPU has a 32-bit multiply-and-add HW instruction? Answer: no, everything is a 64-bit pointer... (see the PTX sketch after this list);
- the kernel logic for the batched addition is convoluted and over-modularized to the point that it makes compiler optimizations close to impossible, increasing the register pressure;
- at the highest level, a lot of things could be strategically thought out better. Things like "X items were lost because you have no idea what arguments to use, bro" should not exist. The GPU should act as a continuous DP generator, with no lunch breaks between kernel launches; separate the concerns of producing DPs and consuming DPs properly (a sketch of this is at the end of the post).
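Regarding the 32-bit multiply-and-add point above: PTX exposes multiply-add with carry directly. A sketch of one inner step of a multi-limb multiplication, accumulating a 32x32->64-bit product into a two-limb accumulator plus a carry counter, in three PTX instructions:

```
// (hi:lo) += a*b, carry += overflow -- the building block of a
// multi-limb multiply, mapped straight onto the 32-bit mad hardware.
__device__ __forceinline__ void mac32(unsigned int a, unsigned int b,
                                      unsigned int &lo, unsigned int &hi,
                                      unsigned int &carry) {
    asm("mad.lo.cc.u32  %0, %3, %4, %0;\n\t"   // lo += low32(a*b), set CC
        "madc.hi.cc.u32 %1, %3, %4, %1;\n\t"   // hi += high32(a*b) + CC
        "addc.u32       %2, %2, 0;"            // carry += CC
        : "+r"(lo), "+r"(hi), "+r"(carry)
        : "r"(a), "r"(b));
}
```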
These are just some of the ideas. There are more, which I won't mention, since they're much too technical.
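And to make the "continuous DP generator" point concrete, a sketch of a double-buffered producer/consumer, where the GPU fills one batch while the host drains the other. kangaroo_kernel, consume, and the DP layout are hypothetical stand-ins; only the structure matters:

```
#include <cuda_runtime.h>
#include <cstddef>

struct DP { unsigned long long x[4], d[2]; };   // DP x-coord + distance

__global__ void kangaroo_kernel(DP *out, unsigned int *count, size_t cap) {
    /* hypothetical: hop kangaroos, append found DPs via atomicAdd(count,1) */
}

static void consume(const DP *batch, unsigned int n) {
    /* hypothetical: insert into the hash table, check tame/wild collisions */
}

int main() {
    const size_t CAP = 1 << 20;
    DP *dbuf[2], *hbuf[2];
    unsigned int *dcnt[2], hcnt[2] = {0, 0};
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc(&dbuf[i], CAP * sizeof(DP));
        cudaMalloc(&dcnt[i], sizeof(unsigned int));
        cudaMallocHost(&hbuf[i], CAP * sizeof(DP));  // pinned for async copy
    }
    for (int cur = 0;; cur ^= 1) {
        // queue the next production batch into buffer `cur`...
        cudaMemsetAsync(dcnt[cur], 0, sizeof(unsigned int), s[cur]);
        kangaroo_kernel<<<256, 256, 0, s[cur]>>>(dbuf[cur], dcnt[cur], CAP);
        cudaMemcpyAsync(&hcnt[cur], dcnt[cur], sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, s[cur]);
        cudaMemcpyAsync(hbuf[cur], dbuf[cur], CAP * sizeof(DP),
                        cudaMemcpyDeviceToHost, s[cur]);
        // ...and meanwhile drain what the *other* stream delivered last round
        // (a real pipeline would copy only *dcnt entries, not the capacity)
        cudaStreamSynchronize(s[cur ^ 1]);
        consume(hbuf[cur ^ 1], hcnt[cur ^ 1]);
    }
}
```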