Can anyone explain why a sloppy modular multiplication is used? The 3rd step is missing, otherwise there exists a probability of an incorrect result. Or has it been proven that only 2 steps are sufficient (I doubt it)?
I could brag that I managed to write, from scratch, a completely working Kangaroo that currently runs 6x (six times) faster than both the original and all the n00b clones out there. I'm not gonna sell it or release it because, for one, I don't need to prove anything to anyone except myself, and secondly, it's really OK if no one believes me that I can squeeze out 900 million jumps/s on an underclocked RTX 3050 laptop GPU that never reaches more than 200 Mops/s with JLP's app. BTW, the stats in JLP's program are biased; the real speed is slower than the one printed on the screen (there's some nonsense speed-smoothing computation in there, and the computed time durations are bugged between the worker threads and the main thread). Real speed is 10% slower due to these bugs. Whatever.
It's impressive that you believe you've developed a version of the Kangaroo that outperforms both the original and all the clones by such a significant margin. Achieving 900 million jumps per second on an underclocked RTX 3050 laptop GPU is indeed a remarkable accomplishment!
Achieving a 6x speed improvement is no small feat, especially considering the limitations you’ve highlighted in the original program. If you truly have a solution that performs this well, sharing it on GitHub isn't just about proving anything to anyone. It's about fostering innovation, collaboration, and growth within the community. By making your work public, you can inspire others, drive progress, and leave a lasting impact on the field. Plus, being part of a collaborative effort can lead to new insights and improvements that benefit everyone.
So, why not contribute to the community and share your impressive work?
This coming from a guy who confuses me with some digaran dude. I came across this puzzle in January this year while searching for some tools to help me recover a wallet.dat password from 2016.
It's not about "believing", and I truly don't care on making an impression. I already indicated multiple times what the bottlenecks are:
- use of GPU global memory, caused by per-thread local stacks, by the chosen GPU group size, and by several places where a stack isn't needed at all. This is the #1 slowdown (see the sketch below).
- parts of the algorithm can be merged together, and other parts are really just textbook methods, not optimized for CUDA hardware and the parallel paradigm.
Using registers only, a smaller group size, actually using the fast on-chip shared memory for inter-thread calculations, and thinking harder about what the algorithm implies when run on a massive number of threads do make an implementation run 6x faster. This is not simply a set of optimizations on top of JLP's Kangaroo; I am talking about a completely from-scratch, well-thought-out system here. It's very nice that this software was used to break the 115-bit puzzle, but I will again have my suspicions that it was used for 120 or 125. It is just not up to that level of competence, from a design perspective.
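To illustrate the stack point, here is a hedged sketch (not JLP's code; kernel names, the 128-thread group size and the toy arithmetic are all mine): a per-thread array indexed in a data-dependent way typically gets placed by the compiler in "local" memory, which physically lives in device DRAM, while the rewritten version keeps the running value in a register and routes the only inter-thread step through on-chip shared memory.
[code]
#include <cstdio>
#include <cuda_runtime.h>

#define GROUP 128   // a smaller thread-group size, as argued above

// Per-thread stack with data-dependent indexing: the compiler typically has
// to place the array in "local" memory, which physically lives in device DRAM.
__global__ void jumps_with_stack(const unsigned long long *in,
                                 unsigned long long *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    unsigned long long v = in[tid], stack[64];
    int sp = 0;
    for (int i = 0; i < 64; ++i)
        if ((v >> i) & 1) stack[sp++] = v * (i + 1);   // push: off-chip traffic
    unsigned long long acc = 0;
    while (sp > 0) acc ^= stack[--sp];                  // pop:  off-chip traffic
    out[tid] = acc;
}

// Same per-thread work with no stack at all: the accumulator stays in a
// register, and the example cross-thread step goes through shared memory.
__global__ void jumps_no_stack(const unsigned long long *in,
                               unsigned long long *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    unsigned long long v = in[tid], acc = 0;
    for (int i = 0; i < 64; ++i)
        if ((v >> i) & 1) acc ^= v * (i + 1);           // XOR order doesn't matter

    __shared__ unsigned long long tile[GROUP];          // on-chip, per block
    tile[threadIdx.x] = acc;
    __syncthreads();
    out[tid] = acc ^ tile[threadIdx.x ^ 1];             // combine with a neighbour,
                                                        // no DRAM round trip
}

int main()
{
    const int n = 1 << 20;   // multiple of GROUP, so no thread exits before __syncthreads()
    unsigned long long *in, *out;
    cudaMallocManaged(&in,  n * sizeof *in);
    cudaMallocManaged(&out, n * sizeof *out);
    for (int i = 0; i < n; ++i) in[i] = 0x9E3779B97F4A7C15ULL * (i + 1);

    jumps_with_stack<<<n / GROUP, GROUP>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("with stack : out[123] = %llu\n", out[123]);

    jumps_no_stack<<<n / GROUP, GROUP>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("no stack   : out[123] = %llu\n", out[123]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
[/code]
Compiling with -Xptxas -v should show the first kernel needing per-thread local/stack storage while the second needs none; that difference is what I mean by staying on-chip.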
Can anyone explain why a sloppy modular multiplication is used? The 3rd step is missing, otherwise there exists a probability of an incorrect result. Or has it been proven that only 2 steps are sufficient (I doubt it)?
Since no one answered, I will assume it's taken for granted that the multiplication used by JLP is correct (2 folding reduction steps). However, according to page 15 of this research paper:
https://eprint.iacr.org/2021/1151.pdf under the analysis of "Sloppy reduction", it is proven that unless a correctness check is performed, a 3rd reduction step is needed.
Otherwise, the modular reduction has a chance (around one in 2^193) of producing an incorrect result, because:
Thus, if sloppy reduction is applied to a randomly chosen number in the interval [0, R² − 1], the probability of an incorrect result is about r(r − 1) / (2N)
There is a potential carry after the 2nd reduction that is simply assumed to never happen, and no one can explain why that assumption is safe.
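To make the spot concrete, here is a hedged, scaled-down sketch (my own toy parameters and names, not JLP's code): the modulus is shrunk from 2^256 − 2^32 − 977 to a 64-bit value of the same shape, so the leftover carry after the second fold actually shows up often instead of roughly once in 2^193 products; the folding structure itself is the same.
[code]
#include <cstdint>
#include <cstdio>
#include <random>

typedef unsigned __int128 u128;

// Toy modulus p = 2^64 - C with the same shape as secp256k1's
// P = 2^256 - 2^32 - 977, shrunk so the leftover carry is easy to observe.
static const uint64_t C = (1ULL << 32) + 977;
static const uint64_t P = (uint64_t)0 - C;       // wraps around to 2^64 - C

// One folding step: hi*2^64 + lo  ≡  lo + hi*C  (mod p)
static u128 fold(u128 x)
{
    return (u128)(uint64_t)x + (u128)(uint64_t)(x >> 64) * C;
}

int main()
{
    std::mt19937_64 rng(1);
    long long carries = 0, trials = 1000000;

    for (long long i = 0; i < trials; ++i) {
        uint64_t a = rng() % P, b = rng() % P;

        u128 x2 = fold(fold((u128)a * b));       // the "sloppy" 2-fold reduction
        if (x2 >> 64) ++carries;                 // carry left over: a fixed-width
                                                 // implementation would drop it here

        u128 x3 = fold(x2);                      // 3rd fold handles the carry
        while (x3 >= P) x3 -= P;                 // final conditional subtraction
        if ((uint64_t)x3 != (uint64_t)(((u128)a * b) % P)) {
            printf("mismatch\n");
            return 1;
        }
    }
    printf("carry still present after 2 folds: %lld of %lld products\n",
           carries, trials);
    return 0;
}
[/code]
So either the dropped carry has to be proven impossible for the operands the code actually feeds in, or a 3rd fold / final check like the one above is required.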