
Topic: Solving ECDLP with Kangaroos: Part 1 + 2 + RCKangaroo

newbie
Activity: 19
Merit: 0
v2.0 (Windows/Linux):
https://github.com/RetiredC/RCKangaroo

- added support for 30xx, 20xx and 1xxx cards.
- some minor changes.

Speed:
4090 - 7.9GKeys/s.
3090 - 4.1GKeys/s.
2080Ti - 2.9GKeys/s.

Please report speeds for other cards; for old cards the speedup is up to 40%.


Thank you so much for your incredible contribution to this tool!

I believe it’s already fantastic, but there’s just one feature that could make it truly perfect: the addition of an --end parameter.

Here’s an example to illustrate what I mean:

Puzzle 135: 4000000000000000000000000000000000:7fffffffffffffffffffffffffffffffff

I’d like to try my luck and search within a specific range, such as:
--start 5200000000000000000000000000000000
--end 5affffffffffffffffffffffffffffffff

The inclusion of an --end parameter would be a game-changer for scenarios like this, so that the program exits once it has completely searched the given range.

Thanks in advance for your consideration.
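(Side note, assuming I read the current CLI correctly: RCKangaroo takes a -start offset plus a -range size in bits, so an end bound is already implied as start + 2^range - 1. An explicit --end would mainly add convenience for windows whose size is not a power of two, like the example above, whose width is 9 * 16^32.)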
newbie
Activity: 19
Merit: 0

Please report speeds for other cards; for old cards the speedup is up to 40%.
Speed: 3070 - 2310 MKeys/s
Well done! Can I ask you a question about kangaroos in private messages?
newbie
Activity: 22
Merit: 1
Please report speeds for other cards; for old cards the speedup is up to 40%.

NVIDIA GeForce RTX 3060, 11.76 GB, 28 CUs, cap 8.6, PCI 40, L2 size: 2304 KB
Speed: 1.5 GKeys/s

NVIDIA GeForce GTX 1660 SUPER, 5.80 GB, 22 CUs, cap 7.5, PCI 39, L2 size: 1536 KB
Speed: 1.1 GKeys/s

2x GTX 1660 SUPER + 1x RTX 3060 (-gpu "012")
Speed: 3.7 GKeys/s

22 minutes to solve an 85-bit key.

Thanks, man!
jr. member
Activity: 65
Merit: 1
v2.0 (Windows/Linux):
https://github.com/RetiredC/RCKangaroo

- added support for 30xx, 20xx and 1xxx cards.
- some minor changes.

Speed:
4090 - 7.9GKeys/s.
3090 - 4.1GKeys/s.
2080Ti - 2.9GKeys/s.

Please report speeds for other cards; for old cards the speedup is up to 40%.

I'd like to request that CPU support be added. We are doing some studies and research, and using only a GPU is not enough; sometimes we need to run on the CPU alone. Please add CPU support.
newbie
Activity: 7
Merit: 0
Please report speeds for other cards; for old cards the speedup is up to 40%.

NVIDIA GeForce GTX 1660 Ti (Laptop): 975 MKeys/s
sr. member
Activity: 652
Merit: 316
Please report speeds for other cards; for old cards the speedup is up to 40%.
Thanks!
1660 Super: 930-948 MKeys/s
member
Activity: 165
Merit: 26
If X1, Y1, X2, Y2, Z, and the jump distance are all in registers, you get maximum speed. You can never get faster than that. But you can only do it with a very small number of kangaroos, like up to 6 or 7 kangaroos, depending on how well you use the registers.

The next wall is when L1 + shared memory gets used. This is the next-highest possible speed, but with a lower speed per kangaroo than above. You can add maybe one more kangaroo this way, because this cache is really small (128 KB per SM).

The third wall is using the L2 cache: a much lower speed per kangaroo, even though the overall throughput is greater. This scales up well only if L2 is really big.

This is why: when the L2 cache is small, loading and storing X1/Y1 in and out of L2 before and after each jump is much too slow, because they will actually spill into device global memory. So, if L2 is small, the only logical (and faster) option is to load X1/Y1 once before all the jumps and store them back once after all the jumps (see the sketch below).

Adding cycle handling in this case reduces the number of X1/Y1 states you can keep resident, which means fewer kangaroos and lower speed.

I have about seven different kernel versions that test these strategies for when and how data is loaded into the different memory levels, so I'm pretty sure about the differences between each strategy.
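To make the first wall concrete, here is a minimal CUDA sketch (not RCKangaroo's actual kernel, nor the code of the post above) of the "load once, jump many times, store once" strategy. KANGS_PER_THREAD, JUMPS_PER_LAUNCH, the u256 layout, and the placeholder add are all my own assumptions; the real modular point arithmetic is elided.

Code:
// Wall #1 sketch: kangaroo state lives entirely in registers, loaded from
// global memory once before all jumps and stored back once after.
#include <cuda_runtime.h>

constexpr int KANGS_PER_THREAD = 2;    // registers run out around 6-7 per thread
constexpr int JUMPS_PER_LAUNCH = 1024;
constexpr int JMP_TABLE        = 64;

struct u256 { unsigned long long w[4]; };  // 256-bit coordinate, 4 x 64-bit limbs

__constant__ u256 d_jumpX[JMP_TABLE];      // small jump table in constant cache

// Placeholder for the real modular point addition: a 256-bit add with carry.
__device__ __forceinline__ void add_mod_p(u256 &a, const u256 &b) {
    unsigned long long carry = 0;
#pragma unroll
    for (int i = 0; i < 4; i++) {
        unsigned long long s = a.w[i] + carry;
        unsigned long long c = (s < carry) ? 1ull : 0ull;
        s += b.w[i];
        c |= (s < b.w[i]) ? 1ull : 0ull;
        a.w[i] = s;
        carry = c;
    }
}

__global__ void jump_kernel(u256 *g_X, int n_kangs) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * KANGS_PER_THREAD;
    if (base + KANGS_PER_THREAD > n_kangs) return;

    // 1) load X into registers ONCE (fully unrolled -> compile-time indices,
    //    so the compiler can keep every limb in a register)
    u256 X[KANGS_PER_THREAD];
#pragma unroll
    for (int k = 0; k < KANGS_PER_THREAD; k++) X[k] = g_X[base + k];

    // 2) all jumps run out of registers: no L2/global traffic per jump
    for (int j = 0; j < JUMPS_PER_LAUNCH; j++) {
#pragma unroll
        for (int k = 0; k < KANGS_PER_THREAD; k++) {
            int idx = (int)(X[k].w[0] & (JMP_TABLE - 1));  // jump selection
            add_mod_p(X[k], d_jumpX[idx]);
        }
    }

    // 3) store state back ONCE, after all jumps
#pragma unroll
    for (int k = 0; k < KANGS_PER_THREAD; k++) g_X[base + k] = X[k];
}

Because X is indexed only with compile-time-constant (unrolled) indices, the compiler can keep it entirely in registers; raise KANGS_PER_THREAD too far and it spills to local memory, which is exactly the wall described above.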
?
Activity: -
Merit: -
The comparison between SOTA and 3-kangaroo you made is only relevant if the jumping device (RTX 4090 in your case) allows the tradeoffs you mention, so that the tradeoffs actually end up producing a more efficient solver.
Let me give you an example.
Take the RTX 3050, or other cards that have a small L2 cache.
Your RCKangaroo is much slower than a non-cycling jumper, like 50% slower, because the overhead of handling the cycles is clearly visible (there is not enough L2 to hide the compute cost). It is even worse when the kangaroos are stored in global memory (since L2 is too small to cache them).
So the winner in these cases is a normal (optimized) 3-kangaroo algorithm.
But because the RTX 4090 has around 72 MB of L2 cache, the lower speed is hidden: it is offset by the fast L2. In other words, the raw speed becomes irrelevant, because now the speed is bounded by the memory latency of L2, not by the raw computing cores.
In this case you can do pretty much whatever you want in the kernel and go crazy with whatever algorithm you like, such as your SOTA method, because it won't affect the speed too much.
What I want to say is: you can't really know whether future GPUs will offer the same tradeoff, so there is no guarantee that the compute resources of a CUDA device won't have a greater influence on speed than the latency and size of the L2 cache.

You are wrong; it's not related to L2. The loop-handling slowdown is similar for new and old cards. But I won't argue.
I will release a new version with a 20-35% speedup for old cards soon. It's not heavily optimized, but still faster than the current one.
member
Activity: 165
Merit: 26
The comparison between SOTA and 3-kangaroo you made is only relevant if the jumping device (RTX 4090 in your case) allows the tradeoffs you mention, so that the tradeoffs actually end up producing a more efficient solver.

Let me give you an example.

Take the RTX 3050, or other cards that have a small L2 cache.

Your RCKangaroo is much slower than a non-cycling jumper, like 50% slower, because the overhead of handling the cycles is clearly visible (there is not enough L2 to hide the compute cost). It is even worse when the kangaroos are stored in global memory (since L2 is too small to cache them).

So the winner in these cases is a normal (optimized) 3-kangaroo algorithm.

But because the RTX 4090 has around 72 MB of L2 cache, the lower speed is hidden: it is offset by the fast L2. In other words, the raw speed becomes irrelevant, because now the speed is bounded by the memory latency of L2, not by the raw computing cores.

In this case you can do pretty much whatever you want in the kernel and go crazy with whatever algorithm you like, such as your SOTA method, because it won't affect the speed too much.

What I want to say is: you can't really know whether future GPUs will offer the same tradeoff, so there is no guarantee that the compute resources of a CUDA device won't have a greater influence on speed than the latency and size of the L2 cache.
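For what it's worth, the small-L2 vs big-L2 split above can at least be detected at runtime instead of assumed per card. A minimal host-side sketch; the 16 MB threshold is an arbitrary assumption of mine, not anything RCKangaroo does:

Code:
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0, l2_bytes = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, dev);
    printf("L2 cache: %d KB\n", l2_bytes / 1024);

    if (l2_bytes >= 16 * 1024 * 1024) {
        // big-L2 path (4090-class): many kangaroos, per-jump load/store via L2
        puts("strategy: L2-resident kangaroo state");
    } else {
        // small-L2 path (3050-class): few kangaroos, register-resident state,
        // load/store only before/after the whole batch of jumps
        puts("strategy: register-resident kangaroo state");
    }
    return 0;
}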
?
Activity: -
Merit: -
Hi! Pretty cool tool :-)
Do you plan to implement, if possible:
 - a "continue" option, in case the GPU stops and you want to resume from where it stopped?
 - a "multiple pubkey" option, like an input file with a list of pubkeys?
Thanks, and have a good one.

1. Maybe.
2. No, it's a bad idea.

I must admit you used some really clever tricks to get maximum usage out of shared memory (L1) and the L2 cache. I'm still trying to figure out how you keep track of the jump distances using shared memory instead of updating them through L2.

After adapting my own kernel to load/store state through L2 (instead of only once, before and after all the jumps) I reached 9.7 GK/s on an RTX 4090 (64 jump points, DP 32), an increase of 75% in speed, and I haven't even tried the micro-optimizations I used before. So I guess this was the missing piece of knowledge needed to go beyond the 8+ GK/s advertised by others around here, after trying every advanced optimization I could think of to speed things up.

So did you start work on solving #135?

Yes, 10 GK/s for a 4090 is OK.
And then one day you will understand that the only way to improve it further is to use symmetry and get a sqrt(2) boost. Yes, you will lose some raw speed, but the total improvement is worth it.
From the RCKangaroo readme:

Fastest ECDLP solvers will always use the SOTA method, as it's 1.39 times faster and requires less memory for DPs compared to the best 3-way kangaroos with K=1.6. Even if you already have a faster implementation of kangaroo jumps, incorporating the SOTA method will improve it further. While adding the necessary loop-handling code will cost you about 5-15% of your current speed, the SOTA method itself provides a 39% performance increase. Overall, this translates to roughly a 25% net improvement, which should not be ignored if your goal is to build a truly fast solver.
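As a sanity check on those numbers: the 1.39 is the ratio of the K factors (K = 1.6 for 3-way kangaroos vs, presumably, K of about 1.15 for SOTA, since 1.6 / 1.15 = 1.39), and applying the 5-15% loop-handling loss on top gives 1.39 x 0.95 = 1.32 down to 1.39 x 0.85 = 1.18, so "roughly 25% net" is the middle of that range.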

member
Activity: 165
Merit: 26
I must admit you used some really clever tricks to get maximum usage out of shared memory (L1) and the L2 cache. I'm still trying to figure out how you keep track of the jump distances using shared memory instead of updating them through L2.

After adapting my own kernel to load/store state through L2 (instead of only once, before and after all the jumps) I reached 9.7 GK/s on an RTX 4090 (64 jump points, DP 32), an increase of 75% in speed, and I haven't even tried the micro-optimizations I used before. So I guess this was the missing piece of knowledge needed to go beyond the 8+ GK/s advertised by others around here, after trying every advanced optimization I could think of to speed things up.

So did you start work on solving #135?
newbie
Activity: 9
Merit: 0
Some explanations about support for other GPUs:
1. I have zero interest in old cards (same for AMD cards), so I don't have them for development/tests and don't support them.
2. You can easily enable support for older NVIDIA cards and it will work, but my code is designed for the latest generation; for previous generations it's not optimal and the speed is not the best, which is why I disabled them.
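For point 2, enabling an older architecture is, as far as I can tell, mostly a matter of adding the right -gencode pairs to the nvcc build line. The flags below are real nvcc syntax (Ampere = CC 8.6, Turing = CC 7.5, Pascal = CC 6.1), but the source file name is my assumption:

Code:
nvcc -O3 RCGpuCore.cu \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_61,code=sm_61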


Hi! Pretty cool tool :-)
Do you plan to implement, if possible:
 - a "continue" option, in case the GPU stops and you want to resume from where it stopped?
 - a "multiple pubkey" option, like an input file with a list of pubkeys?

Thanks, and have a good one.
?
Activity: -
Merit: -
I'm not going to create a serious, ready-to-use open-source solution for cracking really high ranges. You should build it yourself if you want to crack #135 and earn a lot of money :)
But I'm going to update RCKangaroo to support old cards better (+ higher speed) when I have time.
It would actually be great if 20xx GPUs could be supported! And if, in addition, a speed optimization were made for the 20xx and 30xx series, even better. Thank you!

Yes, the new version will support these cards and will be at least 20-30% faster on them.
member
Activity: 127
Merit: 32
I'm not going to create a serious, ready-to-use open-source solution for cracking really high ranges. You should build it yourself if you want to crack #135 and earn a lot of money :)
But I'm going to update RCKangaroo to support old cards better (+ higher speed) when I have time.
It would actually be great if 20xx GPUs could be supported! And if, in addition, a speed optimization were made for the 20xx and 30xx series, even better. Thank you!
?
Activity: -
Merit: -
@RetiredCoder Maybe you can share networking and DP load/save options. Do you have any plans for that?

I'm not going to create a serious, ready-to-use open-source solution for cracking really high ranges. You should build it yourself if you want to crack #135 and earn a lot of money :)
But I'm going to update RCKangaroo to support old cards better (+ higher speed) when I have time.
jr. member
Activity: 47
Merit: 1
@RetiredCoder Maybe you can share networking and DP load/save options. Do you have any plans for that?
?
Activity: -
Merit: -
“Your implementation of the Kangaroo method to solve the ECDLP (Elliptic Curve Discrete Logarithm Problem) is interesting,

Stop posting AI BS here; I will remove it every time.
?
Activity: -
Merit: -
If one RTX 4090 card can do it in +/- 249,934 days (68 years), then eighty RTX 4090 cards can do it in less than a year.


Who taught you mathematics?
249,934 days = 684 years.



With this math, we'd need 400+ RTX 4090 GPUs to solve #135 in about 1.7 years, or 1000+ GPUs to do it in roughly half a year.
If I'm not wrong, that "68 years" figure is a hallucination.
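For reference, the total work stays the same however you split it: 249,934 GPU-days means 80 cards would still need 249,934 / 80, about 3,124 days (roughly 8.6 years, not "less than a year"), 400 cards about 625 days (the ~1.7 years above), and about 685 cards to finish within one year.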


In this situation, as RetiredCoder tells us, one GPU costs between $0.20 and $0.30 per hour, which works out to about $180 per month, or $2,160 per year per card.

So 400 cards cost about $864,000 per year, and 1,000 cards about $2,160,000 per year, if I'm not wrong.

I think I will still be laughing about this next year.