Pages:
Author

Topic: 5-7 kangaroo method (Read 967 times)

sr. member
Activity: 1579
Merit: 267
October 21, 2024, 06:28:41 AM
#44
My point is clear: even a single kangaroo must be fast. If your kangs jump 46K per seconds each - you are on the right way Smiley
Though usually with this approach you will suffer from slow inversion.

thx. Well said, usually - since you basically solved 120 single-handed, I'm sure we both know this isn't the case, unless it's coded by a 1st year college student (or forked from github, which may be even worse). Or is it? Let's leave this as a mistery at least for a while, please, probably less than 3 guys here may understand it. You seem to know a few things, mister...

You can call me Mr. Hot.
?
Activity: -
Merit: -
January 04, 2025, 03:59:39 PM
#40
I feel like you're talking nonsense and I can't understand a single word.
newbie
Activity: 9
Merit: 0
December 29, 2024, 03:11:29 AM
#39
LLE. Damn, you made me go back to the drawing board, just thinking on what you might have used to get to such speeds. For some of the things I came up with as explanation, there's literally just a couple of results on the entire web, but if those work as advertised, I may get a 3x speedup as well, just by freeing up a lot of registers that keep useless information. There's no other way I see, since I already got 100% compute throughput in the CUDA profiler but I'm nowhere near your speed. Great, now I also have a headache.

Don't work so hard, there is no reason for that Smiley
One more tip for you. As far as I remember there is one more problem in original JLP GPU code: every GPU thread processes A LOT of kangs. It will cause an issue with DPs for high-number puzzles like #130. It's obvious but it seems nobody understands it. Don't you see it too?

I remember that JLP kangaroo only support less than 125bit interval search as the output dist is only 32*4 bits?
sr. member
Activity: 1579
Merit: 267
October 21, 2024, 06:26:57 AM
#38
That dude expands the math as you gets hooked.

My point is clear: even a single kangaroo must be fast. If your kangs jump 46K per seconds each - you are on the right way Smiley
Though usually with this approach you will suffer from slow inversion.

thx. Well said, usually - since you basically solved 120 single-handed, I'm sure we both know this isn't the case, unless it's coded by a 1st year college student (or forked from github, which may be even worse). Or is it? Let's leave this as a mistery at least for a while, please, probably less than 3 guys here may understand it. You seem to know a few things, mister...

You can call me Mr. Hot.
member
Activity: 165
Merit: 26
October 20, 2024, 03:51:40 PM
#37
My point is clear: even a single kangaroo must be fast. If your kangs jump 46K per seconds each - you are on the right way Smiley
Though usually with this approach you will suffer from slow inversion.

thx. Well said, usually - since you basically solved 120 single-handed, I'm sure we both know this isn't the case, unless it's coded by a 1st year college student (or forked from github, which may be even worse). Or is it? Let's leave this as a mistery at least for a while, please, probably less than 3 guys here may understand it. You seem to know a few things, mister...
?
Activity: -
Merit: -
October 20, 2024, 02:23:40 PM
#36
Yeah, but still, what's your point? That JLP had it wrong? I already knew this!

My point is clear: even a single kangaroo must be fast. If your kangs jump 46K per seconds each - you are on the right way Smiley
Though usually with this approach you will suffer from slow inversion.

What matters is the effective total speed, not the speed of each individual kangaroo.

If each kangaroo jumps only 180 times a sec - it really matters  Grin
member
Activity: 110
Merit: 61
October 20, 2024, 12:57:44 PM
#35
So for DP=32 every kang finds a DP every 270 days (on average).
Correct?
Now the problem is obvious, isn't it?

Yes, such a problem does exist. The kangaroos still find the DP, because of their huge amounts, but the paths are too short and the first DP found by single kangaroo is always much worse than the subsequent ones.

As for me, I launched JLP Kangaroo with specially modified grid size, group size and other slight improvements, lets call it "fast kangaroos". But I personally know programming in CUDA quite poorly, so I couldn't write a good kernel for these conditions, and I couldn't achieve the necessary speed increase so that, so to speak, quality worked, not quantity.


?
Activity: -
Merit: -
October 20, 2024, 11:44:56 AM
#34
Yeah, sorry, forgot I had changed the group size to 512, though it only impacts the initial time to compute the kangaroos, not the kernel performance. Some people here claim to get to 4 Gk/s on a RTX 4090, I did not investigate too much what the best grid size would actually be, I gave up trying as soon as the GPU -check flag was filling the stdout with "warning: xxxxx DPs were lost" and "item not found"...

Probably they use even larger grid for 4GH...
So if we have 16.7M kangs at 3GH it means that each kang jumps 180 times per sec, it's 15.5M per day.
For #130 you have to use DP=30 at least, or better 32 (especially for #135).
So for DP=32 every kang finds a DP every 270 days (on average).
Correct?
Now the problem is obvious, isn't it?
member
Activity: 165
Merit: 26
October 20, 2024, 11:33:45 AM
#33
No, I don't mean the speed itself, the problem is in DPs.
To demonstrate it, I need to know best speed parameters for 4090 with JLP's app.
When I start Kangaroo.exe with default parameters it shows "Grid(256x128)" and speed is about 1.8GH. I remember that speed can be better if use bigger grid, but don't remember best values. I think nobody uses it with default grid size if there are better values.
If you know them, I will explain the problem with exact numbers.

Optimal grid size in JLP's implementation is a hide and seek game. That's because theoretically optimal size results in much degraded slowdown. There's also a huge amount of kangaroos computed first (gridSize x 512) and you can end up with resource overload.

Since his kernel compiles to 96 registers (ccap 8.9 for RTX 4090) maybe try a grid like (128 * 2) x 512 so each SM fits 2 blocks each of 512 threads. That's close to 70 million kangaroos.

What problem was in DPs, except for the fact that there's all the chances in the world they might be too many of them and get lost?

So 128*2 * 512 = 131K threads and each thread processes a group of 128 kangs, right? It's 16.8M kangs, correct? And speed is about 3GH?

Yeah, sorry, forgot I had changed the group size to 512, though it only impacts the initial time to compute the kangaroos, not the kernel performance. Some people here claim to get to 4 Gk/s on a RTX 4090, I did not investigate too much what the best grid size would actually be, I gave up trying as soon as the GPU -check flag was filling the stdout with "warning: xxxxx DPs were lost" and "item not found"...
?
Activity: -
Merit: -
October 20, 2024, 11:06:47 AM
#32
No, I don't mean the speed itself, the problem is in DPs.
To demonstrate it, I need to know best speed parameters for 4090 with JLP's app.
When I start Kangaroo.exe with default parameters it shows "Grid(256x128)" and speed is about 1.8GH. I remember that speed can be better if use bigger grid, but don't remember best values. I think nobody uses it with default grid size if there are better values.
If you know them, I will explain the problem with exact numbers.

Optimal grid size in JLP's implementation is a hide and seek game. That's because theoretically optimal size results in much degraded slowdown. There's also a huge amount of kangaroos computed first (gridSize x 512) and you can end up with resource overload.

Since his kernel compiles to 96 registers (ccap 8.9 for RTX 4090) maybe try a grid like (128 * 2) x 512 so each SM fits 2 blocks each of 512 threads. That's close to 70 million kangaroos.

What problem was in DPs, except for the fact that there's all the chances in the world they might be too many of them and get lost?

So 128*2 * 512 = 131K threads and each thread processes a group of 128 kangs, right? It's 16.8M kangs, correct? And speed is about 3GH?
?
Activity: -
Merit: -
October 20, 2024, 09:09:12 AM
#31
No, I don't mean the speed itself, the problem is in DPs.
To demonstrate it, I need to know best speed parameters for 4090 with JLP's app.
When I start Kangaroo.exe with default parameters it shows "Grid(256x128)" and speed is about 1.8GH. I remember that speed can be better if use bigger grid, but don't remember best values. I think nobody uses it with default grid size if there are better values.
If you know them, I will explain the problem with exact numbers.
?
Activity: -
Merit: -
October 20, 2024, 06:02:46 AM
#30
LLE. Damn, you made me go back to the drawing board, just thinking on what you might have used to get to such speeds. For some of the things I came up with as explanation, there's literally just a couple of results on the entire web, but if those work as advertised, I may get a 3x speedup as well, just by freeing up a lot of registers that keep useless information. There's no other way I see, since I already got 100% compute throughput in the CUDA profiler but I'm nowhere near your speed. Great, now I also have a headache.

Don't work so hard, there is no reason for that Smiley
One more tip for you. As far as I remember there is one more problem in original JLP GPU code: every GPU thread processes A LOT of kangs. It will cause an issue with DPs for high-number puzzles like #130. It's obvious but it seems nobody understands it. Don't you see it too?
member
Activity: 165
Merit: 26
October 17, 2024, 07:38:11 AM
#29
#120 was pretty easy, just 20pcs of 4090 for six weeks.

Let's say you needed 1.36 sqrt(n) ops, that would mean

Code:
1.36*sqrt(2**119)/(14*86400)/20

op/s per each GPU, which is close to 46 Gk/s

Now, a 4090 can do at max 82 Tflop/s

That's 1782 float32 ops for a single jump.

For a jump you'd need around 6 256-bit muls and 6 additions (if inversions amortized to negligible).

To do a 256-bit mod mul you'd need around 8 int32 * 8 int32 = 64 32-bit multiplications and 56 32-bit additions = 119 operations just for the first step and around another 18 mul + a few adds or so to reduce mod N.
For a single 256-bit addition you'd need around 8 operations (16 to do mod N).

That adds up to around 170 or so ops per 256-bit mul and 16 per add.

Estimated cost: 170 * 6 + 16 * 6 = 1116 ops.

Theoretically, feasible since 1116 < 1782, but are you some kind of a genius or WTF? I didn't even put in the rest of the overhead work required.

LE: ok, I f**ed up. I noticed you said six weeks.

Still, that ends up as 15 Gk/s which is crazy fast.

---

LLE. Damn, you made me go back to the drawing board, just thinking on what you might have used to get to such speeds. For some of the things I came up with as explanation, there's literally just a couple of results on the entire web, but if those work as advertised, I may get a 3x speedup as well, just by freeing up a lot of registers that keep useless information. There's no other way I see, since I already got 100% compute throughput in the CUDA profiler but I'm nowhere near your speed. Great, now I also have a headache.
?
Activity: -
Merit: -
October 17, 2024, 07:09:58 AM
#28
Most people here don't have access to/possibility to rent much computing power, so it doesn't matter to them whether the K is 2.0, 1.7 or 1.2. That's why attempts to do some kind of magic begin.  

Yes, most people don't have resources to solve a high-number puzzles, but it does not mean that they should use magic. It's waste of time unless you are real genius who can break EC, but in this case you don't need to solve puzzles, you have better things to do. So people waste their time for nothing, better buy lottery tickets.

How many GPUs did you use to solve 120-130?

#120 was pretty easy, just 20pcs of 4090 for six weeks.

By the way, I was very surprised when I saw how quickly kangaroos get into a primitive two-point cycle.

It's not rocket science: on average every 2*SizeOfListOfJumps jumps.

Problem is on dealing with cycles on a GPU, oh well, let me just say that having even a couple of registers used for something else than performing some parts of some jump of some kangaroo, can decrease the speed by well over 100%. Not to mention the extra required access to memory when the registers are all full (to "deal with the cycles" one way or the other) bringing the speed down even further.

You just don't have enough skills in GPU coding.
member
Activity: 165
Merit: 26
October 17, 2024, 03:58:24 AM
#27
Fruitless cycles is the main problem of using equivalence, but there is few good ways to deal with them. By the way, I was very surprised when I saw how quickly kangaroos get into a primitive two-point cycle.

How does "good ways" look like? Without decreasing the overall speed more than ~35% (that would be the point at which dealing with the cycles has no benefits over not having them at all, e.g going from 1.36 to 1.72).

Of course when you're running on a CPU, dealing with the cycles is almost un-noticeable, so it's worth it, and can indeed bring down the complexity to 1.36 or even lower (1.25 at the best).But this benefit is quickly offset by the much more performant GPU devices.

Problem is on dealing with cycles on a GPU, oh well, let me just say that having even a couple of registers used for something else than performing some parts of some jump of some kangaroo, can decrease the speed by well over 100%. Not to mention the extra required access to memory when the registers are all full (to "deal with the cycles" one way or the other) bringing the speed down even further.

Not sure of RetiredCoder's method, technically if he solved 120 as he said ("crazy fast kangaroos") then he had no cycles to deal with at all. Or he found some clever way to deal with them in a way that doesn't greatly bring a GPU to a halt... but I'm not counting on that, it would be a waste of resources IMHO.
member
Activity: 110
Merit: 61
October 17, 2024, 02:12:30 AM
#26
Such papers is just a way to get a degree, zero new ideas.
And all this is not important actually.
The main problem that you won't get better than K=1.7 this way even if you use 1000 instead of 3-4-5-7 kangs.
K=1.7 is boring, really. It's like solving puzzles #6x  Grin
Why don't you try this? https://www.iacr.org/archive/pkc2010/60560372/60560372.pdf
Going this way (a long way btw) you can "invent" something interesting and get real K (not in theory but real) around 1.2 if you are smart enough Cheesy I call this approach "mirrored" kangs.
But, of course, this way is not so easy as "normal" kangs that always jump in one direction.
Just a tip for you because it seems you read some papers instead of creating magic circles/code as most people here Cheesy
Though I doubt that you will break #135, it's always fun to invent and discover so you can spend a lot of time having fun...

Most people here don't have access to/possibility to rent much computing power, so it doesn't matter to them whether the K is 2.0, 1.7 or 1.2. That's why attempts to do some kind of magic begin. How many GPUs did you use to solve 120-130?

Fruitless cycles is the main problem of using equivalence, but there is few good ways to deal with them. By the way, I was very surprised when I saw how quickly kangaroos get into a primitive two-point cycle.

Also, thanks for sharing 120 PK.
member
Activity: 165
Merit: 26
October 11, 2024, 04:29:54 AM
#25
John M. Pollard: “Randomness in these algorithms is crucial for their functioning.”

Impress me a little. First, where is that citation?

Secondly, if you make claims, it is not our responsibility to prove you are wrong, but YOUR responsibility to prove you are right.

But I'll make an exception for you, it is your lucky day Smiley

So, how exactly do you use randomness in your Kangaroo understanding of how the algorithm works?

And how exactly you break the birthday paradox, and ruin the stuff entirely, let us understand what do you mean by "making the algorithm more deterministic" by "removing the randomness"? What randomness to remove in the first place? Where is that located, and how do you mess around with it, to be able to claim that you create determinism out of uncontrollable chaos?


By applying precomputation in the Kangaroo algorithm, greater efficiency is achieved because precomputed intermediate results are reused instead of recalculating them each time. This reduces the number of necessary operations and, consequently, the computation time.

However, by reducing the number of operations, the probability of collisions also decreases. This is because the birthday paradox is based on the probability that two elements in a large set will coincide. With fewer operations, there are fewer opportunities for collisions to occur.

What mental institution did you ask ChatGPT to pretend to come out from before producing these paragraphs? I cannot find a single phrase in this quote that makes common sense. Did you even provide the correct tokens to GPT, to begin with? Seems like you asked something related to some birthday paradox, and made GPT hallucinate a little about what precomputing means for kangaroo, while this topic discusses totally different things. Maybe read some actual real-life studies before making a shame of yourself again.
jr. member
Activity: 47
Merit: 12
gmaxwell creator of 1000 BTC puzzl + Pinapple fund
October 10, 2024, 08:52:39 PM
#24
Why do people think they are smarter than studied people that spent many month with big brains on this topic?
Kangaroo cannot be optimized more than what you can find publicly.
Accept YOU won't crack 135bit with kangaroo.

Edit: Don't bother with some ~10% performance optimizations. Maybe some nerds care.
But in the end you feed this to the big players so they can crack it. Good job!
newbie
Activity: 24
Merit: 2
October 10, 2024, 06:06:34 PM
#23
John M. Pollard: “Randomness in these algorithms is crucial for their functioning.”

Impress me a little. First, where is that citation?

Secondly, if you make claims, it is not our responsibility to prove you are wrong, but YOUR responsibility to prove you are right.

But I'll make an exception for you, it is your lucky day Smiley

So, how exactly do you use randomness in your Kangaroo understanding of how the algorithm works?

And how exactly you break the birthday paradox, and ruin the stuff entirely, let us understand what do you mean by "making the algorithm more deterministic" by "removing the randomness"? What randomness to remove in the first place? Where is that located, and how do you mess around with it, to be able to claim that you create determinism out of uncontrollable chaos?


By applying precomputation in the Kangaroo algorithm, greater efficiency is achieved because precomputed intermediate results are reused instead of recalculating them each time. This reduces the number of necessary operations and, consequently, the computation time.

However, by reducing the number of operations, the probability of collisions also decreases. This is because the birthday paradox is based on the probability that two elements in a large set will coincide. With fewer operations, there are fewer opportunities for collisions to occur.
member
Activity: 165
Merit: 26
October 10, 2024, 05:44:36 PM
#22
John M. Pollard: “Randomness in these algorithms is crucial for their functioning.”

Impress me a little. First, where is that citation?

Secondly, if you make claims, it is not our responsibility to prove you are wrong, but YOUR responsibility to prove you are right.

But I'll make an exception for you, it is your lucky day Smiley

So, how exactly do you use randomness in your Kangaroo understanding of how the algorithm works?

And how exactly you break the birthday paradox, and ruin the stuff entirely, let us understand what do you mean by "making the algorithm more deterministic" by "removing the randomness"? What randomness to remove in the first place? Where is that located, and how do you mess around with it, to be able to claim that you create determinism out of uncontrollable chaos?
Pages:
Jump to: