Pollard's kangaroo ECDLP solver - page 112.

Etar

sr. member

Activity: 654

Merit: 316

I am working on a pool to solve keys together.
Here is the scheme of how it should work:

HelpServer and HelpClient are needed to map each DP to a specific account.
As you can see in the next picture, each DP sent is added to a specific account.

So far, the biggest problem is DP verification. The CPU cannot check every DP; it simply does not have enough resources for that.
As an option, to prevent forged DPs, HelpServer can check a few percent of the DPs sent,
and ban the client if it sends wrong points.
Maybe someone knows how to check points easily and efficiently, because multiplying the distance by G is a resource-intensive process.
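
A minimal sketch of the spot-check idea, not the pool's actual code. Assumptions: a tame DP is submitted as (x, distance) with point = distance*G, a DP is an x coordinate whose top dp_bits bits are zero (as the mask printed by Kangaroo suggests), and the names verify_tame_dp and spot_check are hypothetical. A wild DP would need one extra point addition with the target public key.

```python
import random

# secp256k1 domain parameters
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def ec_add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

def ec_mul(k, pt):
    """Plain double-and-add scalar multiplication (slow but simple)."""
    acc = None
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

def is_dp(x, dp_bits):
    """Assumed DP rule: the top dp_bits bits of x are zero."""
    return x >> (256 - dp_bits) == 0

def verify_tame_dp(x, dist, dp_bits):
    """Recompute dist*G and compare with the submitted x coordinate."""
    pt = ec_mul(dist % N, G)
    return pt is not None and pt[0] == x and is_dp(x, dp_bits)

def spot_check(submissions, dp_bits, rate, rng=random):
    """Verify a random fraction `rate` of (x, dist) pairs; False means ban."""
    for x, dist in submissions:
        if rng.random() < rate and not verify_tame_dp(x, dist, dp_bits):
            return False
    return True
```

With rate = 0.02 the server does about one scalar multiplication per 50 DPs received. A client that forges a fraction f of its points slips past a single check with probability 1 - 0.02*f, so it is very likely to be caught over many submissions.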

WanderingPhilospher

full member

Activity: 1232

Merit: 242

Shooters Shoot...

Quote from: arulbero on June 02, 2020, 01:25:13 PM

Quote from: WanderingPhilospher on June 02, 2020, 01:11:08 PM

So is the reason GPUs find the solution faster the sheer number of kangaroos they bring to the hunt? For example, I can find a solution 50 times faster with a GPU than with a CPU, but the GPU has more than 50 times as many kangaroos. Something like that?

Exactly.

For example look at the number of the kangaroos and at the speed for my cpu:

.\Kangaroo -t 4 .\in.txt
Start:49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5E0000000000000000
Stop :49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5EFFFFFFFFFFFFFFFF
Keys :2
Number of CPU thread: 4
Range width: 2^64
Number of random walk: 2^12.00 (Max DP=18)
DP size: 18 [0xFFFFC00000000000]
SolveKeyCPU Thread 0: 1024 kangaroos
SolveKeyCPU Thread 2: 1024 kangaroos
SolveKeyCPU Thread 1: 1024 kangaroos
SolveKeyCPU Thread 3: 1024 kangaroos
[17.67 MKey/s][GPU 0.00 MKey/s][Count 2^26.19][Dead 0][06s][1.1MB]

and for my gpu:

.\Kangaroo -gpu -t 4 .\in.txt
Start:49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5E0000000000000000
Stop :49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5EFFFFFFFFFFFFFFFF
Keys :2
Number of CPU thread: 4
Range width: 2^64
Number of random walk: 2^19.01 (Max DP=10)
DP size: 10 [0xFFC0000000000000]
SolveKeyCPU Thread 1: 1024 kangaroos
SolveKeyCPU Thread 3: 1024 kangaroos
SolveKeyCPU Thread 2: 1024 kangaroos
SolveKeyCPU Thread 0: 1024 kangaroos
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(16x256) (57.0 MB used)
SolveKeyGPU Thread GPU#0: creating kangaroos...
SolveKeyGPU Thread GPU#0: 2^19.00 kangaroos in 3848.2ms
[68.93 MKey/s][GPU 37.47 MKey/s][Count 2^29.12][Dead 0][14s][48.3MB]

CPU: 1024 kangaroos times 4, speed: 17.67 MKey/s
GPU: 2^19 kangaroos, speed: 69 MKey/s

kangaroo speed on cpu: 17.67 MKey/s / 4096 = 4314 key/s
kangaroo speed on gpu: 68.93 MKey/s / 2^19 = 131 key/s

In my case each kangaroo on the cpu moves 33 times faster than a kangaroo on the gpu.

But my cpu has 4096 = 2^12 kangaroos against 2^19 kangaroos on the gpu (128 times fewer)

Ahhhh...makes sense. Thank you for taking the time to respond. I appreciate it.

COBRAS

member

Activity: 873

Merit: 22

$$P2P BTC BRUTE.JOIN NOW ! https://uclck.me/SQPJk

And everybody, do not forget to share your profit with Jean_Luc !!! All normal people share your profits, every single person, don't forget where and where they are, don't forget their fucking and shitty country, and don't forget people like Jean_Luc who help you get out of your shitty country and poverty ...

COBRAS

member

Activity: 873

Merit: 22

Quote from: MrFreeDragon on June 02, 2020, 02:38:13 PM

Quote from: COBRAS on June 02, 2020, 02:00:50 PM

-snip-
Lets Ford will be winner Grin

But everyone who does not have 20 GPUs will be losers... because Kangaroo is not adapted for FPGAs

All the others have luck! That means there is no need to have all the required DPs... With luck you can solve #115 even on a single CPU core Roll Eyes

Buddy, I think you meant to say: all the others have the *uck ..... although of course this is unacceptably dirty, black humor. But this can be true with 90% probability

MrFreeDragon

sr. member

Activity: 443

Merit: 350

Quote from: COBRAS on June 02, 2020, 02:00:50 PM

-snip-
Lets Ford will be winner Grin

But everyone who does not have 20 GPUs will be losers... because Kangaroo is not adapted for FPGAs

All the others have luck! That means there is no need to have all the required DPs... With luck you can solve #115 even on a single CPU core Roll Eyes

COBRAS

member

Activity: 873

Merit: 22

Quote from: Jean_Luc on June 02, 2020, 01:33:54 PM

Quote from: HardwareCollector on June 02, 2020, 01:04:22 PM

Good luck and you guys have way more computing power than you are willing to commit. Cool

I am going straight for the knowledge, world record, and bragging rights for secp256k1 130-bit private key (129-bit interval). But I only have 4TB of RAM and still working on how to minimize storage below 16bytes/point with data compression techniques. Grin

May the best win Cheesy

Lets Ford will be winner Grin

But everyone who does not have 20 GPUs will be losers... because Kangaroo is not adapted for FPGAs

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: HardwareCollector on June 02, 2020, 01:04:22 PM

Good luck and you guys have way more computing power than you are willing to commit. Cool

I am going straight for the knowledge, world record, and bragging rights for secp256k1 130-bit private key (129-bit interval). But I only have 4TB of RAM and still working on how to minimize storage below 16bytes/point with data compression techniques. Grin

May the best win Cheesy

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: WanderingPhilospher on June 02, 2020, 01:11:08 PM

So is the reason GPUs find the solution faster the sheer number of kangaroos they bring to the hunt? For example, I can find a solution 50 times faster with a GPU than with a CPU, but the GPU has more than 50 times as many kangaroos. Something like that?

Exactly.

For example look at the number of the kangaroos and at the speed for my cpu:

.\Kangaroo -t 4 .\in.txt
Start:49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5E0000000000000000
Stop :49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5EFFFFFFFFFFFFFFFF
Keys :2
Number of CPU thread: 4
Range width: 2^64
Number of random walk: 2^12.00 (Max DP=18)
DP size: 18 [0xFFFFC00000000000]
SolveKeyCPU Thread 0: 1024 kangaroos
SolveKeyCPU Thread 2: 1024 kangaroos
SolveKeyCPU Thread 1: 1024 kangaroos
SolveKeyCPU Thread 3: 1024 kangaroos
[17.67 MKey/s][GPU 0.00 MKey/s][Count 2^26.19][Dead 0][06s][1.1MB]

and for my gpu:

.\Kangaroo -gpu -t 4 .\in.txt
Start:49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5E0000000000000000
Stop :49DCCFD96DC5DF56487436F5A1B18C4F5D34F65DDB48CB5EFFFFFFFFFFFFFFFF
Keys :2
Number of CPU thread: 4
Range width: 2^64
Number of random walk: 2^19.01 (Max DP=10)
DP size: 10 [0xFFC0000000000000]
SolveKeyCPU Thread 1: 1024 kangaroos
SolveKeyCPU Thread 3: 1024 kangaroos
SolveKeyCPU Thread 2: 1024 kangaroos
SolveKeyCPU Thread 0: 1024 kangaroos
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(16x256) (57.0 MB used)
SolveKeyGPU Thread GPU#0: creating kangaroos...
SolveKeyGPU Thread GPU#0: 2^19.00 kangaroos in 3848.2ms
[68.93 MKey/s][GPU 37.47 MKey/s][Count 2^29.12][Dead 0][14s][48.3MB]

CPU: 1024 kangaroos times 4, speed: 17.67 MKey/s
GPU: 2^19 kangaroos, speed: 69 MKey/s

kangaroo speed on cpu: 17.67 MKey/s / 4096 = 4314 key/s
kangaroo speed on gpu: 68.93 MKey/s / 2^19 = 131 key/s

In my case each kangaroo on the cpu moves 33 times faster than a kangaroo on the gpu.

But my cpu has 4096 = 2^12 kangaroos against 2^19 kangaroos on the gpu (128 times fewer)
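
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the figures from the two logs. Note that the 68.93 MKey/s figure is the combined CPU + GPU rate printed by the program; with the GPU-only 37.47 MKey/s, each GPU kangaroo would come out at roughly 71 key/s, even slower.

```python
# Figures copied from the two Kangaroo runs above.
cpu_rate = 17.67e6            # keys/s, CPU-only run
cpu_kangaroos = 4 * 1024      # 4 threads x 1024 kangaroos
total_rate = 68.93e6          # keys/s, GPU run (includes the CPU threads)
gpu_kangaroos = 2**19

cpu_per_kangaroo = cpu_rate / cpu_kangaroos
gpu_per_kangaroo = total_rate / gpu_kangaroos

print(round(cpu_per_kangaroo))                     # 4314
print(round(gpu_per_kangaroo))                     # 131
print(round(cpu_per_kangaroo / gpu_per_kangaroo))  # 33
print(gpu_kangaroos // cpu_kangaroos)              # 128
```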

HardwareCollector

member

Activity: 144

Merit: 10

Quote from: WanderingPhilospher on June 02, 2020, 01:06:49 PM

How many Chrome tabs can you open with 4TB of RAM?

Not on a single server, a distributed hash table on 8 servers with 512GB per server.

WanderingPhilospher

full member

Activity: 1232

Merit: 242

Shooters Shoot...

Quote from: arulbero on June 02, 2020, 11:47:01 AM

Quote from: WanderingPhilospher on June 01, 2020, 02:37:48 PM

DP count is most important. You are running at DP 25, so it will take 2^57 (whatever expected group ops is for 115) - 2^25 (your DP setting) so roughly you want your DP count up to 2^32 to be getting close to solving.

I have files for DP 30. So I need to get close to 2^27 to be getting close to solving. 2^57(expected ops) - 2^30(my DP setting) = 2^27 DP count .

To be precise, 2^57 (expected ops) / 2^30 (my DP setting) = 2^(57-30) = 2^27 DP count.

There are 3 phases in this algorithm:

phase 1)

to generate a collision, you need to form on average N pairs of points (T, W) to get 1 pair with the same point, a collision (exploiting the birthday paradox)

that is because in a space of N points there are N possible pairs with the same point among N^2 possible pairs, so the chance to get a collision each time you generate a pair (T, W) is N/N^2 = 1/N, therefore you need N tries to get a collision.

To form N pairs, you need 2 lists of sqrt(N) points each (because sqrt(N) * sqrt(N) = N)

This is the heaviest part of the algorithm, 2*sqrt(N) steps; therefore the idea is to parallelize this task, and at this stage parallelization works well, especially on gpus.

phase 2)

you cannot compare every pair of points you generate, because there are N possible combinations before you get a collision, too many;

the idea is to choose each jump in such a way that the next point depends only on the current point; this trick makes an occasional collision between 2 kangaroos permanent, and we don't need to check every point against all the others to know if a collision has happened.

To limit the delay between the moment of the collision (after about 2*sqrt(N) steps) and the moment the collision is detected, we store the distinguished points. On average a 30-bit distinguished point is met every 2^30 steps. Only then can we know if that kangaroo has collided with another one.

This phase is strange: if you are using a cpu, a delay of 2^30 steps is nothing, a single core of a cpu can perform 2^30 steps in a few minutes.

But for the slow gpu it is not the same, as BitCrack noted:

Quote from: BitCrack on June 02, 2020, 07:32:46 AM

That 2^27 figure assumes the average walk length is 2^30. The GPU works by doing many slow walks in parallel e.g. 60 million walks that do 20 iterations per second. At that rate, it will take 2^30 / 20 seconds = 1.7 years before any of the walks are 2^30. Your DP count is going to be a few powers of 2 higher than 27.

On the other hand, the choice of the distinguished points is not only about the delay, it is about storage too, because we can't store 2*sqrt(N) points (if we could, we would use the BSGS algorithm, which is faster and is guaranteed to find the key).

With a high DP value we sample infrequently, so we have few samples to store in the hash table, but we waste a lot of time on the gpu (overhead).

With a lower DP value we sample frequently, but we have to deal with the limit of our RAM.

You can look at the DP value this way too: by how much do you need to reduce the points (among the 2*sqrt(N) points generated) that go into the hash table?

DP = 30 means you choose to reduce by a factor of 2^30 the storage of the 2^57 points needed to get a collision (2^57 / 2^30 = 2^27 DPs in the hash table). But this way you delay the detection of the collision by 2^30 steps (and the wasted computation grows by a factor of 2^30: 2^30 steps * number of kangaroos in parallel, all of them useless because they are performed after the collision has already happened)

In Zielar's case, from what I understood, there were about 2^33 kangaroos in parallel and DP = 22, N = 2^109, then:

total steps: 2*sqrt(2^109) = 2^55.5

hash table: 2^55.5 / 2^22 = 2^33.5

number of steps generated by each computing unit in parallel: 2^55.5 / 2^33 = 2^22.5, so each computing unit has generated on average 1.4 kangaroos (1.4 walks from the start point to the end point, the DP)

phase 3)

When we know which wild kangaroo collided, if we don't know the distance covered by that kangaroo we only need to redo the steps of that kangaroo up to the DP, and with a cpu this work can be finished rapidly.

So is the reason GPUs find the solution faster the sheer number of kangaroos they bring to the hunt? For example, I can find a solution 50 times faster with a GPU than with a CPU, but the GPU has more than 50 times as many kangaroos. Something like that?

WanderingPhilospher

full member

Activity: 1232

Merit: 242

Shooters Shoot...

Quote from: HardwareCollector on June 02, 2020, 01:04:22 PM

Quote from: Jean_Luc on June 02, 2020, 12:47:26 PM

This is a race, so we keep a few settings secret.
We will probably end to fight after #120.
However, no news from Zielar, he is probably sleeping Cheesy

Good luck and you guys have way more computing power than you are willing to commit. Cool

I am going straight for the knowledge, world record, and bragging rights for secp256k1 130-bit private key (129-bit interval). But I only have 4TB of RAM and still working on how to minimize storage below 16bytes/point with data compression techniques. Grin

How many Chrome tabs can you open with 4TB of RAM?

WanderingPhilospher

full member

Activity: 1232

Merit: 242

Shooters Shoot...

Quote from: WhyMe on June 02, 2020, 07:26:57 AM

Quote from: WanderingPhilospher on June 02, 2020, 02:19:23 AM

Quote from: Jean_Luc on June 02, 2020, 01:40:52 AM

Quote from: arulbero on May 31, 2020, 05:51:55 AM

Can you explain yourself better? What is 'Ygrid' ?

Ygrid is the second coordinate of the GPU thread array.
It affects performance. This is very useful for linear algebra because you can use thread coordinates to do fast matrix calculations.
In our case, this is just used to tune performance.

I was away for the weekend, and yesterday was also a day off in France; I am restarting work today. I also have professional work to do, but I will continue to push updates to github when I can, for everybody. I will try to integrate ASAP the mods from PatatasFritas and support for -ws in client mode. News from AndrewBrz, who is progressing well with the OpenCL kernel.

Zielar solved #110 with official 1.8 and 1.9alpha for merging.

It'll be interesting to see what my Vega VIIs can do. They murder all the RTX cards when mining ethash (90 to 100 Mh/s compared to 40-55 Mh/s). I know mining is different but we shall see how they perform when it comes to this program. Hopefully Andrew has success, at least for himself.

Is someone working on OpenCL support?

Yes, that link JLP posted to his Github, under the Issues tab, Optimization subject...AndrewBrz(?). He posted a pic of the progress he has made so far.

HardwareCollector

member

Activity: 144

Merit: 10

Quote from: Jean_Luc on June 02, 2020, 12:47:26 PM

This is a race, so we keep a few settings secret.
We will probably end to fight after #120.
However, no news from Zielar, he is probably sleeping Cheesy

Good luck and you guys have way more computing power than you are willing to commit. Cool

I am going straight for the knowledge, world record, and bragging rights for secp256k1 130-bit private key (129-bit interval). But I only have 4TB of RAM and still working on how to minimize storage below 16bytes/point with data compression techniques. Grin

COBRAS

member

Activity: 873

Merit: 22

Quote from: Jean_Luc on June 02, 2020, 12:47:26 PM

We will probably end to fight after #120.

Jean_Luc

sr. member

Activity: 462

Merit: 701

This is a race, so we keep a few settings secret.
We will probably end to fight after #120.
However, no news from Zielar, he is probably sleeping Cheesy

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on June 02, 2020, 12:25:36 PM

Zielar solved #110 by merging 2 files of 2^29.55 DP each = 2^30.55 => DP25 => total of 2^(30.55 + 25) operations = ~2^(109/2+1) , a little bit after the average.

Ok, then:

total steps: 2^55.55 = 1.035 * 2*sqrt(N) (3.5% above the average)

hash table: 2^55.55 / 2^25 = 2^30.55

how many kangaroos were used in parallel?

I suppose about 2^30, probably more; that means that each computing unit has generated only 1 DP on average?
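
These figures can be checked quickly in Python. The DP count, the DP size and N = 2^109 come from Jean_Luc's post; the 2^30 kangaroos is only the guess made above, not a confirmed number.

```python
dp_bits = 25
log2_dps = 29.55 + 1                 # two merged files of 2^29.55 DPs each
log2_ops = log2_dps + dp_bits        # total group operations, as a power of 2
log2_expected = 1 + 109 / 2          # 2*sqrt(N) with N = 2^109

ratio = 2 ** (log2_ops - log2_expected)
print(round(log2_ops, 2))            # 55.55
print(round(ratio, 3))               # 1.035, i.e. about 3.5% above the average

log2_kangaroos = 30                  # assumption: the guess from the post
dps_per_kangaroo = 2 ** (log2_dps - log2_kangaroos)
print(round(dps_per_kangaroo, 2))    # 1.46
```

So under that guess each kangaroo would have produced between 1 and 2 DPs on average.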

Jean_Luc

sr. member

Activity: 462

Merit: 701

Zielar solved #110 by merging 2 files of 2^29.55 DP each = 2^30.55 => DP25 => total of 2^(30.55 + 25) operations = ~2^(109/2+1) , a little bit after the average.

arulbero

legendary

Activity: 1968

Merit: 2130

Can someone confirm these numbers? I'm curious:

Quote from: arulbero on June 02, 2020, 11:47:01 AM

In Zielar's case, from what I understood, there were about 2^33 kangaroos in parallel and DP = 22, N = 2^109, then:

total steps: 2*sqrt(2^109) = 2^55.5

hash table: 2^55.5 / 2^22 = 2^33.5

number of steps generated by each computing unit in parallel: 2^55.5 / 2^33 = 2^22.5, so each computing unit has generated on average 1.4 kangaroos (1.4 walks from the start point to the end point, the DP) before getting a collision

How many steps did it take to retrieve #110?

HardwareCollector

member

Activity: 144

Merit: 10

@arulbero

Well said, the information above should be very useful for those new to the area.

Also, as it relates to FPGAs, those are more suited for binary curves; and GPUs such as RTX 2080 Tis, RX Vega 64s, and the like are well suited for prime curves, dollar for dollar.

If you would like to play around with FPGAs and ECDSA (secp256k1), everything you need to start is right here:
https://github.com/ZcashFoundation/zcash-fpga

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: WanderingPhilospher on June 01, 2020, 02:37:48 PM

DP count is most important. You are running at DP 25, so it will take 2^57 (whatever expected group ops is for 115) - 2^25 (your DP setting) so roughly you want your DP count up to 2^32 to be getting close to solving.

I have files for DP 30. So I need to get close to 2^27 to be getting close to solving. 2^57(expected ops) - 2^30(my DP setting) = 2^27 DP count .

To be precise, 2^57 (expected ops) / 2^30 (my DP setting) = 2^(57-30) = 2^27 DP count.

There are 3 phases in this algorithm:

phase 1)

to generate a collision, you need to form on average N pairs of points (T, W) to get 1 pair with the same point, a collision (exploiting the birthday paradox)

that is because in a space of N points there are N possible pairs with the same point among N^2 possible pairs, so the chance to get a collision each time you generate a pair (T, W) is N/N^2 = 1/N, therefore you need N tries to get a collision.

To form N pairs, you need 2 lists of sqrt(N) points each (because sqrt(N) * sqrt(N) = N)

This is the heaviest part of the algorithm, 2*sqrt(N) steps; therefore the idea is to parallelize this task, and at this stage parallelization works well, especially on gpus.

phase 2)

you cannot compare every pair of points you generate, because there are N possible combinations before you get a collision, too many;

the idea is to choose each jump in such a way that the next point depends only on the current point; this trick makes an occasional collision between 2 kangaroos permanent, and we don't need to check every point against all the others to know if a collision has happened.

To limit the delay between the moment of the collision (after about 2*sqrt(N) steps) and the moment the collision is detected, we store the distinguished points. On average a 30-bit distinguished point is met every 2^30 steps. Only then can we know if that kangaroo has collided with another one.

This phase is strange: if you are using a cpu, a delay of 2^30 steps is nothing, a single core of a cpu can perform 2^30 steps in a few minutes.

But for the slow gpu it is not the same, as BitCrack noted:

Quote from: BitCrack on June 02, 2020, 07:32:46 AM

That 2^27 figure assumes the average walk length is 2^30. The GPU works by doing many slow walks in parallel e.g. 60 million walks that do 20 iterations per second. At that rate, it will take 2^30 / 20 seconds = 1.7 years before any of the walks are 2^30. Your DP count is going to be a few powers of 2 higher than 27.

On the other hand, the choice of the distinguished points is not only about the delay, it is about storage too, because we can't store 2*sqrt(N) points (if we could, we would use the BSGS algorithm, which is faster and is guaranteed to find the key).

With a high DP value we sample infrequently, so we have few samples to store in the hash table, but we waste a lot of time on the gpu (overhead).

With a lower DP value we sample frequently, but we have to deal with the limit of our RAM.

You can look at the DP value this way too: by how much do you need to reduce the points (among the 2*sqrt(N) points generated) that go into the hash table?

DP = 30 means you choose to reduce by a factor of 2^30 the storage of the 2^57 points needed to get a collision (2^57 / 2^30 = 2^27 DPs in the hash table). But this way you delay the detection of the collision by 2^30 steps (and the wasted computation grows by a factor of 2^30: 2^30 steps * number of kangaroos in parallel, all of them useless because they are performed after the collision has already happened)
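
A small Python sketch of this trade-off, using the numbers from this post and from BitCrack's example (60 million walks at 20 iterations per second). The function name dp_tradeoff and its parameters are hypothetical and only serve the illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def dp_tradeoff(log2_ops, dp_bits, n_kangaroos, steps_per_sec):
    """Return (DPs to store, steps wasted after the collision, seconds per walk).

    log2_ops      -- expected group operations, as a power of 2
    dp_bits       -- DP size: one point in 2^dp_bits is distinguished
    n_kangaroos   -- number of walks running in parallel
    steps_per_sec -- speed of ONE kangaroo, in steps per second
    """
    stored_dps = 2 ** (log2_ops - dp_bits)
    wasted_steps = n_kangaroos * 2 ** dp_bits   # detection delay, summed over the herd
    walk_seconds = 2 ** dp_bits / steps_per_sec
    return stored_dps, wasted_steps, walk_seconds

stored, wasted, walk_s = dp_tradeoff(57, 30, 60_000_000, 20)
print(stored == 2**27)                          # True
print(round(walk_s / SECONDS_PER_YEAR, 1))      # 1.7
```

Lowering dp_bits by one doubles the hash table and halves both the per-walk delay and the wasted steps, which is the RAM versus overhead choice described above.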

In Zielar's case, from what I understood, there were about 2^33 kangaroos in parallel and DP = 22, N = 2^109, then:

total steps: 2*sqrt(2^109) = 2^55.5

hash table: 2^55.5 / 2^22 = 2^33.5

number of steps generated by each computing unit in parallel: 2^55.5 / 2^33 = 2^22.5, so each computing unit has generated on average 1.4 kangaroos (1.4 walks from the start point to the end point, the DP)

phase 3)

When we know which wild kangaroo collided, if we don't know the distance covered by that kangaroo we only need to redo the steps of that kangaroo up to the DP, and with a cpu this work can be finished rapidly.
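
The three phases can be tried out on a toy problem. The sketch below is an illustration only, not the code of the Kangaroo program. It works in the multiplicative group modulo the small prime 2^31 - 1 (with generator 7) instead of secp256k1, stores the distance with every DP so that phase 3 is a subtraction, and respawns a kangaroo that runs into a DP of its own herd. All names and parameters are hypothetical.

```python
import random

def kangaroo(g, h, p, width, dp_bits=4, n_kangaroos=8, seed=1):
    """Return k in [0, width) such that pow(g, k, p) == h (toy sizes only)."""
    rng = random.Random(seed)
    order = p - 1                          # exponents are taken modulo p - 1
    n_jumps = width.bit_length() // 2 + 2
    jumps = [1 << i for i in range(n_jumps)]
    jump_mul = [pow(g, j, p) for j in jumps]
    dp_mask = (1 << dp_bits) - 1
    table = {}                             # DP value -> (is_tame, distance)

    def spawn(is_tame):
        # Tame: g^d with known d. Wild: h * g^d, i.e. g^(k + d) with unknown k.
        d = rng.randrange(width)
        x = pow(g, d, p) if is_tame else h * pow(g, d, p) % p
        return x, d, is_tame

    herd = [spawn(i % 2 == 0) for i in range(n_kangaroos)]
    while True:
        for i, (x, d, is_tame) in enumerate(herd):
            if (x & dp_mask) == 0:         # distinguished point: low bits are zero
                seen = table.get(x)
                if seen is None:
                    table[x] = (is_tame, d)
                elif seen[0] != is_tame:   # tame meets wild: g^t == h * g^w
                    tame_d, wild_d = (d, seen[1]) if is_tame else (seen[1], d)
                    return (tame_d - wild_d) % order
                else:                      # same herd: this walk is redundant
                    herd[i] = spawn(is_tame)
                    continue
            j = x % n_jumps                # the jump depends only on the current point
            herd[i] = (x * jump_mul[j] % p, d + jumps[j], is_tame)

p, g = 2**31 - 1, 7
secret = 123456
h = pow(g, secret, p)
k = kangaroo(g, h, p, width=2**20)
assert pow(g, k, p) == h and k == secret
```

Because the jump depends only on the current point, a wild kangaroo that lands on a tame trail follows it to the next DP, and the hash table lookup there reveals the collision.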

Topic: Pollard's kangaroo ECDLP solver - page 112. (Read 60654 times)