
Topic: Pollard's kangaroo ECDLP solver - page 67. (Read 59389 times)

full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 21, 2021, 03:05:54 PM
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?

Exactly, but I don't think it will decrease the performance dramatically.

The way I imagine the system, the clients (lots of ARM CPUs) will be dedicated only to running wild or tame kangaroos and performing pure increments of the random walk.

If a distinguished point is found (I have to make this check as fast as possible; for example, an AND mask between the first of the 5 limbs (5*52 bits) of X and the desired DP mask),
the client sends the X coordinate through a socket to the centralized server (a computer with a hashtable, continuously checking whether a received x coordinate collides with a previously stored one).
I don't think it will slow the computation significantly, because a client sends a message to the server only 1 time in 2^DP on average, and the pause in computation during the socket communication can be optimized to be short.
 

I don't know your setup so I'm just asking questions.

So each tame or wild thread must jump, check for a DP; if it's a DP, send it to the hashtable and then jump again; if not, jump again; rinse and repeat.

In your test, you merely walked through 1 billion random points, correct?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 02:56:45 PM
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?

Exactly, but I don't think it will decrease the performance dramatically.

The way I imagine the system, the clients (lots of ARM CPUs) will be dedicated only to running wild or tame kangaroos and performing pure increments of the random walk.

If a distinguished point is found (I have to make this check as fast as possible; for example, an AND mask between the first of the 5 limbs (5*52 bits) of X and the desired DP mask),
the client sends the X coordinate through a socket to the centralized server (a computer with a hashtable, continuously checking whether a received x coordinate collides with a previously stored one).
I don't think it will slow the computation significantly, because a client sends a message to the server only 1 time in 2^DP on average, and the pause in computation during the socket communication can be optimized to be short.
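A minimal sketch of that DP test (my own illustration with hypothetical names, assuming the x coordinate is held as 5 x 52-bit limbs with n[0] the least significant, and a TCP socket to the server already open):

Code:
#include <stdint.h>
#include <unistd.h>   /* write() */

#define DP_BITS 24                                  /* example DP size   */
#define DP_MASK ((UINT64_C(1) << DP_BITS) - 1)

typedef struct { uint64_t n[5]; } fe52;             /* x, 5*52-bit limbs */

/* One AND plus one compare on the first limb: the point is
   "distinguished" when the lowest DP_BITS bits of x are all zero. */
static inline int is_distinguished(const fe52 *x) {
    return (x->n[0] & DP_MASK) == 0;
}

/* Ship the raw limbs (plus the kangaroo's distance, not shown here) to
   the central server; fires only ~1 jump in 2^DP_BITS on average. */
static void report_dp(int sock_fd, const fe52 *x) {
    write(sock_fd, x->n, sizeof x->n);
}

With DP_BITS = 24, a client doing 30 MKeys/s would report only a couple of points per second, so the socket overhead should stay negligible.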
 
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 21, 2021, 02:26:04 PM


If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.

In fact, this speed is achieved with a modified version of the secp256k1 library from Bitcoin Core, where all the verification instructions have been removed from the field and group functions.
The field_5x52.h implementation (5 limbs of 52 bits each, instead of the 10x26 version used on 32-bit x86 CPUs) is automatically included for 64-bit architectures, and you're right, it's maybe 10x faster.

I will study the NEON instructions to improve it, but it will be difficult; note also that this benchmark doesn't include the DP detection code.
It's only a pure random-walk benchmark over 1 billion points.

 
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 01:42:39 PM


If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.

In fact, this speed is achieved with a modified version of the secp256k1 library from Bitcoin Core, where all the verification instructions have been removed from the field and group functions.
The field_5x52.h implementation (5 limbs of 52 bits each, instead of the 10x26 version used on 32-bit x86 CPUs) is automatically included for 64-bit architectures, and you're right, it's maybe 10x faster.
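For reference, a rough sketch of what that layout looks like (my own illustration; the real library uses its own type names): the 256-bit field element is split across five 64-bit words carrying 52 significant bits each, so limb products fit comfortably in 128-bit intermediates, while the 32-bit fallback needs ten 26-bit limbs.

Code:
#include <stdint.h>

/* x = n[0] + n[1]*2^52 + n[2]*2^104 + n[3]*2^156 + n[4]*2^208
   (the top limb carries the remaining bits; the spare bits per limb let
   several additions accumulate before carries are propagated). */
typedef struct { uint64_t n[5]; } fe_5x52;   /* 64-bit build    */

typedef struct { uint32_t n[10]; } fe_10x26; /* 32-bit fallback */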

I will study the NEON instructions to improve it, but it will be difficult; note also that this benchmark doesn't include the DP detection code.
It's only a pure random-walk benchmark over 1 billion points.

 
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 01:01:33 PM
The first benchmarks on ARM (Raspberry Pi 400 without overclocking) are better than I expected (around 30 MKeys/s with 4 cores).

That's an impressive speed, considering that a Xeon E3-1230 @ 3.2GHz with 4c/8t only does 10 MKeys/s and that the CPU inside this is:

Code:
Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz

So clock speed and thread count are halved on this silicon, yet it's 3x faster. And this makes sense because x86_64 has 16 general-purpose 64-bit registers and 16 128-bit SSE registers (Kangaroo doesn't use AVX yet), while armv8-a (the arch used in the Cortex-A72) has 31 general-purpose 64-bit registers plus 32 128-bit NEON registers.

x86 was always an inefficient arch anyway because of all the backward compatibility it had to preserve. With ARM it's like "recompile for our new generation or else" and stuff written for armv8 won't work on armv7 AFAIK.

Do you know of a very low-cost board (mini PC) with an ARM CPU (similar to the Raspberry Pi Compute Module 4, see link below) to do the work (in a parallelized way), to see whether the (computing power)/price ratio might be better than that of the latest CUDA-compatible graphics cards?

A 2080 Ti costs $1000 and can do about ~1100 MKeys/s by my guess. A Raspberry Pi 4 costs $35, so you can buy about 28 of those for each 2080 Ti, and all those Pi 4s combined can do 30x28 = 840 MKeys/s, which is only a little slower than the GPU's speed. If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.
jr. member
Activity: 56
Merit: 26
March 21, 2021, 12:29:32 PM
Hi everybody,
I'm writing a light beta library in C to run the Pollard kangaroo algorithm on ARM64 CPUs.

I am at the very beginning of this project, but I am able to run benchmarks.

When the library is functional, I will post it on GitHub.

The library (a shared object file) will be callable from Python with the ctypes module.
The herds of kangaroos will be controlled directly from Python (for convenience and ease).
The idea is to have a lot of clients running on low-cost ARM64 CPUs computing parallelized secp256k1 point additions.
The memory (a client's DP points) will be returned via a client-server socket data transfer (the server holds the main DP hashtable), as in the Kangaroo software from Jean-Luc Pons.
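As a sketch of what that interface could look like (hypothetical names; the real library isn't published yet), the shared object would export plain C entry points that Python loads with ctypes.CDLL:

Code:
#include <stdint.h>

/* Hypothetical exported API for the .so; on the Python side something
   like lib = ctypes.CDLL("./libkangaroo.so") would load it. */

/* Run a herd of tame/wild kangaroos over [range_start, range_end] for
   the given compressed public key, reporting DPs with dp_bits trailing
   zero bits to the server at host:port. */
int kangaroo_run(const uint8_t pubkey33[33],
                 const uint8_t range_start[32],
                 const uint8_t range_end[32],
                 int dp_bits,
                 const char *server_host,
                 uint16_t server_port);

/* Total jumps performed so far, for the Python-side speed display. */
uint64_t kangaroo_jump_count(void);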

The first benchmarks on ARM (Raspberry Pi 400 without overclocking) are better than I expected (around 30 MKeys/s with 4 cores).

What do you think about this performance?
Do you know of a very low-cost board (mini PC) with an ARM CPU (similar to the Raspberry Pi Compute Module 4, see link below) to do the work (in a parallelized way), to see whether the (computing power)/price ratio might be better than that of the latest CUDA-compatible graphics cards?

https://www.raspberrypi.org/products/compute-module-4/?variant=raspberry-pi-cm4001000



member
Activity: 406
Merit: 47
March 21, 2021, 06:54:39 AM

Did you do this first?
 

Yes, I used nano to edit the Makefile.


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 06:00:59 AM
~snip

Did you do this first?

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.
member
Activity: 406
Merit: 47
March 21, 2021, 04:19:44 AM


Did you just type "make" or "make gpu=1"? (What will work is "make gpu=1 ccap=52" or 61 or 75 or whatever your CUDA compute cap is)

By the way, WSL speed will be slower than just compiling it on Windows itself, because WSL emulates system calls. The GPUs will also not be recognized unless you install the Ubuntu(?) WSL drivers from NVIDIA's download page; the Windows drivers don't work from WSL.

I used "make" only.
make compiles successfully and kangaroo runs, but I cannot use the -gpu option;
it shows the error: GPU code not compiled, use -DWITHGPU when compiling.


Code:
make gpu=1
mkdir -p obj
cd obj &&       mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


Code:
make gpu=1 ccap=52
cd obj &&       mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_52,code=sm_52 -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127

I will try installing on real Ubuntu on a small hard disk.


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 04:03:42 AM
I tried this version:
https://github.com/JeanLucPons/Kangaroo

on my Linux subsystem (Windows 10).

I compiled successfully with the command
make

and ran
./kangaroo -gpu in.txt

I got the message:

GPU code not compiled, use -DWITHGPU when compiling.

Did you just type "make" or "make gpu=1"? (What will work is "make gpu=1 ccap=52" or 61 or 75 or whatever your CUDA compute cap is)

By the way, WSL speed will be slower than just compiling it on Windows itself, because WSL emulates system calls. The GPUs will also not be recognized unless you install the Ubuntu(?) WSL drivers from NVIDIA's download page; the Windows drivers don't work from WSL.
member
Activity: 406
Merit: 47
March 21, 2021, 12:27:45 AM
I tried this version:
https://github.com/JeanLucPons/Kangaroo

on my Linux subsystem (Windows 10).

I compiled successfully with the command
make

and ran
./kangaroo -gpu in.txt

I got the message:

GPU code not compiled, use -DWITHGPU when compiling.

The GPU meter shows nothing:

[GPU 0.00 MK/s]

How can I compile with the GPU code?
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 21, 2021, 12:26:35 AM

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.

Can Kangaroo be compiled with CUDA 10.1 or 11?
I don't have CUDA 8.0.

Yes
member
Activity: 406
Merit: 47
March 21, 2021, 12:11:44 AM

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.

Can Kangaroo be compiled with CUDA 10.1 or 11?
I don't have CUDA 8.0.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 20, 2021, 11:37:17 PM

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


How can I compile Kangaroo on Windows?
The command above is for compiling on Linux, right?


For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.
member
Activity: 406
Merit: 47
March 20, 2021, 12:42:12 AM

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


How can I compile Kangaroo on Windows?
The command above is for compiling on Linux, right?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 19, 2021, 04:30:27 AM
Confirmed that my custom Kangaroo mods function properly with 125-bit, 160-bit and 256-bit ranges on CPU; I will keep you guys updated as I fix up GPU/networking/merge.
We appreciate it.

Thank you.

Seems like it is more difficult than what I had thought.  JLP said it was an easy 'mod' to take it to 256 bits.

It's not that; I'm just working under shitty environmental conditions, e.g. we just had a major power outage yesterday that lasted the entire day, and the internet's so $@&#ing slow here  Angry This was supposed to be a one-week gig.

I *feel* that it's almost done but when we apply DevOps Borat's Law "To estimate project duration we apply Celsius to Fahrenheit formula. C is internal estimate and F is what we tell PM: C x 9/5+ 32 = F days." I estimate I am about 75% of the way there.  Undecided

Silly project management.

I was thinking one just needed to store/save the points and distances in a 256-bit format versus the current 128-bit one.  The program I use is much easier... it's 256-bit and I can just change the size limit to store however many characters are in the point and distance rows.

The problem is that there are three different representations of big (>64-bit) numbers in the Kangaroo program: a fixed-width 4-element array used in the CUDA solver, the int128_t struct you showed me earlier, and the Int class which is artificially masked to 125 bits, and all of these occurrences have to be expanded or otherwise un-crippled.

And a bunch of unrelated stuff is shoved into the distance Int variables, so all of that has to either be moved somewhere else or otherwise phased out (hence my deterministic hashtable index function using an XOR of all the 64-bit parts, because apparently that index used to live in bits64[2] of the distance variable!).

The kangaroo type also had to be moved out into a 32-bit variable. The sign bit was removed completely; it isn't really needed because the Int arithmetic already reduces negative numbers from arithmetic overflow modulo the modulus.
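A minimal sketch of that kind of index function (my own reconstruction, not the actual patch), assuming the x coordinate is held as four 64-bit words:

Code:
#include <stdint.h>

#define HASH_BITS 18                             /* e.g. 2^18 buckets */

typedef struct { uint64_t bits64[4]; } u256;     /* 256-bit x value   */

/* Deterministic bucket index: XOR all four 64-bit parts together,
   then fold the result down to the table width. */
static inline uint32_t hash_index(const u256 *x) {
    uint64_t h = x->bits64[0] ^ x->bits64[1] ^ x->bits64[2] ^ x->bits64[3];
    return (uint32_t)(h ^ (h >> 32)) & ((1u << HASH_BITS) - 1);
}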

For this kangaroo program... for the tames, the distance is a private key and the point is that private key's pubkey, so the program already knows the full point/pubkey (it has already calculated it), but the pubkey was choked down to 32 characters to save RAM/file storage space... it was just a matter of storing the full pubkey (64 chars) and padding the private key/distance with zeros to equal 64 characters.

May I ask which custom program you are referring to?



Things like this are a step backwards; why does it assume my CUDA lives in cuda-8.0/?

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 19, 2021, 01:19:44 AM
Confirmed that my custom Kangaroo mods function properly with 125-bit, 160-bit and 256-bit ranges on CPU; I will keep you guys updated as I fix up GPU/networking/merge.
We appreciate it.

Seems like it is more difficult than what I had thought.  JLP said it was an easy 'mod' to take it to 256 bits.

I was thinking one just needed to store/save the points and distances in a 256-bit format versus the current 128-bit one.  The program I use is much easier... it's 256-bit and I can just change the size limit to store however many characters are in the point and distance rows.

For this kangaroo program... for the tames, the distance is a private key and the point is that private key's pubkey, so the program already knows the full point/pubkey (it has already calculated it), but the pubkey was choked down to 32 characters to save RAM/file storage space... it was just a matter of storing the full pubkey (64 chars) and padding the private key/distance with zeros to equal 64 characters.
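A rough illustration of that padding (hypothetical helper; it simply assumes the distance is kept as four 64-bit words, most significant first):

Code:
#include <stdio.h>
#include <stdint.h>

/* Write a 256-bit distance as a zero-padded 64-character hex string
   into out[65], so it lines up with the 64-character pubkey x value. */
static void distance_to_hex64(const uint64_t d[4], char out[65]) {
    snprintf(out, 65, "%016llx%016llx%016llx%016llx",
             (unsigned long long)d[0], (unsigned long long)d[1],
             (unsigned long long)d[2], (unsigned long long)d[3]);
}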
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 19, 2021, 12:57:39 AM
Confirmed that my custom Kangaroo mods function properly with 125-bit, 160-bit and 256-bit ranges on CPU; I will keep you guys updated as I fix up GPU/networking/merge.
member
Activity: 406
Merit: 47
March 18, 2021, 11:51:34 PM

Like I said, the program will give you an estimated RAM usage.  If you are using the save option, that is how much storage space you will need.

Another way to look at it: if you are trying the full range of #120, you will need to perform roughly 2 x 2^60 = 2^61 group operations. Now take the -d you are using (or may want to use) and subtract it from that exponent of 61. So if you are using -d 30, you will need to save about 2^31 points; if you are using -d 10, about 2^51 points. The lower your -d, the more storage space/RAM (if not using the save option) you will need.

Thank you
I will try again tonight
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 18, 2021, 10:27:24 PM

Option -d
(the distinguished point (DP) method):
does that mean the DP method is the long distance between the two legs, as in this image?
https://raw.githubusercontent.com/JeanLucPons/Kangaroo/master/DOC/paths.jpg
Do I understand correctly?
If I use a high distinguished point setting, how long do the kangaroos have to jump before a collision is hit?
How can I make it work with low resources? When I try the full range it uses a lot of storage; saving the work may make it slow.
Now I think maybe not saving the work could possibly help it run faster.


Like I said, the program will give you an estimated RAM usage.  If you are using the save option, that is how much storage space you will need.

Another way to look at it: if you are trying the full range of #120, you will need to perform roughly 2 x 2^60 = 2^61 group operations. Now take the -d you are using (or may want to use) and subtract it from that exponent of 61. So if you are using -d 30, you will need to save about 2^31 points; if you are using -d 10, about 2^51 points. The lower your -d, the more storage space/RAM (if not using the save option) you will need.
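A small sketch of that arithmetic (my own numbers: the 2^61 total-operation figure from above and an assumed entry size; the real program prints its own estimate): the expected number of stored DPs is roughly 2^(61 - d), so the table size scales directly with -d.

Code:
#include <stdio.h>
#include <math.h>

int main(void) {
    const double total_ops_log2 = 61.0;  /* ~2 * 2^60 operations        */
    const double entry_bytes    = 48.0;  /* assumed bytes per stored DP */

    for (int d = 10; d <= 30; d += 10) {
        double points_log2 = total_ops_log2 - d;        /* 2^(61-d) DPs */
        double gib = pow(2.0, points_log2) * entry_bytes
                     / (1024.0 * 1024.0 * 1024.0);
        printf("-d %2d -> ~2^%.0f points, ~%.0f GiB at %.0f B per entry\n",
               d, points_log2, gib, entry_bytes);
    }
    return 0;
}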