Pages:
Author

Topic: Pollard's kangaroo ECDLP solver - page 67. (Read 60189 times)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 11:49:19 PM
Would someone explain in simple terms what this kangaroo tool does exactly? Thanks.

Basically we are generating a tame/wild "kangaroo" pair, which is actually just two sets of starting points. The starting point will be different depending on whether the kangaroo is labeled "wild" or "tame".

Then add [this is point addition we're talking here] slightly different incremental terms (mod the secp256k1 group order) until they either equal each other or we have added so much term value that it exceeds the search interval.

i.e if the interval is 2^40 we can only point add that much to the starting point before we have to discard the pair and start with a new one.

To speed things up we start hundreds of kangaroo pairs in parallel and see if any wild kangaroo equals any any tame kangaroo.

This program only stores the x coordinate of points do it might look like it's just storing and adding numbers, but actually the y point is computed from the x point so it's a compact way of storing points.
newbie
Activity: 12
Merit: 0
March 21, 2021, 09:01:49 PM
Would someone explain in simple terms what this kangaroo tool does exactly? Thanks.
jr. member
Activity: 56
Merit: 26
March 21, 2021, 03:41:57 PM
So you said you used 4 cores, how many "kangaroos" per core/thread were you running?

one kangaroo per core.

full member
Activity: 1232
Merit: 242
Shooters Shoot...
March 21, 2021, 03:31:33 PM

So each tame or wild thread, must jump, check for dp, if dp send to hashtable and then jump again, if no dp, jump again, rinse and repeat.

Yes !

In your test, you merely walked through 1 billion random points, correct?

1 billions points of random walk determined by the formula :
jD=[G,2*G,4*G...2^n-1*G]
 tamePos[j+1] =  tamePos[j] + jD[tamePos[j].x % n]    (like in Jean-Luc Pons algorithm explanation) j
So you said you used 4 cores, how many "kangaroos" per core/thread were you running?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 03:23:13 PM

So each tame or wild thread, must jump, check for dp, if dp send to hashtable and then jump again, if no dp, jump again, rinse and repeat.

Yes !

In your test, you merely walked through 1 billion random points, correct?

1 billions points of random walk determined by the formula :
jD=[G,2*G,4*G...2^n-1*G]
 tamePos[j+1] =  tamePos[j] + jD[tamePos[j].x % n]    (like in Jean-Luc Pons algorithm explanation) j
full member
Activity: 1232
Merit: 242
Shooters Shoot...
March 21, 2021, 03:05:54 PM
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?

Exactly, but i'don't think it will decreases dramatically the performance.

in the way i'm imagine the system, the clients (lots of ARM CPUs) will be only dedicated to be wild or tame kangaroo and to performs pure increments in the random walk.

If a distinguished point is found (i've to find a way to  be faster as possible) : example (AND mask beetween the first   of the 5 limbs (5*52bits)of X  and the desired DP,
 the client   send the X coordinate through a socket to the centralized server (computer with and hashtable checking continuously if a collision is found between received x coordinate and previously ones ).
I think it will  not slow significantly the computing because a client send a message to the server  only 1 on 2^DP times on average and a the stop of computing during the socket communication can be optimised to be fast.
 

I don't know your setup so I'm just asking questions.

So each tame or wild thread, must jump, check for dp, if dp send to hashtable and then jump again, if no dp, jump again, rinse and repeat.

In your test, you merely walked through 1 billion random points, correct?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 02:56:45 PM
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?

Exactly, but i'don't think it will decreases dramatically the performance.

in the way i'm imagine the system, the clients (lots of ARM CPUs) will be only dedicated to be wild or tame kangaroo and to performs pure increments in the random walk.

If a distinguished point is found (i've to find a way to  be faster as possible) : example (AND mask beetween the first   of the 5 limbs (5*52bits)of X  and the desired DP,
 the client   send the X coordinate through a socket to the centralized server (computer with and hashtable checking continuously if a collision is found between received x coordinate and previously ones ).
I think it will  not slow significantly the computing because a client send a message to the server  only 1 on 2^DP times on average and a the stop of computing during the socket communication can be optimised to be fast.
 
full member
Activity: 1232
Merit: 242
Shooters Shoot...
March 21, 2021, 02:26:04 PM


If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.

In fact this speed is achieved with a modified version of the secp256k1 library of bitcoin core where every verification instructions have been removed for field and group function.
The library field_5*52.h  (instead of 10*26 for x86 cpu) (limbs registers) is automatically include for 64bits architecture and you've right it's maybe 10x faster.

I will study the NEON instructions to improve but it will be difficult because this benchmark doesn't include the DP detection code.
It's only a pure random walk benchmark on 1 Billion points.

 
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 01:42:39 PM


If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.

In fact this speed is achieved with a modified version of the secp256k1 library of bitcoin core where every verification instructions have been removed for field and group function.
The library field_5*52.h  (instead of 10*26 for x86 cpu) (limbs registers) is automatically include for 64bits architecture and you've right it's maybe 10x faster.

I will study the NEON instructions to improve but it will be difficult because this benchmark doesn't include the DP detection code.
It's only a pure random walk benchmark on 1 Billion points.

 
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 01:01:33 PM
The first benchmark on ARM (Raspberry PI 400 without overclocking ) are better than I expected (around 30MKeys/s with 4 cores).

That's an impressive speed, considering that Xeon E3-1230 @3.2GHz with 4c/8t only does 10MKeys/s and that the CPU inside this is:

Code:
Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz

So clock speed and thread count is halved on this silicon but yet it's 3x faster. And this makes sense because x86_64 has 16 64-bit registers and 8 128-bit SSE registers (Kangaroo doesn't use AVX yet), while armv8-a (the arch used in Cortex A72) has 64 registers and 32 more of the wider registers.

x86 was always an inefficient arch anyway because of all the backward compatibility it had to preserve. With ARM it's like "recompile for our new generation or else" and stuff written for armv8 won't work on armv7 AFAIK.

Do you know if there is a very low cost card (mini PC) with ARM CPU (similar to raspberry pi compute module 4 see link below) to do the work (in a parallelized way) and see if the ratio (computing power)/price might be better compared to the latest GPU CUDA Compatible graphics card.

A 2080Ti costs $1000 and can do about ~1100MKeys/s per my guessing. A Raspberry Pi 4 costs $35 and you can buy up 28 of those for each 2080Ti, and all those Pi 4's combined can do 30x28=840 MKeys/s which is only a little slower than GPU's speed. If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.
jr. member
Activity: 56
Merit: 26
March 21, 2021, 12:29:32 PM
Hi everybody,
I'm writing a  light beta library in C  to run Pollard kangaroo algorithm on ARM64 CPU.

I am at the very beginning of this project but i am able to do benchmark.

when the library will be functionnal i will posted it on github.

the library (a shared object file) will be callable by python with ctypes module
The herds  of kangaroos will be controlled directly with python (for convenience and ease).
The idea is to have a lot of clients running on ARM64 low cost CPU to compute parallelized secp256k1 points additions  .
The memory (DP  points of a client) will be returned with a client-server socket data transfer (server will have the main hashtable of DP ) as the Kangaroo software from Jean-Luc Pons.

The first benchmark on ARM (Raspberry PI 400 without overclocking ) are better than I expected (around 30MKeys/s with 4 cores).

What do you think about this performance ?
Do you know if there is a very low cost card (mini PC) with ARM CPU (similar to raspberry pi compute module 4 see link below)  to do the work (in a parallelized way) and see if the ratio (computing power)/price might be better compared to the latest GPU CUDA Compatible graphics card.

https://www.raspberrypi.org/products/compute-module-4/?variant=raspberry-pi-cm4001000



member
Activity: 406
Merit: 47
March 21, 2021, 06:54:39 AM

Did you do this first?
 

yes I use nano edit "Makefile"


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 06:00:59 AM
~snip

Did you do this first?

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.
member
Activity: 406
Merit: 47
March 21, 2021, 04:19:44 AM


Did you just type "make" or "make gpu=1"? (What will work is "make gpu=1 ccap=52" or 61 or 75 or whatever your CUDA compute cap is)

By the way. WSL speed will be slower than just compiling it on Windows itself, because WSL emulates system calls. The GPUs will also not be recognized unless you install the Ubuntu(?) WSL drivers from NVIDIA's download page, the Windows drivers don't work from WSL.

I use "make" only
make can compile success and can run kangaroo but can not using option -gpu
show error GPU code not compiled, use -DWITHGPU when compiling.


Code:
make gpu=1
mkdir -p obj
cd obj &&       mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


Code:
make gpu=1 ccap=52
cd obj &&       mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_52,code=sm_52 -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127

I will try on real Ubuntu on small harddisk try install


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 04:03:42 AM
I try this version
https://github.com/JeanLucPons/Kangaroo

on my linux subsystem (windows 10)

I compile with command  make success
make

and use
./kangaroo -gpu in.txt

I got message

GPU code not compiled, use -DWITHGPU when compiling.

Did you just type "make" or "make gpu=1"? (What will work is "make gpu=1 ccap=52" or 61 or 75 or whatever your CUDA compute cap is)

By the way. WSL speed will be slower than just compiling it on Windows itself, because WSL emulates system calls. The GPUs will also not be recognized unless you install the Ubuntu(?) WSL drivers from NVIDIA's download page, the Windows drivers don't work from WSL.
member
Activity: 406
Merit: 47
March 21, 2021, 12:27:45 AM
I try this version
https://github.com/JeanLucPons/Kangaroo

on my linux subsystem (windows 10)

I compile with command  make success
make

and use
./kangaroo -gpu in.txt

I got message

GPU code not compiled, use -DWITHGPU when compiling.

gpu meter not show

[GPU 0.00 MK/s]

How I can compile with gpu code?
full member
Activity: 1232
Merit: 242
Shooters Shoot...
March 21, 2021, 12:26:35 AM

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.

Can kangaroo compile with CUDA 10.1 or 11 ?
I don't have cuda 8.0

Yes
member
Activity: 406
Merit: 47
March 21, 2021, 12:11:44 AM

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.

Can kangaroo compile with CUDA 10.1 or 11 ?
I don't have cuda 8.0
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 20, 2021, 11:37:17 PM

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


How I can compile kangaroo on windows?
upper command is compile on linux right?


For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.
member
Activity: 406
Merit: 47
March 20, 2021, 12:42:12 AM

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


How I can compile kangaroo on windows?
upper command is compile on linux right?
Pages:
Jump to: