
Topic: Pollard's kangaroo ECDLP solver - page 67. (Read 59389 times)

full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 21, 2021, 03:05:54 PM
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?

Exactly, but I don't think it will decrease the performance dramatically.

The way I imagine the system, the clients (lots of ARM CPUs) will be dedicated only to running wild or tame kangaroos and performing pure increments of the random walk.

If a distinguished point is found (I have to make this check as fast as possible; for example, an AND mask between the first of the 5 limbs (5*52 bits) of X and the desired DP mask),
the client sends the X coordinate through a socket to the centralized server (a computer with a hashtable, continuously checking whether a received x coordinate collides with a previously stored one).
I don't think it will slow the computation significantly, because a client sends a message to the server only 1 time in 2^DP on average, and the pause in computation during the socket communication can be optimized to be short.
 

I don't know your setup so I'm just asking questions.

So each tame or wild thread must jump, check for a DP; if it's a DP, send it to the hashtable and then jump again; if not, jump again; rinse and repeat.

In your test, you merely walked through 1 billion random points, correct?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 02:56:45 PM
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?

Exactly, but I don't think it will decrease the performance dramatically.

The way I imagine the system, the clients (lots of ARM CPUs) will be dedicated only to running wild or tame kangaroos and performing pure increments of the random walk.

If a distinguished point is found (I have to make this check as fast as possible; for example, an AND mask between the first of the 5 limbs (5*52 bits) of X and the desired DP mask),
the client sends the X coordinate through a socket to the centralized server (a computer with a hashtable, continuously checking whether a received x coordinate collides with a previously stored one).
I don't think it will slow the computation significantly, because a client sends a message to the server only 1 time in 2^DP on average, and the pause in computation during the socket communication can be optimized to be short.
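A minimal sketch of that DP test (my own illustration with hypothetical names, assuming the x coordinate is held as 5 x 52-bit limbs with n[0] the least significant, and a TCP socket to the server already open):

Code:
#include <stdint.h>
#include <unistd.h>   /* write() */

#define DP_BITS 24                                  /* example DP size   */
#define DP_MASK ((UINT64_C(1) << DP_BITS) - 1)

typedef struct { uint64_t n[5]; } fe52;             /* x, 5*52-bit limbs */

/* One AND plus one compare on the first limb: the point is
   "distinguished" when the lowest DP_BITS bits of x are all zero. */
static inline int is_distinguished(const fe52 *x) {
    return (x->n[0] & DP_MASK) == 0;
}

/* Ship the raw limbs (plus the kangaroo's distance, not shown here) to
   the central server; fires only ~1 jump in 2^DP_BITS on average. */
static void report_dp(int sock_fd, const fe52 *x) {
    write(sock_fd, x->n, sizeof x->n);
}

With DP_BITS = 24, a client doing 30 MKeys/s would report only a couple of points per second, so the socket overhead should stay negligible.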
 
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 21, 2021, 02:26:04 PM


If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.

In fact, this speed is achieved with a modified version of the secp256k1 library from Bitcoin Core, where all the verification instructions have been removed from the field and group functions.
The field_5x52.h implementation (5 limbs of 52 bits each, instead of the 10x26 version used on 32-bit x86 CPUs) is automatically included for 64-bit architectures, and you're right, it's maybe 10x faster.

I will study the NEON instructions to improve it, but it will be difficult; note also that this benchmark doesn't include the DP detection code.
It's only a pure random-walk benchmark over 1 billion points.

 
So your speed (30MKey/s) does not include actually searching for/finding and writing/storing distinguished points to RAM/file?
jr. member
Activity: 56
Merit: 26
March 21, 2021, 01:42:39 PM


If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.

In fact, this speed is achieved with a modified version of the secp256k1 library from Bitcoin Core, where all the verification instructions have been removed from the field and group functions.
The field_5x52.h implementation (5 limbs of 52 bits each, instead of the 10x26 version used on 32-bit x86 CPUs) is automatically included for 64-bit architectures, and you're right, it's maybe 10x faster.
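For reference, a rough sketch of what that layout looks like (my own illustration; the real library uses its own type names): the 256-bit field element is split across five 64-bit words carrying 52 significant bits each, so limb products fit comfortably in 128-bit intermediates, while the 32-bit fallback needs ten 26-bit limbs.

Code:
#include <stdint.h>

/* x = n[0] + n[1]*2^52 + n[2]*2^104 + n[3]*2^156 + n[4]*2^208
   (the top limb carries the remaining bits; the spare bits per limb let
   several additions accumulate before carries are propagated). */
typedef struct { uint64_t n[5]; } fe_5x52;   /* 64-bit build    */

typedef struct { uint32_t n[10]; } fe_10x26; /* 32-bit fallback */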

I will study the NEON instructions to improve it, but it will be difficult; note also that this benchmark doesn't include the DP detection code.
It's only a pure random-walk benchmark over 1 billion points.

 
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 01:01:33 PM
The first benchmarks on ARM (Raspberry Pi 400 without overclocking) are better than I expected (around 30 MKeys/s with 4 cores).

That's an impressive speed, considering that a Xeon E3-1230 @ 3.2GHz with 4c/8t only does 10 MKeys/s and that the CPU inside this is:

Code:
Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz

So clock speed and thread count are halved on this silicon, yet it's 3x faster. And this makes sense because x86_64 has 16 general-purpose 64-bit registers and 16 128-bit SSE registers (Kangaroo doesn't use AVX yet), while armv8-a (the arch used in the Cortex-A72) has 31 general-purpose 64-bit registers plus 32 128-bit NEON registers.

x86 was always an inefficient arch anyway because of all the backward compatibility it had to preserve. With ARM it's like "recompile for our new generation or else" and stuff written for armv8 won't work on armv7 AFAIK.

Do you know of a very low-cost board (mini PC) with an ARM CPU (similar to the Raspberry Pi Compute Module 4, see link below) to do the work (in a parallelized way), to see whether the (computing power)/price ratio might be better than that of the latest CUDA-compatible graphics cards?

A 2080 Ti costs $1000 and can do about ~1100 MKeys/s by my guess. A Raspberry Pi 4 costs $35, so you can buy about 28 of those for each 2080 Ti, and all those Pi 4s combined can do 30x28 = 840 MKeys/s, which is only a little slower than the GPU's speed. If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.
jr. member
Activity: 56
Merit: 26
March 21, 2021, 12:29:32 PM
Hi everybody,
I'm writing a light beta library in C to run the Pollard kangaroo algorithm on ARM64 CPUs.

I am at the very beginning of this project, but I am able to run benchmarks.

When the library is functional, I will post it on GitHub.

The library (a shared object file) will be callable from Python with the ctypes module.
The herds of kangaroos will be controlled directly from Python (for convenience and ease).
The idea is to have a lot of clients running on low-cost ARM64 CPUs computing parallelized secp256k1 point additions.
The memory (a client's DP points) will be returned via a client-server socket data transfer (the server holds the main DP hashtable), as in the Kangaroo software from Jean-Luc Pons.
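As a sketch of what that interface could look like (hypothetical names; the real library isn't published yet), the shared object would export plain C entry points that Python loads with ctypes.CDLL:

Code:
#include <stdint.h>

/* Hypothetical exported API for the .so; on the Python side something
   like lib = ctypes.CDLL("./libkangaroo.so") would load it. */

/* Run a herd of tame/wild kangaroos over [range_start, range_end] for
   the given compressed public key, reporting DPs with dp_bits trailing
   zero bits to the server at host:port. */
int kangaroo_run(const uint8_t pubkey33[33],
                 const uint8_t range_start[32],
                 const uint8_t range_end[32],
                 int dp_bits,
                 const char *server_host,
                 uint16_t server_port);

/* Total jumps performed so far, for the Python-side speed display. */
uint64_t kangaroo_jump_count(void);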

The first benchmarks on ARM (Raspberry Pi 400 without overclocking) are better than I expected (around 30 MKeys/s with 4 cores).

What do you think about this performance?
Do you know of a very low-cost board (mini PC) with an ARM CPU (similar to the Raspberry Pi Compute Module 4, see link below) to do the work (in a parallelized way), to see whether the (computing power)/price ratio might be better than that of the latest CUDA-compatible graphics cards?

https://www.raspberrypi.org/products/compute-module-4/?variant=raspberry-pi-cm4001000



member
Activity: 406
Merit: 47
March 21, 2021, 06:54:39 AM

Did you do this first?
 

Yes, I used nano to edit the Makefile.


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 06:00:59 AM
~snip

Did you do this first?

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.
member
Activity: 406
Merit: 47
March 21, 2021, 04:19:44 AM


Did you just type "make" or "make gpu=1"? (What will work is "make gpu=1 ccap=52" or 61 or 75 or whatever your CUDA compute cap is)

By the way, WSL speed will be slower than just compiling it on Windows itself, because WSL emulates system calls. The GPUs will also not be recognized unless you install the Ubuntu(?) WSL drivers from NVIDIA's download page; the Windows drivers don't work from WSL.

I used "make" only.
make compiles successfully and kangaroo runs, but I cannot use the -gpu option;
it shows the error: GPU code not compiled, use -DWITHGPU when compiling.


Code:
make gpu=1
mkdir -p obj
cd obj &&       mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


Code:
make gpu=1 ccap=52
cd obj &&       mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_52,code=sm_52 -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127

I will try installing on real Ubuntu on a small hard disk.


legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 21, 2021, 04:03:42 AM
I tried this version:
https://github.com/JeanLucPons/Kangaroo

on my Linux subsystem (Windows 10).

I compiled successfully with the command
make

and ran
./kangaroo -gpu in.txt

I got the message:

GPU code not compiled, use -DWITHGPU when compiling.

Did you just type "make" or "make gpu=1"? (What will work is "make gpu=1 ccap=52" or 61 or 75 or whatever your CUDA compute cap is)

By the way, WSL speed will be slower than just compiling it on Windows itself, because WSL emulates system calls. The GPUs will also not be recognized unless you install the Ubuntu(?) WSL drivers from NVIDIA's download page; the Windows drivers don't work from WSL.
member
Activity: 406
Merit: 47
March 21, 2021, 12:27:45 AM
I tried this version:
https://github.com/JeanLucPons/Kangaroo

on my Linux subsystem (Windows 10).

I compiled successfully with the command
make

and ran
./kangaroo -gpu in.txt

I got the message:

GPU code not compiled, use -DWITHGPU when compiling.

The GPU meter shows nothing:

[GPU 0.00 MK/s]

How can I compile with the GPU code?
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 21, 2021, 12:26:35 AM

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.

Can Kangaroo be compiled with CUDA 10.1 or 11?
I don't have CUDA 8.0.

Yes
member
Activity: 406
Merit: 47
March 21, 2021, 12:11:44 AM

For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.

Can Kangaroo be compiled with CUDA 10.1 or 11?
I don't have CUDA 8.0.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 20, 2021, 11:37:17 PM

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


How can I compile Kangaroo on Windows?
The command above is for compiling on Linux, right?


For some reason nobody updated the CUDA path so you need to edit the Makefile yourself and find the line that says

Code:
CUDA = /usr/local/cuda-8.0

And replace it with

Code:
CUDA = /usr/local/cuda

The CUDA installer creates a symlink at /usr/local/cuda to whatever the actual installed CUDA version is.
member
Activity: 406
Merit: 47
March 20, 2021, 12:42:12 AM

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127


How can I compile Kangaroo on Windows?
The command above is for compiling on Linux, right?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 19, 2021, 04:30:27 AM
Confirmed that my custom Kangaroo mods function properly with 125-bit, 160-bit and 256-bit ranges on CPU; I will keep you guys updated as I fix up GPU/networking/merge.
We appreciate it.

Thank you.

Seems like it is more difficult than what I had thought.  JLP said it was an easy 'mod' to take it to 256 bits.

It's not that; I'm just working under shitty environmental conditions, e.g. we just had a major power outage yesterday that lasted the entire day, and the internet's so $@&#ing slow here  Angry This was supposed to be a one-week gig.

I *feel* that it's almost done but when we apply DevOps Borat's Law "To estimate project duration we apply Celsius to Fahrenheit formula. C is internal estimate and F is what we tell PM: C x 9/5+ 32 = F days." I estimate I am about 75% of the way there.  Undecided

Silly project management.

I was thinking one just needed to store/save the points and distances in a 256-bit format versus the current 128-bit one.  The program I use is much easier... it's 256-bit and I can just change the size limit to store however many characters are in the point and distance rows.

The problem is that there are three different representations of big (>64-bit) numbers in the Kangaroo program: a fixed-width 4-element array used in the CUDA solver, the int128_t struct you showed me earlier, and the Int class which is artificially masked to 125 bits, and all of these occurrences have to be expanded or otherwise un-crippled.

And a bunch of unrelated stuff is shoved into the distance Int variables, so all of that has to either be moved somewhere else or otherwise phased out (hence my deterministic hashtable index function using an XOR of all the 64-bit parts, because apparently that index used to live in bits64[2] of the distance variable!).

The kangaroo type also had to be moved out into a 32-bit variable. The sign bit was removed completely; it isn't really needed because the Int arithmetic already reduces negative numbers from arithmetic overflow modulo the modulus.
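A minimal sketch of that kind of index function (my own reconstruction, not the actual patch), assuming the x coordinate is held as four 64-bit words:

Code:
#include <stdint.h>

#define HASH_BITS 18                             /* e.g. 2^18 buckets */

typedef struct { uint64_t bits64[4]; } u256;     /* 256-bit x value   */

/* Deterministic bucket index: XOR all four 64-bit parts together,
   then fold the result down to the table width. */
static inline uint32_t hash_index(const u256 *x) {
    uint64_t h = x->bits64[0] ^ x->bits64[1] ^ x->bits64[2] ^ x->bits64[3];
    return (uint32_t)(h ^ (h >> 32)) & ((1u << HASH_BITS) - 1);
}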

For this kangaroo program... for the tames, the distance is a private key and the point is that private key's pubkey, so the program already knows the full point/pubkey (it has already calculated it), but the pubkey was choked down to 32 characters to save RAM/file storage space... it was just a matter of storing the full pubkey (64 chars) and padding the private key/distance with zeros to equal 64 characters.

May I ask which custom program you are referring to?



Things like this are a step backwards; why does it assume my CUDA lives in cuda-8.0/?

Code:
# make gpu=1
cd obj &&mkdir -p SECPK1
cd obj && mkdir -p GPU
/usr/local/cuda-8.0/bin/nvcc -maxrregcount=0 --ptxas-options=-v --compile --compiler-options -fPIC -ccbin /usr/bin/g++-4.8 -m64 -O2 -I/usr/local/cuda-8.0/include -gencode=arch=compute_,code=sm_ -o obj/GPU/GPUEngine.o -c GPU/GPUEngine.cu
make: /usr/local/cuda-8.0/bin/nvcc: Command not found
make: *** [Makefile:75: obj/GPU/GPUEngine.o] Error 127
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 19, 2021, 01:19:44 AM
Confirmed that my custom Kangaroo mods function properly with 125-bit, 160-bit and 256-bit ranges on CPU; I will keep you guys updated as I fix up GPU/networking/merge.
We appreciate it.

Seems like it is more difficult than what I had thought.  JLP said it was an easy 'mod' to take it to 256 bits.

I was thinking one just needed to store/save the points and distances in a 256-bit format versus the current 128-bit one.  The program I use is much easier... it's 256-bit and I can just change the size limit to store however many characters are in the point and distance rows.

For this kangaroo program... for the tames, the distance is a private key and the point is that private key's pubkey, so the program already knows the full point/pubkey (it has already calculated it), but the pubkey was choked down to 32 characters to save RAM/file storage space... it was just a matter of storing the full pubkey (64 chars) and padding the private key/distance with zeros to equal 64 characters.
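A rough illustration of that padding (hypothetical helper; it simply assumes the distance is kept as four 64-bit words, most significant first):

Code:
#include <stdio.h>
#include <stdint.h>

/* Write a 256-bit distance as a zero-padded 64-character hex string
   into out[65], so it lines up with the 64-character pubkey x value. */
static void distance_to_hex64(const uint64_t d[4], char out[65]) {
    snprintf(out, 65, "%016llx%016llx%016llx%016llx",
             (unsigned long long)d[0], (unsigned long long)d[1],
             (unsigned long long)d[2], (unsigned long long)d[3]);
}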
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
March 19, 2021, 12:57:39 AM
Confirmed that my custom Kangaroo mods function properly with 125-bit, 160-bit and 256-bit ranges on CPU; I will keep you guys updated as I fix up GPU/networking/merge.
member
Activity: 406
Merit: 47
March 18, 2021, 11:51:34 PM

Like I said, the program will give you an estimated RAM usage.  If you are using the save option, that is how much storage space you will need.

Another way to look at it: if you are trying the full range of #120, you will need to perform roughly 2 x 2^60 = 2^61 group operations. Now take the -d you are using (or may want to use) and subtract it from that exponent of 61. So if you are using -d 30, you will need to save about 2^31 points; if you are using -d 10, about 2^51 points. The lower your -d, the more storage space/RAM (if not using the save option) you will need.

Thank you
I will try again tonight
full member
Activity: 1162
Merit: 237
Shooters Shoot...
March 18, 2021, 10:27:24 PM

Option -d
(the distinguished point (DP) method):
does that mean the DP method is the long distance between the two legs, as in this image?
https://raw.githubusercontent.com/JeanLucPons/Kangaroo/master/DOC/paths.jpg
Do I understand correctly?
If I use a high distinguished point setting, how long do the kangaroos have to jump before a collision is hit?
How can I make it work with low resources? When I try the full range it uses a lot of storage; saving the work may make it slow.
Now I think maybe not saving the work could possibly help it run faster.


Like I said, the program will give you an estimated RAM usage.  If you are using the save option, that is how much storage space you will need.

Another way to look at it: if you are trying the full range of #120, you will need to perform roughly 2 x 2^60 = 2^61 group operations. Now take the -d you are using (or may want to use) and subtract it from that exponent of 61. So if you are using -d 30, you will need to save about 2^31 points; if you are using -d 10, about 2^51 points. The lower your -d, the more storage space/RAM (if not using the save option) you will need.
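A small sketch of that arithmetic (my own numbers: the 2^61 total-operation figure from above and an assumed entry size; the real program prints its own estimate): the expected number of stored DPs is roughly 2^(61 - d), so the table size scales directly with -d.

Code:
#include <stdio.h>
#include <math.h>

int main(void) {
    const double total_ops_log2 = 61.0;  /* ~2 * 2^60 operations        */
    const double entry_bytes    = 48.0;  /* assumed bytes per stored DP */

    for (int d = 10; d <= 30; d += 10) {
        double points_log2 = total_ops_log2 - d;        /* 2^(61-d) DPs */
        double gib = pow(2.0, points_log2) * entry_bytes
                     / (1024.0 * 1024.0 * 1024.0);
        printf("-d %2d -> ~2^%.0f points, ~%.0f GiB at %.0f B per entry\n",
               d, points_log2, gib, entry_bytes);
    }
    return 0;
}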