Author

Topic: Is there a way to use Python GPU for ECC speed up? (Read 570 times)

newbie
Activity: 3
Merit: 168
https://github.com/iceland2k14/secp256k1 is the fastest library for Python.
hero member
Activity: 630
Merit: 731
Bitcoin g33k
@Mikorist:

Your code is faulty. In the GPU part you create only one single address; you missed the loop that generates the same number of addresses as the CPU part does Grin that's why you get such unrealistically low times for the GPU part. Modify your code to write the generated keys/addresses/WIFs/etc. to a file, one for the CPU part and one for the GPU part, and you will quickly see what's happening  Wink

Here's the revised code:

Code:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
import numpy as np
import numba
from numba import cuda, jit
from timeit import default_timer as timer
from fastecdsa import keys, curve
import secp256k1 as ice

# number of addresses to generate
num_generate=100000

# Run on CPU
def cpu(a):
  with open('bench.cpu', 'w') as f_cpu:
    for i in range(num_generate):
        prvkey_dec  = keys.gen_private_key(curve.P256)
        prvkey_hex  = "%064x" % prvkey_dec
        wifc        = ice.btc_pvk_to_wif(prvkey_hex)
        wifu        = ice.btc_pvk_to_wif(prvkey_hex, False)
        uaddr       = ice.privatekey_to_address(0, False, prvkey_dec)
        caddr       = ice.privatekey_to_address(0, True, prvkey_dec)
        f_cpu.write(f'PrivateKey Hex: {prvkey_hex}\nWIF compressed: {wifc}\nWIF uncompressed: {wifu}\nAddress uncompressed: {uaddr}\nAddress compressed: {caddr}\n\n')
        a[i]+= 1
# Run on GPU
numba.jit()
def gpu(b):
  with open('bench.gpu', 'w') as f_gpu:
    for i in range(num_generate):
        prvkey_dec      = keys.gen_private_key(curve.P256)
        prvkey_hex      = "%064x" % prvkey_dec
        wifc            = ice.btc_pvk_to_wif(prvkey_hex)
        wifu            = ice.btc_pvk_to_wif(prvkey_hex, False)
        uaddr           = ice.privatekey_to_address(0, False, prvkey_dec)
        caddr           = ice.privatekey_to_address(0, True, prvkey_dec)
        f_gpu.write(f'PrivateKey Hex: {prvkey_hex}\nWIF compressed: {wifc}\nWIF uncompressed: {wifu}\nAddress uncompressed: {uaddr}\nAddress compressed: {caddr}\n\n')
        #return b+1
if __name__=="__main__":
    a = np.ones(num_generate, dtype = np.float64)
    startCPU = timer()
    cpu(a)
    print("without GPU:", timer()-startCPU)
    b = np.ones(num_generate, dtype = np.float64)
    startGPU = timer()
    gpu(b)
    numba.cuda.profile_stop()
    print("with GPU:", timer()-startGPU)

On my system for 100,000 addresses being generated, the result is:

Quote
without GPU: 5.16345653499593
with GPU: 11.49221567499626

 Wink As I mentioned in your other thread, the GPU is not utilized at all. Check your GPU stats and you will see 0% utilization while the GPU part of that Python code is running. I did not dig into the jit part, so I cannot tell you what is needed to have this code accelerated on the GPU using CUDA.
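For reference only, here is a minimal sketch (my own, not part of the original post) of what actual GPU execution with Numba looks like: a @cuda.jit kernel compiles plain numeric code over NumPy arrays. Python objects such as the fastecdsa and ice calls cannot be compiled into such a kernel, which is why the code above never leaves the CPU.

Code:
import numpy as np
from numba import cuda

@cuda.jit
def double_all(arr):
    i = cuda.grid(1)            # global thread index
    if i < arr.size:            # guard against out-of-range threads
        arr[i] *= 2

data = np.arange(1_000_000, dtype=np.uint64)
d_data = cuda.to_device(data)   # copy the array to the GPU
threads_per_block = 256
blocks = (data.size + threads_per_block - 1) // threads_per_block
double_all[blocks, threads_per_block](d_data)
print(d_data.copy_to_host()[:5])   # [0 2 4 6 8]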
newbie
Activity: 9
Merit: 0
I had driver problems on Debian 11

Code:
sudo nano  /etc/modprobe.d/blacklist-nouveau.conf
blacklist-nouveau.conf :

Code:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Then reboot and reinstall
Code:
sudo apt -y install  nvidia-cuda-toolkit nvidia-cuda-dev nvidia-driver

It works like this. The GPU must be specified as "0" (or whichever index you want, if you have more than one) in the same command.
Code:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
import numpy as np
import numba
from numba import cuda, jit
from timeit import default_timer as timer
from fastecdsa import keys, curve
import secp256k1 as ice
# Run on CPU
def cpu(a):
    for i in range(100000):
        dec   = keys.gen_private_key(curve.P256)
        HEX   = "%064x" % dec
        wifc  = ice.btc_pvk_to_wif(HEX)
        wifu  = ice.btc_pvk_to_wif(HEX, False)
        uaddr = ice.privatekey_to_address(0, False, dec)
        caddr = ice.privatekey_to_address(0, True, dec)
        a[i]+= 1
# Run on GPU
numba.jit()
def gpu(x):
    dec   = keys.gen_private_key(curve.P256)
    HEX   = "%064x" % dec
    wifc  = ice.btc_pvk_to_wif(HEX)
    wifu  = ice.btc_pvk_to_wif(HEX, False)
    uaddr = ice.privatekey_to_address(0, False, dec)
    caddr = ice.privatekey_to_address(0, True, dec)
    return x+1
if __name__=="__main__":
    n = 100000
    a = np.ones(n, dtype = np.float64)
    start = timer()
    cpu(a)
    print("without GPU:", timer()-start)
    start = timer()
    gpu(a)
    numba.cuda.profile_stop()
    print("with GPU:", timer()-start)


Result
without GPU: 10.30411118400002
with GPU: 0.2935101880000275

 Grin


p.s.
I tried using the following decorators:

Code:
@numba.jit(target='cuda')
@numba.jit(target='gpu')
@numba.cuda.jit

It is even faster with nothing passed as a signature argument:
Code:
@numba.jit()
without GPU: 8.928111962999992
with GPU: 0.06683745000009367
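A quick way to check whether Numba can actually see the GPU (this assumes Numba was installed with CUDA support). Note that even with a device present, @numba.jit without a real CUDA kernel still runs on the CPU, so a fast "with GPU" time does not by itself prove GPU execution:

Code:
from numba import cuda

print(cuda.is_available())   # True only if a usable CUDA driver and device are found
cuda.detect()                # prints the CUDA devices Numba can see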



newbie
Activity: 9
Merit: 0
I have no idea how well secp256k1 (as ice) and fastecdsa are optimized for GPU JIT (maybe not at all), but we can test this anyway... Without experimentation there is no progress.

You have to install the CUDA Toolkit and numba for this on Linux...
Code:
conda install numba & conda install cudatoolkit
or
Code:
pip3 install numba numpy fastecdsa
(etc...)

Code:
import numpy as np 
import numba
from numba import cuda, jit
from timeit import default_timer as timer
from fastecdsa import keys, curve
import secp256k1 as ice
# Run on CPU
def cpu(a):
    for i in range(100000):
        dec   = keys.gen_private_key(curve.P256)
        HEX   = "%064x" % dec
        wifc  = ice.btc_pvk_to_wif(HEX)
        wifu  = ice.btc_pvk_to_wif(HEX, False)
        uaddr = ice.privatekey_to_address(0, False, dec)
        caddr = ice.privatekey_to_address(0, True, dec)
        a[i]+= 1
# Run on GPU
@numba.jit(forceobj=True)
def gpu(x):
    dec   = keys.gen_private_key(curve.P256)
    HEX   = "%064x" % dec
    wifc  = ice.btc_pvk_to_wif(HEX)
    wifu  = ice.btc_pvk_to_wif(HEX, False)
    uaddr = ice.privatekey_to_address(0, False, dec)
    caddr = ice.privatekey_to_address(0, True, dec)
    return x+1
if __name__=="__main__":
    n = 100000
    a = np.ones(n, dtype = np.float64)
    start = timer()
    cpu(a)
    print("without GPU:", timer()-start)
    start = timer()
    gpu(a)
    numba.cuda.profile_stop()
    print("with GPU:", timer()-start)

p.s.
It is a bad idea to start this test with the print command, so I removed it...
result

without GPU: 8.594641929998033

-----------

It throws errors here
Code:
.local/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 231, in _require_cuda_context
    with _runtime.ensure_context():
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)


 even with the option
Code:
 @numba.jit(forceobj=True)
I'll try again later... maybe it's my system and drivers, and maybe it's not.
newbie
Activity: 9
Merit: 0


I know that in C++ you could use the Boost library to get bigger integer variables. Is there a way to do the same with Python on the GPU?


There is a way, but there is a lot of reading to do before a final implementation.

https://documen.tician.de/pycuda/array.html
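As a rough illustration of the idea behind that link (the helper names here are mine, not from the PyCUDA docs): a 256-bit integer can be split into eight 32-bit limbs in a NumPy array, which is something a GPUArray can hold, and reassembled afterwards.

Code:
import numpy as np

def int_to_limbs(x, limbs=8, bits=32):
    # split x into little-endian fixed-width limbs
    mask = (1 << bits) - 1
    return np.array([(x >> (bits * i)) & mask for i in range(limbs)], dtype=np.uint32)

def limbs_to_int(a, bits=32):
    return sum(int(v) << (bits * i) for i, v in enumerate(a))

n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364140
limbs = int_to_limbs(n)
assert limbs_to_int(limbs) == n
# import pycuda.autoinit; import pycuda.gpuarray as gpuarray
# gpuarray.to_gpu(limbs)   # the limb array can then be moved to the GPU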
newbie
Activity: 9
Merit: 0
https://github.com/iceland2k14/secp256k1 is the fastest library for Python.

And fastecdsa...
https://github.com/AntonKueltz/fastecdsa

Code:
from fastecdsa import keys, curve
import secp256k1 as ice

while True:
           dec   = keys.gen_private_key(curve.P256)
           HEX   = "%064x" % dec
           wifc  = ice.btc_pvk_to_wif(HEX)
           wifu  = ice.btc_pvk_to_wif(HEX, False)
           uaddr = ice.privatekey_to_address(0, False, dec)
           caddr = ice.privatekey_to_address(0, True, dec)
           print(wifu, uaddr)

Check how fast this simple generator is....Zillions per second. Grin
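As a rough sketch (mine, using the same ice and fastecdsa calls as above), you can count keys per second without print() inside the loop slowing everything down:

Code:
import time
from fastecdsa import keys, curve
import secp256k1 as ice

count, t0 = 0, time.time()
while time.time() - t0 < 5:                    # run for roughly five seconds
    dec = keys.gen_private_key(curve.P256)
    ice.privatekey_to_address(0, True, dec)    # compressed address, same call as above
    count += 1
print(count / 5, "keys per second")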
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Thank you for your answer. How do I use import secp256k1 as ice in my code, bro? secp256k1 is faster. I need fast code.

Thanks

In Python you can just use the built-in pow() function, which supports modular inverses (a negative exponent together with a modulus) since 3.8. It is implemented in native C, so it's fast.

For older versions of Python there is an extended Euclidean algorithm you can copy/paste from here, but you shouldn't need that, as all modern distros ship at least Python 3.8.
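A minimal example of that built-in modular inverse (Python 3.8+), using the curve order N that appears elsewhere in this thread:

Code:
N = 115792089237316195423570985008687907852837564279074904382605163141518161494337
a = 123456789
inv_a = pow(a, -1, N)          # modular inverse of a modulo N
assert (a * inv_a) % N == 1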
member
Activity: 873
Merit: 22
$$P2P BTC BRUTE.JOIN NOW ! https://uclck.me/SQPJk
If you want to play with numbers in the ring modulo N:
N = 115792089237316195423570985008687907852837564279074904382605163141518161494337

def modinv(a,n):
    # modular (multiplicative) inverse of a modulo n via the extended Euclidean algorithm
    lm = 1
    hm = 0
    low = a%n
    high = n
    while low > 1:
        ratio = high//low
        nm = hm-lm*ratio
        new = high-low*ratio
        high = low
        low = new
        hm = lm
        lm = nm
    return lm % n

def inv(a):
    # additive inverse: -a mod N
    return N - a

def add(a,b):
    return (a + b) % N

def sub(a,b):
    return (a + inv(b)) % N

def mul(a,b):
    return (a * b) % N

def div(a,b):
    # "division" is multiplication by the modular inverse
    return (a * modinv(b,N)) % N
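A quick sanity check you can paste right after those definitions to see that the helpers behave as expected:

Code:
a, b = 12345, 67890
assert add(a, inv(a)) == 0              # a plus its additive inverse is 0 mod N
assert div(1, 3) == modinv(3, N)        # dividing 1 by b is just b's inverse
assert mul(div(a, b), b) == a % N       # divide then multiply back gives a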

In your case, just write a C file like this:
#include

int add_int(int, int);
float add_float(float, float);
void print_sc(char *ptr);

int add_int(int num1, int num2){
    return num1 + num2;
}

and convert your function to C.
If you are on Windows:
gcc -c -DBUILD_DLL your_filename.c
gcc -shared -o your_filename.dll your_filename.o
That will get you a DLL library.
Then just use ctypes to load and use it.
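A minimal sketch of that last ctypes step, assuming the shared library built with the gcc commands above is named your_filename.dll (or your_filename.so on Linux):

Code:
import ctypes

lib = ctypes.CDLL("./your_filename.dll")          # "./your_filename.so" on Linux
lib.add_int.argtypes = [ctypes.c_int, ctypes.c_int]
lib.add_int.restype = ctypes.c_int
print(lib.add_int(2, 3))                          # prints 5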

Thank you for your answer. How do I use import secp256k1 as ice in my code, bro? secp256k1 is faster. I need fast code.

Thanks
member
Activity: 873
Merit: 22
$$P2P BTC BRUTE.JOIN NOW ! https://uclck.me/SQPJk
In the settings file, the first line is the max range value, then comes the min range value, then the divide value; they are changed and saved while the program runs. You can use any bloom filter; the main thing is to prepare the bloom filter beforehand so it can be loaded fast. It is just an example to show what I mean.


Hi, how do I convert this code to C and the secp256k1 lib?

divnum is a multiplication by the modular inverse:

Code:
from random import randint

N =    115792089237316195423570985008687907852837564279074904382605163141518161494337

def inv(v): return pow(v, N-2, N)
def divnum(a, b): return ( (a * inv(b) ) % N )

i=0
# input 2^120 = 0x9fd24b3abe244d6c443df56fa494dc

input = 0x5f87 +1

delta = 12

gamma = 2

d1= 80

while i < 2**61:
    d= (divnum(input,delta))
    s = divnum(i,gamma) %N
    result = divnum(d,s)
   
    if result == 0:
        print("result",hex(result),"i",hex(i),"input",hex(input))
       
    i = i +1


?

Thank you.
full member
Activity: 431
Merit: 105
I would really like to try and make it work. Could you explain a little where the settings file and bloom filter can be found?



from datetime import datetime
import secp256k1 as ice
from bloomfilter import *

bloomfile = 'bloomfile_shiftdown.bf'
settingsFile = "divide_search_point_add.txt"



thanks a lot
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
If you brute-force a bitcoin range, for example, one way of speeding it up is not to use scalar multiplication at all. When you go through the range, just use a start scalar and a start point; increase the scalar and increase the start point by point addition. Simple as that.

I don't think brute-force would be particularly effective in the first place, as private keys that are close to each other in value are so rare that they would have to be artificially created rather than generated by an RNG.

Pseudo-random bits could help, i.e. bit 0 is generated from a simple mod sequence, bits 1 and 2 from a different sequence, and so on...
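Regardless, for anyone curious what the quoted point-addition trick looks like in practice, here is a small sketch using fastecdsa (the start value is arbitrary): one scalar multiplication for the starting point, then a single point addition per step through the range.

Code:
from fastecdsa.curve import secp256k1

G = secp256k1.G
start = 0x5f87
P = G * start                  # the only scalar multiplication
k = start
for _ in range(5):
    # here P == k*G; derive whatever you need from P instead of recomputing k*G
    P = P + G                  # step to (k+1)*G with one point addition
    k += 1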
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
I don't know of any ready-to-use 256-bit number numpy libraries, but it is possible to create one using 64- or 32-bit numbers for the math operations.
You cannot just speed up individual operations like point multiplication by using a GPU, because a single CUDA core is much slower than a CPU core. You need to divide the full computing work into many independent tasks that run in parallel in order to get the performance gain.

So what you are suggesting is that it is just easier to use C++ with Boost, implement a multithreading approach, and have everything run on the CPU?

I mean, it could be a better way to think about it, since I also want to port this to laptops or other devices that don't have either a GPU or the NVIDIA toolkit installed.

Why the need for Boost? I wouldn't use any of its libraries inside performance-intensive loops, but for things like Program Options that run only once it's fine.

Boost is known to trade speed for a cleaner interface.
newbie
Activity: 28
Merit: 84
I don't know of any ready-to-use 256-bit number numpy libraries, but it is possible to create one using 64- or 32-bit numbers for the math operations.
You cannot just speed up individual operations like point multiplication by using a GPU, because a single CUDA core is much slower than a CPU core. You need to divide the full computing work into many independent tasks that run in parallel in order to get the performance gain.

So what you are suggesting is that it is just easier to use C++ with Boost, implement a multithreading approach, and have everything run on the CPU?

I mean, it could be a better way to think about it, since I also want to port this to laptops or other devices that don't have either a GPU or the NVIDIA toolkit installed.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
I don't know of any ready-to-use 256-bit number numpy libraries, but it is possible to create one using 64- or 32-bit numbers for the math operations.
You cannot just speed up individual operations like point multiplication by using a GPU, because a single CUDA core is much slower than a CPU core. You need to divide the full computing work into many independent tasks that run in parallel in order to get the performance gain.

I guess you could try using an algorithm that computes multiple point multiplications at once, incrementally, not using threads or CUDA cores. This will save you time as long as you only batch-multiply as many points as it takes to do (according to the paper) 5 serial ECmults.
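I can't say which paper is meant, but one common way such batching works is Montgomery batch inversion: many field inversions are replaced by one inversion plus a few extra multiplications, which is what makes handling a batch of affine points cheaper than handling them one by one. A generic sketch, not necessarily the algorithm from that paper:

Code:
P = 2**256 - 2**32 - 977                 # secp256k1 field prime

def batch_inverse(values, p=P):
    # one modular inversion for the whole batch instead of one per value
    prefix = [1]
    for v in values:
        prefix.append(prefix[-1] * v % p)
    inv_all = pow(prefix[-1], -1, p)     # the single inversion (Python 3.8+)
    out = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        out[i] = inv_all * prefix[i] % p
        inv_all = inv_all * values[i] % p
    return out

vals = [3, 5, 7]
assert batch_inverse(vals) == [pow(v, -1, P) for v in vals]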
member
Activity: 110
Merit: 61
I don't know of any ready-to-use 256-bit number numpy libraries, but it is possible to create one using 64- or 32-bit numbers for the math operations.
You cannot just speed up individual operations like point multiplication by using a GPU, because a single CUDA core is much slower than a CPU core. You need to divide the full computing work into many independent tasks that run in parallel in order to get the performance gain.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
You have to make your own "fixed-width" number class that represents numbers in base-2 notation if you want to implement some kind of support for 256-bit data.

I have a C++ (not Python) fixed-width class, but it's in base 10, sorry.
newbie
Activity: 28
Merit: 84
Hello. I recently got interested in CUDA technology to speed up existing Python secp256k1 code. I installed the NVIDIA CUDA 10.2 toolkit and it seems to be running OK, to an extent.

I've put together a small script for PyCUDA that just doubles an integer:
Code:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

import numpy
#Working with integers
a = 126
a = numpy.int64(a)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
  __global__ void doublify(int *a)
  {
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
  }
  """)
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)

However, it can't work with big numbers (256-bit size). When passing, for example:
Code:
a = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364140

There's an error:
Code:
OverflowError: int too big to convert

Is there a way to use big integers in PyCUDA, CuPy, or some other GPU implementation of Python?
I stumbled on this Stack Overflow post
https://stackoverflow.com/questions/68215238/numpy-256-bit-computing
but didn't understand anything in it.

I know that in C++ you could use the Boost library to get bigger integer variables. Is there a way to do the same with Python on the GPU?

Also, does it even make sense to use Python GPU solutions, since the main calculations are done in the "SourceModule" kernel, which has to be coded in C/C++ anyway?

Maybe it is just easier to recode the existing Python code in C++ with the Boost library and later add CUDA GPU support?