
Topic: the fastest possible way to mass-generate addresses in Python - page 3.

hero member
Activity: 630
Merit: 731
Bitcoin g33k
next step is to shift the load to the GPU with PyCUDA, Numba, JIT or similar. To use the PyCUDA wrapper we need to modify the code to offload the address generation and private key conversion tasks to the GPU. This can be done by defining GPU kernels that perform these tasks and then using PyCUDA's DeviceAllocation and kernel functions to run these kernels on the GPU. Unfortunately I was unsuccessful with that task.
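For illustration, here is a minimal sketch of the general PyCUDA workflow (compile a kernel with SourceModule, let cuda.In/cuda.Out handle the host/device transfers, launch over a grid of threads). The kernel below only fills a buffer with sequential key candidates as a placeholder; the actual secp256k1 and hashing math would have to be written in CUDA C inside the kernel, and the name fill_keys is my own:
Code:
#!/usr/bin/env python3
# Minimal PyCUDA workflow sketch -- placeholder kernel, not real key/address math
import numpy as np
import pycuda.autoinit                  # creates a CUDA context on the default GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Toy kernel: each thread writes one candidate derived from its global index.
# Real address generation would need secp256k1 point math plus SHA-256/RIPEMD-160
# implemented in CUDA C here.
mod = SourceModule("""
__global__ void fill_keys(unsigned long long *out, unsigned long long base, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = base + idx;   // e.g. sequential key candidates
}
""")
fill_keys = mod.get_function("fill_keys")

n = 1_000_000
out = np.zeros(n, dtype=np.uint64)

threads = 256
blocks = (n + threads - 1) // threads
fill_keys(cuda.Out(out), np.uint64(1), np.int32(n),
          block=(threads, 1, 1), grid=(blocks, 1))

print(out[:5])   # [1 2 3 4 5]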
hero member
Activity: 882
Merit: 5818
not your keys, not your coins!
Here is an example which generated bech32 addresses:
~snip~
It took 82 seconds to generate those 1 million addresses. That is a rate of about 12,195 addresses/sec, which is way too slow.

Quote
real   1m22,192s
user   1m21,461s
sys   0m0,640s
I was curious to try out a Rust implementation and took the first thing I could find on the web when looking for 'Rust Bitcoin address generator'.
So far, very disappointing. I added a loop of 1000 iterations to this, and some disk I/O instead of the print statements: https://github.com/mrgian/bitcoin-address-generator

It takes around 5s for 1000 addresses, 538.78s for 100k addresses. Sure, single-threaded, but your initial Python implementation was single-threaded too and just took 82s for 1 million addresses.
As ETFbitcoin mentioned, those Python libraries you use are probably heavily optimized already.
legendary
Activity: 2856
Merit: 7410
Crypto Swap Exchange
Let's make a contest-like game. Create a simple code like this and try to beat the high score of currently 4.7 seconds for 1 million bech32 addresses without using GPU.

Since fastecdsa already utilizes C for the heavy operations, I doubt we can see a major improvement without either
1. Modifying some parts of fastecdsa to use assembly, or
2. Rewriting everything in C/C++/Rust and assembly.
hero member
Activity: 630
Merit: 731
Bitcoin g33k
I think you misunderstood. This is not meant to be a project that I commission here and want someone to program for me free of charge. On the contrary, it is about learning and understanding, so that the knowledge can later be used in various projects. It's not about me, but about building a kind of knowledge base from which the whole community benefits. I read the most diverse technical questions in various posts here, often on one and the same topic, or people try to reinvent the wheel. There are a few who are very good and experienced at developing solid code, but that knowledge is shared sparingly, at least that is my impression. Honestly, I am a newbie in this matter who is trying to get into it and, above all, to understand it. I don't have a specific goal in mind, so it's not really about finding vanity addresses or cracking puzzles. There are already solutions for that, and I can't build on them if the solid basic knowledge is missing. I want to learn, and maybe some day I can find ways to optimize, or help others accomplish this task. The road is the goal.

The example I showed is the basic building block of all conceivable tools used in this area. If Ofek Lev had used an existing library and made do with it, there would be no "bit" library. But he thought about how to optimize, improve, or shortcut things to make the whole process more efficient. That's what it's all about. I see no disadvantage in sharing such knowledge here; I see clear advantages. If we worked together to optimize and extend such basic calculations, like creating an address, a key, or derivations of them, everybody would benefit because it is shared publicly. This base can be used in every Bitcoin tool out there. So far, unfortunately, I have not found any explanations or examples here that illustrate how such calculations would be implemented using CuPy, Numba or JIT, hence the origin of this thread. I would like to optimize Python code that makes use of the GPU and make the calculations as fast as possible.

If you or anyone else can contribute something factual, I would greatly appreciate it on behalf of the forum community and all future readers here. But describing the whole thing as pointless from the ground up helps neither me, nor you, nor anyone else.
legendary
Activity: 3444
Merit: 10537
I think I understand what you mean. There is no prize money, so I'd better rewrite my previous sentence:
Let's make a contest-like game. Create a simple code like this and try to beat the high score of currently 4.7 seconds for 1 million bech32 addresses without using GPU.
A prize is an incentive, not a purpose. A purpose could be something like a tool that generates vanity addresses, so you'd need the fastest possible algorithm and implementation capable of checking a very large number of addresses per second. Or the purpose could be a recovery tool where you have 8-9 words out of a 12-word seed phrase, so you need to check millions of addresses generated from it at different derivation paths to find the right combination.

The importance of "purpose" is that it also clarifies HOW to implement it. Right now, if I understood the code correctly, you are just aimlessly writing random addresses to disk, so there is not much to work with. But if the example I used is the purpose (i.e. vanity addresses), there is no need for I/O and its bottleneck, hence the algorithm changes; there is also no need for Base58 or Bech32 encoding, and no need for a full ECMult for each permutation (see the sketch below). These very simple changes lead to a significant increase in speed.
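To illustrate the "no full ECMult for each permutation" point, here is a rough sketch of my own (not code from this thread): after one initial scalar multiplication, every subsequent public key is obtained by a single point addition of G using fastecdsa, and matching is done on the raw hash160 bytes instead of an encoded address string. The target prefix here is purely hypothetical.
Code:
#!/usr/bin/env python3
# Sketch: incremental vanity-style search without a full EC multiplication per key.
import hashlib
from fastecdsa.curve import secp256k1

G = secp256k1.G
k = 1                  # illustrative starting private key
P = G * k              # one full scalar multiplication up front

def hash160_compressed(point):
  # hash160 of the compressed public key (needs OpenSSL with ripemd160 support)
  prefix = b'\x02' if point.y % 2 == 0 else b'\x03'
  pub = prefix + point.x.to_bytes(32, 'big')
  return hashlib.new('ripemd160', hashlib.sha256(pub).digest()).digest()

# hypothetical target on the hash160 bytes, not on the bech32 string itself
target_prefix = bytes.fromhex('0000')

for _ in range(1_000_000):
  if hash160_compressed(P).startswith(target_prefix):
    print('found private key:', hex(k))
    break
  k += 1
  P = P + G            # next public key: one point addition, no ECMult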
hero member
Activity: 630
Merit: 731
Bitcoin g33k
From what I know iceland library has no malicious code in it.
One of my teachers used to say: "Thinking is not knowing" :)

Since it is possible to put a payload into functions.
...
Or use a firewall to see all network activity.
why would the author iceland2k14 then hide the source code by not disclosing it? Especially if he still uses source code from AlbertoBSD and Jean-Luc-Ponds? Apart from the fact that this would be illegal as far as I know, since the tools mentioned use GNU licenses, why would iceland2k14 hide the code if there was nothing to hide? I think this is an interesting point that ymgve2 has made here. I hadn't given this library any serious thought before, but he's right in a way. Since we as users of this closed-source library don't know whether malicious functions are being run in the background or data is being exfiltrated, I see that as a danger as well. I will therefore refrain from using this library until I am convinced otherwise. Thanks to both of you for this important advice.

Python can never be a good choice for fast code. It is good for quickly trying something out. Real-deal programs use C/C++ and CUDA.

I understand that while Python code executed on the GPU using these libraries can be very fast, it may not be as fast as native C++ code running on the GPU, because the Python interpreter introduces some overhead and the libraries themselves may add some as well. However, for many applications the speedup provided by running on the GPU more than makes up for any overhead introduced by Python, and can still result in significantly faster calculations than would be possible using only the CPU. That is why I think it is reasonable to get such calculations running on the GPU; it should be possible with Python.

Python can be used for fast calculations with CUDA by using libraries that provide Python bindings for CUDA. These libraries allow you to write code in Python that is executed on the GPU, which can provide significant speedups for certain types of calculations. Examples of such libraries are, as already mentioned, PyCUDA and Numba. PyCUDA allows writing Python code that is executed on the GPU using CUDA; it provides a Pythonic interface to the CUDA programming model, making it easy to use for Python developers. Numba is a just-in-time (JIT) compiler for Python that can automatically optimize Python code for execution on the GPU; it lets you write GPU-accelerated code using familiar Python syntax and can provide significant performance improvements for certain types of calculations. Unfortunately I wasn't able to use those two libraries successfully until now, so hopefully someone else can give some insight and assistance here. Sample code is provided in my first post.
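Here is a rough sketch of the basic Numba CUDA pattern (assuming a CUDA-capable GPU and the numba package). The kernel is only a placeholder that writes sequential key candidates; the actual secp256k1 math is exactly the part I have not managed to implement:
Code:
#!/usr/bin/env python3
# Minimal Numba CUDA sketch -- placeholder kernel, not real key/address math
import numpy as np
from numba import cuda

@cuda.jit
def fill_candidates(out, base):
  idx = cuda.grid(1)            # global thread index
  if idx < out.size:
    out[idx] = base + idx       # e.g. sequential key candidates

n = 1_000_000
out = np.zeros(n, dtype=np.int64)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block

# Numba transfers the host array to the GPU, runs the kernel, and copies it back
fill_candidates[blocks, threads_per_block](out, 1)

print(out[:5])   # [1 2 3 4 5]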

hero member
Activity: 630
Merit: 731
Bitcoin g33k
Let's make a contest-like game. Create a simple code like this and try to beat the high score of currently 4.7 seconds for 1 million bech32 addresses without using GPU.
It has to have a purpose, otherwise it is a complete waste of time, even if someone could optimize the code to generate 1 billion addresses in 5 seconds.

I think I understand what you mean. There is no prize money, so I better rewrite my previous sentence:
Let's make a contest-like game. Create a simple code like this and try to beat the high score of currently 4.7 seconds for 1 million bech32 addresses without using GPU.

I don't see it as a waste if someone makes helpful and constructive posts in the forum, because not only does the current forum community benefit from this, but also those who will read here in the future. I see it as an enrichment for the forum. We could change the rules to say that the use of the GPU via CUDA is allowed. I myself have tried various approaches so far using PyCUDA and Numba, but I was always unsuccessful. So if someone can do this simple task with CUDA and present the result, please feel free to post it here. I think everyone will benefit from it.

I see that you're using gen_private_key. This generates arbitrary secure random private keys, using urandom. urandom might be slow depending on your system - do you really need it?
Honestly, I don't care whether urandom or random is used; I just want to quickly generate random numbers to serve as private keys. Can you suggest a faster approach to replace this task?

Quote
I also see that iceland2k14/secp256k1 does not provide the source for its libraries which is a bit concerning.
What concerns do you have in detail? Can you explain, please?
legendary
Activity: 3444
Merit: 10537
Let's make a contest-like game. Create a simple code like this and try to beat the high score of currently 4.7 seconds for 1 million bech32 addresses without using GPU.
It has to have a purpose, otherwise it is a complete waste of time, even if someone could optimize the code to generate 1 billion addresses in 5 seconds.
full member
Activity: 161
Merit: 230
I see that you're using gen_private_key. This generates arbitrary secure random private keys, using urandom. urandom might be slow depending on your system - do you really need it? I assume whatever you're doing (brute force search?) will generate the private keys in some other fashion, so you can just set prvkey_dec to an integer representing the private key directly.
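For example, a minimal sketch of that idea (the starting point is purely hypothetical, and it reuses the ice.privatekey_to_address call from your original code with a simple sequential counter instead of gen_private_key):
Code:
# Sketch: supply the private keys directly instead of drawing them from urandom
import secp256k1 as ice

start = 0x1000                      # hypothetical start of the search range
with open('addresses_sequential.out', 'w') as f:
  for prvkey_dec in range(start, start + 1000):
    addr = ice.privatekey_to_address(2, True, prvkey_dec)  # same call as in the original code
    f.write(f'{addr}\n')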

I also see that iceland2k14/secp256k1 does not provide the source for its libraries which is a bit concerning.
hero member
Activity: 630
Merit: 731
Bitcoin g33k
Hello everybody,

in the last days I have been trying various approaches in Python to generate random private keys (hex) and the corresponding Bitcoin addresses (optionally uncompressed, compressed, bech32 or mixed) as efficiently as possible and with good performance. There are several different approaches. I am actually very satisfied with the "bit" library from Ofek, because it is the fastest I have experienced so far. However, if you also want to generate bech32 addresses, you have to do some manual extra work and the speed advantage of bit is then lost. As a very good and satisfying middle ground, I find the combination of fastecdsa and iceland2k14/secp256k1 very fast.

EDIT:
Note1:
I didn't know that the Python library iceland2k14/secp256k1 is written natively in C++ and merely wrapped for use in Python. Also, this library is closed source, which means the source code is not available. So far, no security vulnerabilities have become known in this regard, but this point still deserves attention. There are some tools in the Bitcoin community that use this library for fast calculations.

Note2:
As I progressed, I found that the Python library fastecdsa is oversized for the job of generating random private keys and is not optimal in terms of performance. The standard Python library "secrets", on the other hand, is secure when choosing randomness and also fast; it results in over +32 % better performance.
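A minimal sketch of that replacement (not the exact code I benchmarked; the constant below is the secp256k1 group order n, and valid private keys lie in [1, n-1]):
Code:
import secrets

# Order n of the secp256k1 group -- private keys must be in [1, n-1]
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def random_private_key() -> int:
  # secrets.randbelow(N - 1) yields 0 .. N-2, so +1 shifts it to 1 .. N-1
  return secrets.randbelow(N - 1) + 1

prvkey_dec = random_private_key()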


In several of the Python programs I am testing, I have integrated a performance counter which shows the total number of addresses and the calculated rate (addresses/sec) while the program is running. However, I have taken this out in these examples to keep the code as simple and clear as possible. For ease of comparison, I create 1 million randomly generated keys from which the Bech32 Bitcoin address (prefix 'bc1q') is then calculated. I write the output to a file so I can check it afterwards to make sure everything went correctly and the data is valid. I "time" the Python program to see how long it took to create the 1 million addresses. Let's get into it...

Here is an example which generated bech32 addresses:
Code:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
from fastecdsa import keys, curve
import secp256k1 as ice

# Set the number of addresses to generate
num_addresses = 1000000

# Open a file for writing
with open('addresses_1M.out', 'w') as f:
  # Generate and write each address to the file
  for i in range(num_addresses):
    prvkey_dec   = keys.gen_private_key(curve.P256)
    addr = ice.privatekey_to_address(2, True, prvkey_dec)
    f.write(f'{addr}\n')

It took 82 seconds to generate those 1 million addresses. That is a rate of about 12,195 addresses/sec, which is way too slow.

Quote
real   1m22,192s
user   1m21,461s
sys   0m0,640s

I often read about numpy and from what I could find on the internet about it, it sounded interesting. So I tried to rewrite the code to use numpy and compare the results.

Code:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
import numpy as np
import fastecdsa.keys as fkeys
import fastecdsa.curve as fcurve
import secp256k1 as ice

# how many addresses to generate
num_addresses = 1000000

# Generate a NumPy array of random private keys using fastecdsa
private_keys = np.array([fkeys.gen_private_key(fcurve.P256) for _ in range(num_addresses)])

# Use secp256k1 to convert the private keys to addresses
addresses = np.array([ice.privatekey_to_address(2, True, dec) for dec in private_keys])

# Write the addresses to a file
np.savetxt('addresses_numpy.out', addresses, fmt='%s')

Quote
real   1m19,636s
user   1m18,826s
sys   0m1,027s

As you can easily see, this did not bring any speed advantage, if you disregard the 1-2 second difference.

However, what caused a significant speed boost was the attempt to distribute the work across several threads and thus enable parallel processing. Since I didn't see any disadvantage in using numpy, I kept the numpy part in the code. Here is the extended Python code:
Code:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
import threading
import numpy as np
import fastecdsa.keys as fkeys
import fastecdsa.curve as fcurve
import secp256k1 as ice

# Set the number of addresses to generate and the number of threads to use
num_addresses = 1000000
num_threads = 16

# Divide the addresses evenly among the threads
addresses_per_thread = num_addresses // num_threads

# Create a list to store the generated addresses
addresses = []

# Define a worker function that generates a batch of addresses and stores them in the shared list
def worker(start, end):
  # Generate a NumPy array of random private keys using fastecdsa
  private_keys = np.array([fkeys.gen_private_key(fcurve.P256) for _ in range(start, end)])

  # Use secp256k1 to convert the private keys to addresses
  thread_addresses = np.array([ice.privatekey_to_address(2, True, dec) for dec in private_keys])

  # Add the addresses to the shared list
  addresses.extend(thread_addresses)

# Create a list of threads
threads = []

# Create and start the threads
for i in range(num_threads):
  start = i * addresses_per_thread
  end = (i+1) * addresses_per_thread
  t = threading.Thread(target=worker, args=(start, end))
  threads.append(t)
  t.start()

# Wait for the threads to finish
for t in threads:
  t.join()

# Write the addresses to a file
np.savetxt('addresses_1M_multithreaded.txt', addresses, fmt='%s')

My system has a total of 16 threads available, so I distributed the load across all threads for that test. Now look at this:
Code:
$ time python3 create_1M_addresses_using_fastecdsa_and_ice_numpy_multithreaded.py

Quote
real   0m19,764s
user   0m38,147s
sys   0m6,367s

It took only 20 seconds to generate 1 million addresses. That is equal to a rate of 50,000 addresses/sec. Much better, but I think it's still too slow.

Using the Linux tool "htop" I realized that the 16 available threads on my machine were not all being utilized, and not at 100%. So the multithreading part seemed to work, but not as I expected (most likely because CPython's GIL lets only one thread execute Python bytecode at a time, so CPU-bound work cannot be fully parallelized with threads alone). So I looked for more ways to properly distribute the load and found what I was looking for:

Code:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
import concurrent.futures
import os
import numpy as np
import fastecdsa.keys as fkeys
import fastecdsa.curve as fcurve
import secp256k1 as ice

# Set the number of addresses to generate
num_addresses = 1000000

# Define a worker function that generates a batch of addresses and returns them
def worker(start, end):
  # Generate a NumPy array of random private keys using fastecdsa
  private_keys = np.array([fkeys.gen_private_key(fcurve.P256) for _ in range(start, end)])

  # Use secp256k1 to convert the private keys to addresses
  thread_addresses = np.array([ice.privatekey_to_address(2, True, dec) for dec in private_keys])

  return thread_addresses

# Use a ProcessPoolExecutor to generate the addresses in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
  # Divide the addresses evenly among the available CPU cores
  addresses_per_core = num_addresses // os.cpu_count()

  # Submit a task for each batch of addresses to the executor
  tasks = []
  for i in range(os.cpu_count()):
    start = i * addresses_per_core
    end = (i+1) * addresses_per_core
    tasks.append(executor.submit(worker, start, end))

  # Wait for the tasks to complete and retrieve the results
  addresses = []
  for task in concurrent.futures.as_completed(tasks):
    addresses.extend(task.result())

# Write the addresses to a file
np.savetxt('addresses_1M_with_ProcessPool.txt', addresses, fmt='%s')

Quote
real   0m4,768s
user   0m54,444s
sys   0m1,894s

Perfect! Now I'm under 5 seconds for 1 million addresses. That's a rate of roughly 208,000 addresses/sec.
How can we further improve the performance without using the GPU? Is there anything else you can suggest that will speed up the process?
Let's make a contest-like game. Create a simple code like this and try to beat the high score of currently 4.7 seconds for 1 million bech32 addresses without using GPU.

The next step would be ==> how can we shift this code to the GPU for ultra-fast performance?