Hello everybody,
over the last few days I have been trying various approaches in Python to generate random private keys (hex) and the corresponding Bitcoin addresses (optionally uncompressed, compressed, bech32, or mixed) as efficiently as possible and with good performance. There are several different approaches. I am actually very satisfied with the "bit" library by Ofek, because it is the fastest I have experienced so far. However, if you also want to generate bech32 addresses, you have to do some manual extra work and the speed advantage of bit is then lost. So, as a very good and satisfying compromise, I find the combination of fastecdsa and iceland2k14/secp256k1 very fast.
EDIT: Note 1: I didn't know that the Python library iceland2k14/secp256k1 is written natively in C++ and merely wrapped for use in Python. This library is also closed source, which means the source code is not published. So far no security vulnerabilities have become known in this regard, but the point still deserves attention. Several tools in the Bitcoin community use this library for fast calculations.
Note 2: As I progressed, I found that the Python library fastecdsa is oversized for simply generating random private keys and is not optimal in terms of performance. The secure Python library "secrets", on the other hand, is both cryptographically strong and fast at choosing randomness; using it results in over 32 % better performance.
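To illustrate what I mean by using "secrets" instead of fastecdsa for the key generation, here is a minimal sketch (the constant is the secp256k1 group order n and the function name is just my own example):
import secrets

# order n of the secp256k1 group; valid private keys lie in the range [1, n-1]
SECP256K1_N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def gen_private_key_dec():
    # cryptographically secure, uniformly random integer in [1, n-1]
    return 1 + secrets.randbelow(SECP256K1_N - 1)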
In several of the Python programs I am testing, I have integrated a performance counter which shows the total number of addresses and the calculated rate (addresses/sec) while the program is running. However, I have taken this out in these examples to keep the code as simple and clear as possible. For ease of comparison, I create 1 million randomly generated keys from which the bech32 Bitcoin address (prefix 'bc1q') is then calculated. I write the output to a file so I can check it afterwards and make sure everything went correctly. I "time" the Python program to see how long it took to create the 1 million addresses. Let's get into it...
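For reference, the counter I took out is nothing more than something along these lines (a rough sketch only, with illustrative variable names; it is not part of the timed examples below):
import time

num_addresses = 1000000
counted = 0
t_start = time.perf_counter()

for i in range(num_addresses):
    # ... generate one key/address here ...
    counted += 1
    if counted % 100000 == 0:
        rate = counted / (time.perf_counter() - t_start)
        print(f'{counted} addresses, {rate:,.0f} addresses/sec')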
Here is an example which generates bech32 addresses:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
from fastecdsa import keys, curve
import secp256k1 as ice
# Set the number of addresses to generate
num_addresses = 1000000
# Open a file for writing
with open('addresses_1M.out', 'w') as f:
    # Generate and write each address to the file
    for i in range(num_addresses):
        # random private key on Bitcoin's curve (secp256k1)
        prvkey_dec = keys.gen_private_key(curve.secp256k1)
        # addr_type 2 = bech32 ('bc1q'), True = compressed public key
        addr = ice.privatekey_to_address(2, True, prvkey_dec)
        f.write(f'{addr}\n')
It took 82 seconds to generate those 1 million addresses. That is a rate of about 12,195 addresses/sec, which is way too slow.
real 1m22,192s
user 1m21,461s
sys 0m0,640s
I had often read about NumPy, and from what I could find about it on the internet, it sounded interesting. So I tried to rewrite the code to use NumPy and compare the results.
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
import numpy as np
import fastecdsa.keys as fkeys
import fastecdsa.curve as fcurve
import secp256k1 as ice
# how many addresses to generate
num_addresses = 1000000
# Generate a NumPy array of random private keys using fastecdsa
private_keys = np.array([fkeys.gen_private_key(fcurve.secp256k1) for _ in range(num_addresses)])
# Use secp256k1 to convert the private keys to addresses
addresses = np.array([ice.privatekey_to_address(2, True, dec) for dec in private_keys])
# Write the addresses to a file
np.savetxt('addresses_numpy.out', addresses, fmt='%s')
real 1m19,636s
user 1m18,826s
sys 0m1,027s
As you can easily see, this did not bring any speed advantage, if you disregard the 1-2 second difference. That makes sense: the keys are still generated one by one in a Python list comprehension, so NumPy merely stores the results and does not vectorize the actual elliptic-curve math.
However, what did cause a significant speed boost was distributing the work across several threads to enable parallel processing. As I didn't see any disadvantage in using NumPy, I kept the NumPy part in the code. Here is the extended Python code:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
import threading
import numpy as np
import fastecdsa.keys as fkeys
import fastecdsa.curve as fcurve
import secp256k1 as ice
# Set the number of addresses to generate and the number of threads to use
num_addresses = 1000000
num_threads = 16
# Divide the addresses evenly among the threads
addresses_per_thread = num_addresses // num_threads
# Create a list to store the generated addresses
addresses = []
# Define a worker function that generates a batch of addresses and stores them in the shared list
def worker(start, end):
    # Generate a NumPy array of random private keys using fastecdsa
    private_keys = np.array([fkeys.gen_private_key(fcurve.secp256k1) for _ in range(start, end)])
    # Use secp256k1 to convert the private keys to bech32 addresses (2 = bech32, True = compressed)
    thread_addresses = np.array([ice.privatekey_to_address(2, True, dec) for dec in private_keys])
    # Add the addresses to the shared list
    addresses.extend(thread_addresses)
# Create a list of threads
threads = []
# Create and start the threads
for i in range(num_threads):
    start = i * addresses_per_thread
    end = (i+1) * addresses_per_thread
    t = threading.Thread(target=worker, args=(start, end))
    threads.append(t)
    t.start()
# Wait for the threads to finish
for t in threads:
    t.join()
# Write the addresses to a file
np.savetxt('addresses_1M_multithreaded.txt', addresses, fmt='%s')
My system has a total of 16 threads available, so I distributed the load across all of them for this test. Now look at this:
$ time python3 create_1M_addresses_using_fastecdsa_and_ice_numpy_multithreaded.py
real 0m19,764s
user 0m38,147s
sys 0m6,367s
It took only 20 seconds to generate 1 million addresses. That is equal to a rate of 50,000 addresses/sec. Much better, but I think it's still too slow.
The Linux tool "htop" showed me that the 16 available threads on my machine were not all being utilized, and none of them at 100%. So the multithreading part seemed to work, but not as I expected; Python's GIL prevents CPU-bound Python code from running on more than one core at a time within a single process. So I looked for a better way to distribute the load and found what I was looking for:
#!/usr/bin/env python3
# 2022/Dec/26, citb0in
import concurrent.futures
import os
import numpy as np
import fastecdsa.keys as fkeys
import fastecdsa.curve as fcurve
import secp256k1 as ice
# Set the number of addresses to generate
num_addresses = 1000000
# Define a worker function that generates a batch of addresses and returns them
def worker(start, end):
    # Generate a NumPy array of random private keys using fastecdsa
    private_keys = np.array([fkeys.gen_private_key(fcurve.secp256k1) for _ in range(start, end)])
    # Use secp256k1 to convert the private keys to bech32 addresses (2 = bech32, True = compressed)
    thread_addresses = np.array([ice.privatekey_to_address(2, True, dec) for dec in private_keys])
    return thread_addresses
# Use a ProcessPoolExecutor to generate the addresses in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Divide the addresses evenly among the available CPU cores
    addresses_per_core = num_addresses // os.cpu_count()
    # Submit a task for each batch of addresses to the executor
    tasks = []
    for i in range(os.cpu_count()):
        start = i * addresses_per_core
        end = (i+1) * addresses_per_core
        tasks.append(executor.submit(worker, start, end))
    # Wait for the tasks to complete and retrieve the results
    addresses = []
    for task in concurrent.futures.as_completed(tasks):
        addresses.extend(task.result())
# Write the addresses to a file
np.savetxt('addresses_1M_with_ProcessPool.txt', addresses, fmt='%s')
real 0m4,768s
user 0m54,444s
sys 0m1,894s
Perfect! Now I'm under 5 seconds for 1 million addresses. That's a rate of roughly 208,000 addresses/sec. How can we further improve the performance without using a GPU? Is there anything else you can suggest that will speed up the process?
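One idea I will try myself, following Note 2 above, is to replace fastecdsa with the "secrets" module inside the ProcessPoolExecutor worker. Here is an untested sketch of that idea (the constant N is the secp256k1 group order and the output file name is just a placeholder):
import concurrent.futures
import os
import secrets
import secp256k1 as ice

num_addresses = 1000000

# order n of the secp256k1 group; valid private keys lie in the range [1, n-1]
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def worker(count):
    # generate 'count' random private keys with secrets and convert them to bech32 addresses
    return [ice.privatekey_to_address(2, True, 1 + secrets.randbelow(N - 1))
            for _ in range(count)]

with concurrent.futures.ProcessPoolExecutor() as executor:
    # one evenly sized batch per CPU core
    per_core = num_addresses // os.cpu_count()
    tasks = [executor.submit(worker, per_core) for _ in range(os.cpu_count())]
    with open('addresses_1M_secrets_pool.txt', 'w') as f:
        for task in concurrent.futures.as_completed(tasks):
            f.write('\n'.join(task.result()) + '\n')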
Let's make it a contest-like game: create a simple piece of code like this and try to beat the current high score of 4.7 seconds for 1 million bech32 addresses without using a GPU.
The next step would then be ==> how can we shift this code onto the GPU for ultra-fast performance?