I recently converted my entire ETH farm (about 80x RTX 3080) into a puzzle-hunting farm. During this process, I explored various online tools, including KeyHuntCuda, RotorCuda, etc.
Most of these tools are based on the VanitySearch code by Jean-Luc Pons.
Since my focus is on targeting a single address (without a known public key), both sequentially and with random keys,
I decided to rewrite the ComputeKeysSEARCH_MODE_SA function from GPUCompute.h.
With just one RTX 3080, which typically achieves around 2000 Mkeys/s, I managed to reach 2500 Mkeys/s with these modifications.
Moreover, with the original version there is a significant drop in performance whenever -rkey starts a new random cycle. With this version the drop is minimal, and the speed recovers quickly when a new cycle begins.
Below is a brief summary of the changes I made:
- Additional Synchronization (__syncthreads()) to ensure consistency between threads.
- Loop Unrolling to improve loop efficiency.
- Local Caching of values (Gx, Gy, sx, sy) to reduce global memory latency.
- Combined Calculation of P + i*G and P - i*G to reduce code duplication.
- Prefetching values from global memory into local registers for better performance.
- Optimized Multiplications using __umul64hi for faster 64-bit partial products (see the sketch after this list).
- Optimized Matching Check (CHECK_HASH_SEARCH_MODE_SA) to reduce latency between calculations and checks.
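For context on the __umul64hi point above, here is a minimal sketch (my own illustration, not code taken from the kernel; the helper name mul64x64_128 is hypothetical) of the 64x64 -> 128-bit multiply that __umul64hi provides. Field routines such as _ModMult/_ModSqr are built from partial products of exactly this shape:

// Illustrative helper (hypothetical name, not from the original sources):
// one 64x64 -> 128-bit partial product, the basic step of a 256-bit field multiply.
__device__ __forceinline__ void mul64x64_128(uint64_t a, uint64_t b, uint64_t& lo, uint64_t& hi)
{
    lo = a * b;            // low 64 bits of the product
    hi = __umul64hi(a, b); // high 64 bits of the product
}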
So I wanted to share the modified ComputeKeysSEARCH_MODE_SA with all of you; the code is below.
Waiting for your feedback, cheers!
uint32_t* hash160, uint32_t maxFound, uint32_t* out)
{
uint64_t dx[GRP_SIZE / 2 + 1][4];
uint64_t px[4];
uint64_t py[4];
uint64_t pyn[4];
uint64_t sx[4];
uint64_t sy[4];
uint64_t dy[4];
uint64_t _s[4];
uint64_t _p2[4];
__syncthreads();
// Load starting key
Load256A(sx, startx);
Load256A(sy, starty);
Load256(px, sx);
Load256(py, sy);
__syncthreads();
// Fill group with delta x
uint32_t i;
for (i = 0; i < HSIZE; i++) {
ModSub256(dx[i], Gx + 4 * i, sx);
}
ModSub256(dx[i], Gx + 4 * i, sx); // For the first point
ModSub256(dx[i + 1], _2Gnx, sx); // For the next center point
__syncthreads();
// Compute modular inverse
_ModInvGrouped(dx);
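// _ModInvGrouped inverts every dx entry with a single batch (grouped) inversion,
// so only one expensive modular inverse is needed for the whole group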
// We use the fact that P + i*G and P - i*G has the same deltax, so the same inverse
// We compute key in the positive and negative way from the center of the group
__syncthreads();
// Check starting point
CHECK_HASH_SEARCH_MODE_SA(GRP_SIZE / 2);
ModNeg256(pyn, py); // pyn = -py, reused below for every P - i*G computation
__syncthreads();
// Loop unrolling and variable caching for optimization
#pragma unroll
for (i = 0; i < HSIZE; i++) {
// Pre-fetching Gx and Gy values to reduce memory latency
uint64_t gx_local[4], gy_local[4];
Load256(gx_local, Gx + 4 * i);
Load256(gy_local, Gy + 4 * i);
// Pre-fetching starting key values to reuse in both P + i*G and P - i*G calculations
uint64_t px_local[4], py_local[4];
Load256(px_local, sx);
Load256(py_local, sy);
// Compute P + i*G and P - i*G within a single iteration
uint64_t dy_local[4], _s_local[4], _p2_local[4];
for (int sign = 1; sign >= -1; sign -= 2) {
Load256(px, px_local);
Load256(py, py_local);
if (sign == 1) {
ModSub256(dy_local, gy_local, py); // P + i*G
} else {
ModSub256(dy_local, pyn, gy_local); // P - i*G
}
_ModMult(_s_local, dy_local, dx[i]); // s = (p2.y - p1.y) * inverse(p2.x - p1.x), full 256-bit field multiply
_ModSqr(_p2_local, _s_local); // _p2 = pow2(s)
ModSub256(px, _p2_local, px);
ModSub256(px, gx_local); // px = pow2(s) - p1.x - p2.x;
if (sign == 1) {
ModSub256(py, gx_local, px);
_ModMult(py, _s_local); // py = s * (p2.x - ret.x)
ModSub256(py, gy_local); // py = - p2.y - s * (ret.x - p2.x);
}
else {
ModSub256(py, px, gx_local);
_ModMult(py, _s_local); // py = s * (ret.x - p2.x)
ModSub256(py, gy_local, py); // py = Gy - s * (ret.x - p2.x) (the added point here is (Gx, -Gy))
}
if (sign == 1) {
CHECK_HASH_SEARCH_MODE_SA(GRP_SIZE / 2 + (i + 1));
} else {
CHECK_HASH_SEARCH_MODE_SA(GRP_SIZE / 2 - (i + 1));
}
}
}