How much constant memory do you use for the multiplication and for the addition?
32 jumps are 16kB for x and y-coordinate + 8 kB for their private keys (32 * 256bit = 8kB) + what else?
I use the following setting to prefer L1 cache as shared mem is not used.
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
In constant mem:
__device__ __constant__ uint64_t _0[] = { 0ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _1[] = { 1ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _P[] = { 0xFFFFFFFEFFFFFC2F,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0ULL };
__device__ __constant__ uint64_t MM64 = 0xD838091DD2253531; // 64bits lsb negative inverse of P (mod 2^64)
__device__ __constant__ uint64_t _O[] = { 0xBFD25E8CD0364141ULL,0xBAAEDCE6AF48A03BULL,0xFFFFFFFFFFFFFFFEULL,0xFFFFFFFFFFFFFFFFULL
__device__ __constant__ uint64_t jD[NB_JUMP][4];
__device__ __constant__ uint64_t jPx[NB_JUMP][4];
__device__ __constant__ uint64_t jPy[NB_JUMP][4];
I will definitely reduce jD to 128 bits in the next release, the less constant mem usage is better, there is 64Kb available but for L1 cache the lowest is the best.
128 bit * 32 = 4kB saved, good.
If you accept to break the compatibility with the #115 search, you can save another 1kB picking as jumps points with the first 32 bits of the x-coordinate = 0; you have many of them in the file of the old DPs.