How much constant memory do you use for the multiplication and for the addition?
32 jumps are 2 kB for the x and y coordinates + 1 kB for their private keys (32 * 256 bit = 1 kB) + what else?
I use the following setting to prefer L1 cache, since shared memory is not used:
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
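For completeness, a minimal host-side sketch (just an illustration, not code from the repository) that sets the preference and reads back what the runtime actually selected:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // Ask the runtime to favor L1 cache over shared memory,
  // since the kernels do not use shared memory.
  cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
  if (err != cudaSuccess) {
    printf("cudaDeviceSetCacheConfig: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // Read the current preference back; the driver may ignore the hint
  // on devices with a fixed L1/shared split.
  cudaFuncCache cfg;
  cudaDeviceGetCacheConfig(&cfg);
  printf("cache config: %d (2 = cudaFuncCachePreferL1)\n", (int)cfg);
  return 0;
}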
In constant mem:
__device__ __constant__ uint64_t _0[] = { 0ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _1[] = { 1ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _P[] = { 0xFFFFFFFEFFFFFC2FULL,0xFFFFFFFFFFFFFFFFULL,0xFFFFFFFFFFFFFFFFULL,0xFFFFFFFFFFFFFFFFULL,0ULL }; // field prime p of secp256k1
__device__ __constant__ uint64_t MM64 = 0xD838091DD2253531ULL; // -P^-1 mod 2^64 (lowest 64-bit limb), used for Montgomery multiplication
__device__ __constant__ uint64_t _O[] = { 0xBFD25E8CD0364141ULL,0xBAAEDCE6AF48A03BULL,0xFFFFFFFFFFFFFFFEULL,0xFFFFFFFFFFFFFFFFULL }; // group order n of secp256k1
__device__ __constant__ uint64_t jD[NB_JUMP][4];   // jump distances
__device__ __constant__ uint64_t jPx[NB_JUMP][4];  // jump point x coordinates
__device__ __constant__ uint64_t jPy[NB_JUMP][4];  // jump point y coordinates
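To answer the question above, here is a rough footprint count. The standalone program below is only an illustration on my side; NB_JUMP = 32 is assumed from the "32 jumps" in the question:

#include <stdio.h>
#include <stdint.h>

#define NB_JUMP 32  // assumed value for this estimate

int main(void) {
  size_t consts = 3 * 5 * sizeof(uint64_t)   // _0, _1, _P: 5 limbs each
                + sizeof(uint64_t)           // MM64
                + 4 * sizeof(uint64_t);      // _O: 4 limbs
  size_t jumps = 3 * NB_JUMP * 4 * sizeof(uint64_t); // jD, jPx, jPy: 256 bits per entry
  printf("field constants: %zu bytes\n", consts);      // 160 bytes
  printf("jump tables    : %zu bytes\n", jumps);        // 3072 bytes (1 KB per table)
  printf("total          : %zu of 65536 bytes\n", consts + jumps);
  return 0;
}

So with 32 jumps the jump tables take about 3 kB (2 kB for jPx/jPy, 1 kB for jD), plus roughly 160 bytes of field constants, well under the 64 KB of constant memory.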
I will definitely reduce jD to 128 bits in the next release; the less constant memory used, the better. 64 KB are available, but keeping usage low is best for the L1 cache.
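As a sketch of what that change could look like (a hypothetical declaration, not the released code), the distances would drop to 2 limbs per jump:

// Hypothetical 128-bit jump distances: 2 x 64-bit limbs per entry,
// i.e. NB_JUMP * 16 bytes instead of NB_JUMP * 32 bytes
// (512 bytes instead of 1 kB with 32 jumps).
__device__ __constant__ uint64_t jD128[NB_JUMP][2];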
Yes, this is true for small ranges (as written in the README).
Thanks
Yes, we haven't launched the run yet; we will make our choice in the days to come. A small DP also increases the memory needed.
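As a back-of-the-envelope model (all numbers below are assumptions for illustration, not the values chosen for the run): the kangaroo method needs about 2*sqrt(N) group operations for a range of width N, and roughly one point is stored every 2^dpBits operations, so each bit removed from the DP mask roughly doubles the number of stored points.

#include <stdio.h>
#include <math.h>

int main(void) {
  double rangeBits = 115;  // assumed interval width: 2^115
  double dpBits    = 28;   // assumed distinguished-point mask size
  double entrySize = 40;   // assumed bytes per stored DP (x fragment + distance + type)

  double ops = 2.0 * pow(2.0, rangeBits / 2.0); // ~2*sqrt(N) expected operations
  double dps = ops / pow(2.0, dpBits);          // expected stored distinguished points
  printf("expected DPs : %.3g\n", dps);
  printf("approx memory: %.1f GB\n", dps * entrySize / 1e9);
  return 0;
}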
Right.
~2^20 for the last 2 runs; we will see for #120.