@JeanLuc
How much constant memory do you use for the multiplication and for the addition?
32 jumps are 16 kbit (2 kB) for the x and y coordinates + 8 kbit (1 kB) for their private keys (32 * 256 bit = 8 kbit) + what else?
I use the following setting to prefer L1 cache, since shared memory is not used:
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
In constant mem:
__device__ __constant__ uint64_t _0[] = { 0ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _1[] = { 1ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _P[] = { 0xFFFFFFFEFFFFFC2F,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0ULL };
__device__ __constant__ uint64_t MM64 = 0xD838091DD2253531; // 64bits lsb negative inverse of P (mod 2^64)
__device__ __constant__ uint64_t _O[] = { 0xBFD25E8CD0364141ULL,0xBAAEDCE6AF48A03BULL,0xFFFFFFFFFFFFFFFEULL,0xFFFFFFFFFFFFFFFFULL };
__device__ __constant__ uint64_t jD[NB_JUMP][4];
__device__ __constant__ uint64_t jPx[NB_JUMP][4];
__device__ __constant__ uint64_t jPy[NB_JUMP][4];
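The comment on MM64 can be checked on the host. The sketch below is not taken from the Kangaroo sources: `negInverse64` is a hypothetical helper that recomputes the negative inverse of the low limb of P modulo 2^64 with a Newton iteration, so the result can be compared with the constant above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Kangaroo): returns -p0^-1 mod 2^64 for odd p0.
// All arithmetic is on uint64_t, so it wraps modulo 2^64 automatically.
std::uint64_t negInverse64(std::uint64_t p0) {
  // For any odd p0, p0 * p0 == 1 (mod 8), so p0 is its own inverse to 3 bits.
  std::uint64_t inv = p0;
  // Each Newton step doubles the number of correct low bits:
  // 3 -> 6 -> 12 -> 24 -> 48 -> 96, so 5 steps cover 64 bits.
  for (int i = 0; i < 5; ++i) {
    inv *= 2 - p0 * inv;
  }
  return 0ULL - inv;  // negate modulo 2^64
}
```

Called with the low limb of _P (0xFFFFFFFEFFFFFC2F), this should return the MM64 value listed above, which is the quantity a Montgomery-style reduction needs.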
I will definitely reduce jD to 128 bits in the next release. Less constant memory usage is better: 64 KB is available, but for the L1 cache, the lower the better.
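To put numbers on the question, here is a rough host-side tally of the symbols listed above. It is only a sketch under assumptions: NB_JUMP = 32 is taken from the "32 jumps" in the question, _O is counted with the four limbs visible in the listing, and `constantBytes` is a hypothetical helper, not something from the Kangaroo sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 32 jumps, as stated in the question.
constexpr std::size_t NB_JUMP = 32;

// Hypothetical helper: bytes of constant memory used by the listed symbols
// when each jD entry is stored in jdLimbs 64-bit limbs (4 = 256 bits, 2 = 128 bits).
constexpr std::size_t constantBytes(std::size_t jdLimbs) {
  return 3 * 5 * sizeof(std::uint64_t)                // _0, _1, _P: 5 limbs each
       + sizeof(std::uint64_t)                        // MM64
       + 4 * sizeof(std::uint64_t)                    // _O: the 4 limbs shown (assumption)
       + NB_JUMP * jdLimbs * sizeof(std::uint64_t)    // jD
       + 2 * NB_JUMP * 4 * sizeof(std::uint64_t);     // jPx and jPy, 256 bits each
}
```

Under these assumptions `constantBytes(4)` gives 3232 bytes and `constantBytes(2)` gives 2720 bytes, so moving jD to 128 bits would save 512 bytes. Both figures are far below the 64 KB limit, which fits the point that the concern is cache pressure rather than capacity.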
In that case the best choice for solving keys is to use the CPU.
Yes, this is true for small ranges (as written in the README).
Great work!
Thanks
For #120, if you and Zielar use 2^30 kangaroos, you need to use a DP < 28.
Yes, we haven't launched the run yet; we will make our choice in the coming days. A small DP also increases the memory needed.
If you reduce k, you reduce the speed, then you have to reduce theta (DP).
Right.
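The trade-off discussed above can be sketched with two rules of thumb, both of which are assumptions here rather than figures from this thread: about 2*sqrt(N) group operations for an interval of size N, and a distinguished-point overhead of roughly k * 2^DP operations for k kangaroos. The 119-bit interval for #120 is also an assumption, and the function names are made up for illustration. Everything is kept in log2 to avoid overflow.

```cpp
#include <cassert>

// log2 of the expected total operations, assuming ~2*sqrt(N) with N = 2^rangeBits.
double log2ExpectedOps(int rangeBits) {
  return 1.0 + rangeBits / 2.0;
}

// log2 of the DP overhead, assuming ~k * 2^dpBits extra operations for k kangaroos.
double log2DpOverhead(double log2Kangaroos, int dpBits) {
  return log2Kangaroos + dpBits;
}

// log2 of the expected number of stored distinguished points:
// one point out of every 2^dpBits visited is kept.
double log2StoredDps(int rangeBits, int dpBits) {
  return log2ExpectedOps(rangeBits) - dpBits;
}
```

For a 119-bit interval with 2^30 kangaroos and DP = 28, this gives about 2^60.5 expected operations, an overhead near 2^58, and around 2^32.5 stored points. Lowering DP by one bit halves the overhead but doubles the number of points to store, which is the memory cost mentioned in the reply.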
How many kangaroos run in parallel on a single V100? At which speed?
~2^20 for the last 2 runs; we will see for #120.