The new version is slower on my PC (132 MKeys/s versus 162 MKeys/s).
On my Windows machine, performance is the same as with the previous release (CUDA 10).
It is slightly slower on Linux (CUDA 8.0): from 39.5 MK/s to 37.9 MK/s.
Anyway,
Do you compile the source yourself or do you use the Linux binaries?
Did you solve your problem? I haven't managed to reproduce the issue yet.
I compile the source myself. No, my problem is not solved. I only have CUDA 8.0.
Some ideas for a (possibly small) speed improvement:
1) In __device__ void ComputeKeys (GPUCompute.h), instead of doing this HSIZE times:
ModNeg256(dy,Gy[i]); <--
ModSub256(dy, py);
you could do:
ModSub256(dy, pyn, Gy[i]);
and compute pyn only once:
2) instead of
To sum up:
ModSub256(dy, pyn, Gy[i]);
_ModMult(_s, dy, dx[i]); // s = (p2.y-p1.y)*inverse(p2.x-p1.x)
_ModMult(_p2, _s, _s); // _p2 = pow2(s)
ModSub256(px, _p2, px);
ModSub256(px, Gx[i]); // px = pow2(s) - p1.x - p2.x;
ModSub256(py, sx, px);
_ModMult(py, _s); // py = - s*(ret.x-p2.x)
ModSub256(py, sy); // py = - p2.y - s*(ret.x-p2.x);
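To illustrate idea 1, here is a minimal host-side sketch (plain C++ rather than CUDA, so it can be checked without a GPU). It assumes pyn is meant to be -py mod p, which is my reading of the snippet above, and it uses a small toy modulus instead of the real 256-bit prime. The names modNeg, modSub, dyOriginal, dySuggested and patternsAgree are made up for this example and are not functions from the project.

```cpp
#include <cassert>
#include <cstdint>

// Toy modulus standing in for the secp256k1 prime (assumption, illustration only).
constexpr uint64_t P = 1000003;

// Returns -a mod P, for a in [0, P).
uint64_t modNeg(uint64_t a) { return a == 0 ? 0 : P - a; }

// Returns a - b mod P, for a and b in [0, P).
uint64_t modSub(uint64_t a, uint64_t b) { return a >= b ? a - b : a + P - b; }

// Original pattern: one negation plus one subtraction per table entry.
uint64_t dyOriginal(uint64_t py, uint64_t gy) {
    uint64_t dy = modNeg(gy);  // dy = -Gy[i]
    return modSub(dy, py);     // dy = -Gy[i] - py
}

// Suggested pattern: a single subtraction per entry, with pyn = -py precomputed.
uint64_t dySuggested(uint64_t pyn, uint64_t gy) {
    return modSub(pyn, gy);    // dy = pyn - Gy[i]
}

// Returns true when both patterns give the same dy for every entry of a small table.
bool patternsAgree(uint64_t py) {
    assert(py < P);
    const uint64_t G[] = {0, 1, 2, 500000, 1000002};
    const uint64_t pyn = modNeg(py);  // computed once, outside the loop
    for (uint64_t gy : G) {
        if (dyOriginal(py, gy) != dySuggested(pyn, gy)) return false;
    }
    return true;
}
```

If that reading is right, the loop body loses one modular negation per iteration, at the cost of a single extra negation before the loop.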
3) In __device__ void ModSub256, instead of:
if ((int64_t)t < 0) {
UADDO1(r[0], _P[0]);
UADDC1(r[1], _P[1]);
UADDC1(r[2], _P[2]);
UADD1(r[3], _P[3]);
}
something like this would be better:
if ((int64_t)t < 0) {
USUBO1(r[0], 0x01000003d1);
USUBC1(r[1], 0ULL);
USUBC1(r[2], 0ULL);
USUBC1(r[3], 0ULL);
}
(I'm not sure what the C means; I suppose it means "with carry".)
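For what it's worth, idea 3 relies on the fact that adding P modulo 2^256 is the same as subtracting 2^256 - P = 0x1000003D1, because the secp256k1 prime is P = 2^256 - 2^32 - 977. Below is a minimal host-side C++ sketch that checks this equivalence on four 64-bit limbs. It only shows that the two corrections give the same result and says nothing about which one is faster on a GPU. U256, addWrap, subSmallWrap and sameAfterCorrection are hypothetical names for this example, not the UADD/USUB macros used in the project.

```cpp
#include <cassert>
#include <cstdint>

// Four 64-bit limbs, least significant limb first.
struct U256 { uint64_t d[4]; };

// secp256k1 field prime: P = 2^256 - 2^32 - 977.
constexpr U256 P = {{0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
                     0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL}};

// r += a modulo 2^256; the carry out of the top limb is discarded.
void addWrap(U256 &r, const U256 &a) {
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = r.d[i] + a.d[i];
        uint64_t c1 = s < r.d[i];   // carry from the limb addition
        uint64_t s2 = s + carry;
        uint64_t c2 = s2 < s;       // carry from adding the incoming carry
        r.d[i] = s2;
        carry = c1 | c2;
    }
}

// r -= small modulo 2^256; the borrow is propagated through all limbs.
void subSmallWrap(U256 &r, uint64_t small) {
    uint64_t borrow = small;
    for (int i = 0; i < 4; i++) {
        uint64_t before = r.d[i];
        r.d[i] = before - borrow;
        borrow = before < borrow ? 1 : 0;
    }
}

// Returns true when "add P" and "subtract 0x1000003D1" produce the same limbs.
bool sameAfterCorrection(U256 r) {
    U256 a = r, b = r;
    addWrap(a, P);
    subSmallWrap(b, 0x1000003D1ULL);
    for (int i = 0; i < 4; i++) {
        if (a.d[i] != b.d[i]) return false;
    }
    return true;
}
```

Note that the subtraction still has to propagate the borrow through all four limbs, which is why the suggested snippet keeps the subtract-with-carry steps with 0ULL operands on the upper limbs.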