What in/why is the copyBigInt() causing error? Same error appears in multiple programs written prior to release of RTX 30xx cards.
It's not copyBigInt() itself that's problematic (it's a simple element-wise assignment) but one of the arrays passed to it which is not aligned. CUDA wants all arrays aligned to 32-but boundaries and one of the arrays that eventually reaches copyBigInt() comes from "xp" and "x" pointer arguments of beginBatchAdd()...these are passed to SubModP() and the result is stored in an 8-element int array that's then passed to MulModP() and from there to copyBigInt().
At first it wasn't clear to me where this error was coming from because the problem disappeared in debug mode, so I could not use the debugger. That's right, if you pass -g -G switches to NVCC, you get a working but extremely slow bitcrack binary.
I tried draconian measures in a attempt to fix this like unrolling the loop, changing the array assignment to memcpy(), qualifying it with __restrict__ and __align__ keywords and I even changed it to a #define statement but the destination and source arrays just don't want to be accessed (since these arrays cannot even be used in the parent function, the problem stems deeper). More bafflingly, assigning a constant to an element in the dest array or making a variable that's initialized to an element from src
works but this obviously breaks the elliptic curve stuff.
This is supposed to be performance-critical code so I did not attempt to change the static array to malloc.
For the uninitiated: this is where the bug is:
https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuhCudaMath/secp256k1.cuh, everything in here are inline functions.
We arrive here from CudaKeySearchDevice via beginBatchAdd() and beginBatchAddWithDouble(). Both of these functions call MulModP for point multiplication. Methods like that need to copy to and from temporary arrays. Somehow the arrays being passed are not on an alignment boundary, and I'm honestly not sure what to do. (Of course, rewriting the whole secp256k1 module is also an option but really...? That's like opening a nut with a sledgehammer.)