What in copyBigInt() is causing the error, and why? The same error appears in multiple programs written prior to the release of the RTX 30xx cards.
It's not copyBigInt() itself that's problematic (it's a simple element-wise assignment) but one of the arrays passed to it, which is not aligned. CUDA wants these arrays aligned to 32-bit boundaries, and one of the arrays that eventually reaches copyBigInt() comes from the "xp" and "x" pointer arguments of beginBatchAdd(): these are passed to SubModP(), the result is stored in an 8-element int array, and that array is then passed to MulModP() and from there to copyBigInt().
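For reference, copyBigInt() is essentially just this (paraphrased from the linked header, not quoted verbatim):

```cpp
// Paraphrase of copyBigInt() from cudaMath/secp256k1.cuh: an
// element-wise copy of a 256-bit integer held as eight 32-bit words.
__device__ void copyBigInt(const unsigned int src[8], unsigned int dest[8])
{
    for(int i = 0; i < 8; i++) {
        dest[i] = src[i];   // trivial on its own; the trouble is the
                            // alignment of the pointers handed in
    }
}
```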
At first it wasn't clear to me where this error was coming from, because the problem disappeared in debug mode, so I could not use the debugger. That's right: if you pass the -g -G switches to NVCC, you get a working but extremely slow BitCrack binary.
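Since the crash only shows up with optimizations on, my suspicion is that the optimizer widens the word-by-word copy into vector accesses that assume more alignment than the incoming pointers actually have. A minimal standalone sketch that triggers the same class of error deterministically (this reproduces the error class only; it is not BitCrack's code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Standalone repro of the *class* of error: a 16-byte-wide access
// through a pointer that is only 4-byte aligned faults on the device
// with "misaligned address".
__global__ void copyWide(const unsigned int *src, unsigned int *dest)
{
    // uint4 loads/stores require 16-byte alignment -- the same kind of
    // widened access an optimizer may emit for an 8-word copy loop.
    const uint4 *s = reinterpret_cast<const uint4 *>(src);
    uint4 *d = reinterpret_cast<uint4 *>(dest);
    d[0] = s[0];
    d[1] = s[1];
}

int main()
{
    unsigned int *buf = nullptr;
    cudaMalloc(&buf, 64 * sizeof(unsigned int)); // well-aligned base pointer

    // buf + 1 is still 4-byte aligned but no longer 16-byte aligned,
    // so the kernel faults; buf + 32 (the destination) stays aligned.
    copyWide<<<1, 1>>>(buf + 1, buf + 32);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(buf);
    return 0;
}
```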
I tried draconian measures in an attempt to fix this, like unrolling the loop, changing the array assignment to memcpy(), qualifying it with the __restrict__ and __align__ keywords, and even turning it into a #define, but the destination and source arrays just don't want to be accessed (and since these arrays cannot even be used in the parent function, the problem stems from somewhere deeper). More bafflingly, assigning a constant to an element of the dest array, or initializing a local variable from an element of src, works, but that obviously breaks the elliptic curve math.
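Reconstructed from the description above (these are not the actual diffs), the attempted workarounds were along these lines:

```cpp
#include <cstring>

// Variants tried, reconstructed from the description above (not the
// actual diffs); the function name here is invented. All of them
// still crash in a release build.
__device__ void copyBigIntVariants(const unsigned int * __restrict__ src,
                                   unsigned int * __restrict__ dest)
{
    // 1. Manual unroll instead of the loop:
    dest[0] = src[0]; dest[1] = src[1]; dest[2] = src[2]; dest[3] = src[3];
    dest[4] = src[4]; dest[5] = src[5]; dest[6] = src[6]; dest[7] = src[7];

    // 2. memcpy instead of element-wise assignment:
    memcpy(dest, src, 8 * sizeof(unsigned int));

    // 3. __restrict__ / __align__ qualifiers on the signature, as above.

    // Diagnostics that DO run, but break the EC math:
    dest[0] = 0x12345678u;          // constant store into dest works
    unsigned int probe = src[0];    // plain load from src works
    (void)probe;
}

// 4. The same copy as a macro, expanded at each call site:
#define COPY_BIG_INT(dest, src) \
    do { for(int i = 0; i < 8; i++) (dest)[i] = (src)[i]; } while(0)
```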
This is supposed to be performance-critical code, so I did not attempt to change the static array to a malloc() allocation.
For the uninitiated, this is where the bug is:
https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuh
Everything in this file is inline functions.
We arrive here from CudaKeySearchDevice via beginBatchAdd() and beginBatchAddWithDouble(). Both of these functions call MulModP() for the modular multiplications in the point arithmetic, and methods like that need to copy to and from temporary arrays. Somehow the arrays being passed are not on an alignment boundary, and I'm honestly not sure what to do. (Of course, rewriting the whole secp256k1 module is also an option, but really...? That's like cracking a nut with a sledgehammer.)
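A simplified sketch of the failing path as I understand it (function names from the repo, bodies paraphrased and heavily trimmed):

```cpp
// Paraphrased shape of the failing call chain; not verbatim repo code.
__device__ void beginBatchAdd(const unsigned int *xp, const unsigned int *x,
                              unsigned int *chain /* , ... */)
{
    unsigned int t[8];   // local 256-bit temporary

    subModP(xp, x, t);   // t = xp - x (mod p); reads the incoming pointers
    mulModP(chain, t);   // multiplies into the running product; internally
                         // this calls copyBigInt(), and that is where the
                         // misaligned access actually fires
}
```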
I've been following your debugging by hand, since the debug build runs while the release build crashes. I'm nowhere near as close to the base function as you are, but it seems I'm hitting a different path. You're saying it starts from beginBatchAdd().
I know the following breaks the functionality, but just for isolating the issue: if you comment out this part
https://github.com/brichard19/BitCrack/blob/master/CudaKeySearchDevice/CudaKeySearchDevice.cu#L179-L190
then the code runs for me (of course its output is broken now).
The interesting part is that doBatchInverse(), as well as the loop that follows it, will make it crash, while that loop never even hits completeBatchAdd().
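For orientation, the commented-out region follows the usual Montgomery batch-inversion shape; a paraphrase of the structure, not the repo's exact code:

```cpp
// Paraphrased structure of the batch addition: the Montgomery trick,
// one modular inverse per batch instead of one per point.
__device__ void batchAddStep(int batchSize)
{
    for(int i = 0; i < batchSize; i++) {
        beginBatchAdd(/* ... */);     // accumulate (xp - x) products
    }

    doBatchInverse(/* ... */);        // single modular inverse for the batch

    for(int i = batchSize - 1; i >= 0; i--) {
        completeBatchAdd(/* ... */);  // unwind the chain, finish each addition
    }
}
```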
Maybe I'm hitting a different issue? Or did you mean completeBatchAdd()?

Edit:
Nvm, I hadn't undone my function overwrites. It indeed bubbles up from subModP().
https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuh#L646

We're on the same track (I think), thank god.
*digging*
Edit 2:
Installed all the proper tools to debug simultaneous threads. The following breakpoint got hit.
That's it for now, time for sleep.
Btw: when running in legacy mode (old-hardware compatible) under Nsight, it ran fine. I'm not sure yet which flag that corresponds to on regular CUDA builds; I just pressed the wrong button, was waiting for it to crash, and it totally didn't. I'll check tomorrow what speed it was running at; it could be interesting as a quick fix.
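If "legacy mode" boils down to compiling for an older compute capability, the regular-build equivalent would be an nvcc -gencode setting; my unverified guess is something like `-gencode arch=compute_35,code=compute_35`, which ships PTX that the driver JIT-compiles for the RTX 30xx (sm_86) at load time instead of natively tuned sm_86 SASS. The JIT may optimize the copy differently, which would fit the crash disappearing.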
For staying in a certain range (it must be a small range): do you want it to end, or to push back into the range? BitCrack ends, Kangaroo pushes back. Which route are you trying to go? To end, you need a last-key function...
Just end, it's not that complicated. The CUDA part is just a little too much above my understanding atm. The BitCrack parts are easier to understand, for me at least.
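To make the two options concrete, a minimal host-side sketch (names invented for illustration, nothing here is quoted from either codebase; keys shown as 64-bit for brevity, the real ones are 256-bit):

```cpp
#include <cstdint>

// BitCrack-style: report "done" as soon as the key walks past the
// end of the range, so the search simply ends.
bool shouldStop(uint64_t key, uint64_t rangeEnd)
{
    return key >= rangeEnd;
}

// Kangaroo-style: wrap the key back into [rangeStart, rangeEnd) and
// keep going.
uint64_t pushBack(uint64_t key, uint64_t rangeStart, uint64_t rangeEnd)
{
    if(key >= rangeEnd) {
        key = rangeStart + (key - rangeEnd) % (rangeEnd - rangeStart);
    }
    return key;
}
```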