Edit2:
So, to view the memory you gotta enable device debugging, but when enabled, it 'works' (slow af ofc). Great.
Yeah, that's the warning I was talking about. I'm glad you were able to at least narrow it down to Turing-to-Ampere arch changes though.
Perhaps there is a way in VS to print the contents at the raw pointer address each variable sits at? If it's indeed one of those variables that's not aligned, trying that should surely trigger the same misaligned access error. Alternatively we can just view the addresses in hexadecimal and see which ones are not divisible by 4, 8, etc. I know gdb can do both of these things but I forgot the commands for them.
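Something along these lines might also work from inside the kernel itself. Just a sketch (the file name, the checkAlign kernel and the d_buf buffer are made up for illustration, not BitCrack's actual variables): device-side printf the raw address of whatever pointer is about to be dereferenced plus its remainder mod 4/8/16, and anything with a nonzero remainder for its access size is the suspect.

// align_check.cu - toy sketch: print a device pointer's address and alignment.
// In practice you'd print the pointers the real kernel dereferences right
// before the access that traps.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void checkAlign(const unsigned int *p)
{
    printf("p = %p  mod4 = %u  mod8 = %u  mod16 = %u\n",
           (const void *)p,
           (unsigned)((uintptr_t)p % 4),
           (unsigned)((uintptr_t)p % 8),
           (unsigned)((uintptr_t)p % 16));
}

int main()
{
    unsigned int *d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(unsigned int)); // cudaMalloc hands back generously aligned memory
    checkAlign<<<1, 1>>>(d_buf + 1);                // +1 word: still 4-byte aligned, no longer 8/16
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}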
Docs say I should compile against compute_86,sm_86, so I'll test what that does using legacy mode. Whatever that mode even does.
To my understanding, the difference between compute_* and sm_* is that the compute_ targets tell the compiler which version of PTX to emit (PTX being a kind of virtual assembly language for NVIDIA GPUs). There are different versions of that assembly spec, the so-called compute capabilities, and PTX is forward compatible: the driver can JIT-compile PTX built for an older compute cap onto any newer GPU family. In fact that's the reason brichard19 was able to leave the makefile at compute_35 all this time without bitcrack going to hell for everyone running a newer GPU family.
sm_*, on the other hand, is the specification of the actual GPU binary, what we'd loosely call the "linking phase" in C/C++ land, and it absolutely has to match the compute cap of the specific GPU you're targeting in order to run on it. The GPU binaries are not forward-compatible, which means you can't run a CUDA binary built for one GPU family on any other family (unless its compute cap has the same major version).
For example, if you compile a CUDA program for Kepler's PTX architecture, which is what bitcrack was doing all this time, it's gonna work on Maxwell, Pascal, Volta, and every family after that.
But the binary itself won't work unless your GPU family has the same major version as the sm_ the program was compiled against (e.g. Volta 7.0 and Turing 7.5, but not Ampere 8.6 or Pascal 6.1).
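To make that split concrete, this is roughly how the two halves look on the nvcc command line (my reading of the docs; kernel.cu and the output name are placeholders, not BitCrack's makefile):

nvcc -gencode=arch=compute_35,code=compute_35 kernel.cu -o kernel   # embed PTX only; the driver JIT-compiles it for whatever GPU is present (Kepler or newer)
nvcc -gencode=arch=compute_86,code=sm_86 kernel.cu -o kernel        # embed sm_86 SASS only; loads only on GPUs with a matching compute cap (same major, equal or newer minor)

The first form is the forward-compatible one, at the cost of a JIT step on first launch; the second only ever loads on the family it was built for.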
If you wanted to distribute a CUDA binary that works on all of those families, you gotta pass a low compute_ version and then the sm_ versions for every family you want it to run on, e.g.
-gencode=arch=compute_35,code=[sm_35,sm_50,sm_52,sm_61,sm_75,sm_86]
to make it run on every desktop GPU starting with Kepler (this excludes compute caps for embedded Tegra GPUs in game consoles, a few datacenter Tesla GPUs, and the Titan V). Stuffing all those binaries inside a single program also turns out to make it huge, perhaps the reason video game installs run to several GB.
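You can actually peek at that stuffing: cuobjdump, which ships with the CUDA toolkit, lists every cubin and PTX image baked into a fat binary (the binary name below is just a placeholder):

cuobjdump --list-elf ./cuBitCrack   # one cubin entry per sm_* it was built for
cuobjdump --list-ptx ./cuBitCrack   # the embedded PTX, i.e. the compute_* half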
At any rate, the blasted thing is supposed to work without passing any compute cap flags at all, so maybe we're focusing on the wrong thing.
I'll study the docs some more because attempting to debug without understanding CUDA won't get us anywhere.