Edit2:
So, to view the memory you gotta enable device debugging, but when enabled, it 'works' (slow af ofc). Great.
Yeah, that's the warning I was talking about. I'm glad you were able to at least narrow it down to Turing-to-Ampere arch changes though.
Perhaps there is a way in VS to print the contents at the raw pointer address each variable sits at? If it's indeed one of those variables that's not aligned, trying that should surely trigger the same misaligned access error. Alternatively we can just view the addresses in hexadecimal and see which ones are not divisible by 4, 8, etc. I know gdb can do both of these things but I forgot the commands for them.
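Something along these lines might also work from inside the kernel itself. Just a sketch (the file name, the checkAlign kernel and the d_buf buffer are made up for illustration, not BitCrack's actual variables): device-side printf the raw address of whatever pointer is about to be dereferenced plus its remainder mod 4/8/16, and anything with a nonzero remainder for its access size is the suspect.

// align_check.cu - toy sketch: print a device pointer's address and alignment.
// In practice you'd print the pointers the real kernel dereferences right
// before the access that traps.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void checkAlign(const unsigned int *p)
{
    printf("p = %p  mod4 = %u  mod8 = %u  mod16 = %u\n",
           (const void *)p,
           (unsigned)((uintptr_t)p % 4),
           (unsigned)((uintptr_t)p % 8),
           (unsigned)((uintptr_t)p % 16));
}

int main()
{
    unsigned int *d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(unsigned int)); // cudaMalloc hands back generously aligned memory
    checkAlign<<<1, 1>>>(d_buf + 1);                // +1 word: still 4-byte aligned, no longer 8/16
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}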
Docs say I should compile against compute_86,sm_86, so I'll test what that does using legacy mode. Whatever that mode even does.
To my understanding, the difference between compute_* and sm_* is that the compute_ targets tell the compiler which version of PTX to emit (PTX being a kind of virtual assembly language for NVIDIA GPUs). There are different versions of that assembly spec, the so-called compute capabilities, and PTX is forward compatible: the driver can JIT-compile PTX built for an older compute cap onto any newer GPU family. In fact that's the reason brichard19 was able to leave the makefile at compute_35 all this time without bitcrack going to hell for everyone running a newer GPU family.
sm_*, on the other hand, is the specification of the actual GPU binary, what we'd loosely call the "linking phase" in C/C++ land, and it absolutely has to match the compute cap of the specific GPU you're targeting in order to run on it. The GPU binaries are not forward-compatible, which means you can't run a CUDA binary built for one GPU family on any other family (unless its compute cap has the same major version).
For example, if you compile a CUDA program for Kepler's PTX architecture, which is what bitcrack was doing all this time, it's gonna work on Maxwell, Pascal, Volta, and every family after that.
But the binary itself won't work unless your GPU family has the same major version as the sm_ the program was compiled against (e.g. Volta 7.0 and Turing 7.5, but not Ampere 8.6 or Pascal 6.1).
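To make that split concrete, this is roughly how the two halves look on the nvcc command line (my reading of the docs; kernel.cu and the output name are placeholders, not BitCrack's makefile):

nvcc -gencode=arch=compute_35,code=compute_35 kernel.cu -o kernel   # embed PTX only; the driver JIT-compiles it for whatever GPU is present (Kepler or newer)
nvcc -gencode=arch=compute_86,code=sm_86 kernel.cu -o kernel        # embed sm_86 SASS only; loads only on GPUs with a matching compute cap (same major, equal or newer minor)

The first form is the forward-compatible one, at the cost of a JIT step on first launch; the second only ever loads on the family it was built for.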
If you wanted to distribute a CUDA binary that works on all of those families, you gotta pass a low compute_ version and then the sm_ versions for every family you want it to run on, e.g.
-gencode=arch=compute_35,code=[sm_35,sm_50,sm_52,sm_61,sm_75,sm_86]
to make it run on every desktop GPU starting with Kepler (this excludes compute caps for embedded Tegra GPUs in game consoles, a few datacenter Tesla GPUs, and the Titan V). Stuffing all those binaries inside a single program also turns out to make it huge, perhaps the reason video game installs run to several GB.
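You can actually peek at that stuffing: cuobjdump, which ships with the CUDA toolkit, lists every cubin and PTX image baked into a fat binary (the binary name below is just a placeholder):

cuobjdump --list-elf ./cuBitCrack   # one cubin entry per sm_* it was built for
cuobjdump --list-ptx ./cuBitCrack   # the embedded PTX, i.e. the compute_* half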
At any rate, the blasted thing is supposed to work without passing any compute cap flags at all, so maybe we're focusing on the wrong thing.
I'll study the docs some more because attempting to debug without understanding CUDA won't get us anywhere.