
Topic: BitCrack - A tool for brute-forcing private keys - page 65. (Read 77200 times)

full member
Activity: 1232
Merit: 242
Shooters Shoot...
Why the hell is the speed so low with max -b 70 -t 512 -p 2078?

[2021-01-30.08:55:30] [Info] Compression: compressed
[2021-01-30.08:55:30] [Info] Starting at: 0000000000000000000000000000000000000000000000000000000000000001
[2021-01-30.08:55:30] [Info] Ending at:   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364140
[2021-01-30.08:55:30] [Info] Counting by: 0000000000000000000000000000000000000000000000000000000000000001
[2021-01-30.08:55:30] [Info] Compiling OpenCL kernels...
[2021-01-30.08:55:30] [Info] Initializing GeForce RTX 3080
[2021-01-30.08:55:36] [Info] Generating 74,475,520 starting points (2841.0MB)
[2021-01-30.08:55:42] [Info] 10.0%
[2021-01-30.08:55:43] [Info] 20.0%
[2021-01-30.08:55:43] [Info] 30.0%
[2021-01-30.08:55:43] [Info] 40.0%
[2021-01-30.08:55:43] [Info] 50.0%
[2021-01-30.08:55:44] [Info] 60.0%
[2021-01-30.08:55:44] [Info] 70.0%
[2021-01-30.08:55:44] [Info] 80.0%
[2021-01-30.08:55:44] [Info] 90.0%
[2021-01-30.08:55:45] [Info] 100.0%
[2021-01-30.08:55:45] [Info] Done
[00:00:00] 4545/10240MB | 1 target 780.92 MKey/s

OpenCL versus the Cuda version. clBitCrack versus cuBitCrack; have you tried cuBitCrack?
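For reference, the point count in that log lines up with the three options multiplied together, assuming -b is blocks, -t is threads per block and -p is points per thread. A minimal C++ sketch of the arithmetic (the helper names are made up for illustration and are not BitCrack code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: total starting points for a run, assuming
// -b = blocks, -t = threads per block, -p = points per thread.
constexpr std::uint64_t starting_points(std::uint64_t blocks,
                                        std::uint64_t threads,
                                        std::uint64_t points_per_thread) {
    return blocks * threads * points_per_thread;
}

// Bytes per starting point implied by a reported buffer size.
// Assumes the "MB" in the log means 1024 * 1024 bytes.
constexpr double bytes_per_point(double reported_mb, std::uint64_t points) {
    return reported_mb * 1024.0 * 1024.0 / static_cast<double>(points);
}

// Check against the figures quoted in the log above (-b 70 -t 512 -p 2078).
inline void check_against_log() {
    assert(starting_points(70, 512, 2078) == 74475520ULL);
}
```

Going only by the two log figures (74,475,520 points, 2841.0MB), that works out to roughly 40 bytes of device memory per starting point.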
jr. member
Activity: 40
Merit: 7
Why the hell is the speed so low with max -b 70 -t 512 -p 2078?

[2021-01-30.08:55:30] [Info] Compression: compressed
[2021-01-30.08:55:30] [Info] Starting at: 0000000000000000000000000000000000000000000000000000000000000001
[2021-01-30.08:55:30] [Info] Ending at:   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364140
[2021-01-30.08:55:30] [Info] Counting by: 0000000000000000000000000000000000000000000000000000000000000001
[2021-01-30.08:55:30] [Info] Compiling OpenCL kernels...
[2021-01-30.08:55:30] [Info] Initializing GeForce RTX 3080
[2021-01-30.08:55:36] [Info] Generating 74,475,520 starting points (2841.0MB)
[2021-01-30.08:55:42] [Info] 10.0%
[2021-01-30.08:55:43] [Info] 20.0%
[2021-01-30.08:55:43] [Info] 30.0%
[2021-01-30.08:55:43] [Info] 40.0%
[2021-01-30.08:55:43] [Info] 50.0%
[2021-01-30.08:55:44] [Info] 60.0%
[2021-01-30.08:55:44] [Info] 70.0%
[2021-01-30.08:55:44] [Info] 80.0%
[2021-01-30.08:55:44] [Info] 90.0%
[2021-01-30.08:55:45] [Info] 100.0%
[2021-01-30.08:55:45] [Info] Done
[00:00:00] 4545/10240MB | 1 target 780.92 MKey/s
newbie
Activity: 1
Merit: 0
Can someone help please?
is there a way to stop/resume in cmd terminal?
thanks in advance!
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Last summer I was running bitcrack on some 2060 without any problem. I don't understand all the latest messages.

EDIT
oh, it's about driver version, ok

Yeah, the Ampere cards do not run on these drivers. Sad story.


Not right. It's bitcrack's kernel code that needs to be updated. Here is the solution (GPUMath), if some developer can apply it to bitcrack's kernel code

https://github.com/JeanLucPons/VanitySearch/commit/d3c1debb12233722f6ccc09ed3317769161a4773

I just see a bunch of __device__ functions in GPUMath.h, the other files are unrelated to this problem since they are C source code.

In that commit we have:

__device__ __forceinline__ uint32_t ctz(uint64_t x) {

__device__ void _DivStep62(uint64_t u[5],uint64_t v[5],
   int32_t* pos

__device__ void MatrixVecMulHalf(uint64_t dest[5],uint64_t u[5],uint64_t v[5],int64_t _11,int64_t _12,uint64_t* carry) {

__device__ void MatrixVecMul(uint64_t u[5],uint64_t v[5],int64_t _11,int64_t _12,int64_t _21,int64_t _22) {

__device__ uint64_t AddCh(uint64_t r[5],uint64_t a[5],uint64_t carry) {

The only thing they have in common with Bitcrack code is that these are also using array parameters and are somehow succeeding.



Here's what I did find in bitcrack code.

In cudaDeviceKeys.cu, elliptic curve-related code about x and y points:

Code:
__constant__ unsigned int *_xPtr[1];

__constant__ unsigned int *_yPtr[1];


__device__ unsigned int *ec::getXPtr()
{
return _xPtr[0];
}

__device__ unsigned int *ec::getYPtr()
{
return _yPtr[0];
}

Why are these a single-element array of a pointer in the first place? It looks redundant and the size of 1 is just making more opportunities to get pointer handling wrong.

It's relevant because after xPtr is copied to device memory in x, then this code is called that uses the arrays x and chain (and also inverse but that is initialized to fixed values so we know that's already aligned).

Code:
beginBatchAdd(_INC_X, x, chain, i, i, inverse);

chain is also initialized to _CHAIN which is also a 1-array of a pointer.

I read some things about __constant__ variables in constant memory and learned that it's read-only and can only be initialized during declaration but I could be wrong and may have to check again.



But I think this part could be the real problem:

Code:
// inside for loop of doIteration()
        unsigned int x[8];

        unsigned int digest[5];
        // ... more array declarations inside loop, such as for y

I think because it's forcing CUDA to discard and allocate memory repeatedly, it might be putting this array on some unaligned address or something.
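Whether or not the loop is the culprit, a plain unsigned int x[8] only promises the alignment of a single unsigned int (typically 4 bytes), so any code that casts it to a wider type and reads it in bigger chunks has no alignment guarantee. A host-side C++ sketch of the difference; Aligned256 and is_aligned are hypothetical names, not BitCrack code, and as far as I know the same alignas specifier is accepted by nvcc in device code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Plain array, as in the loop quoted above: only the alignment of its
// element type is guaranteed.
using Plain256 = unsigned int[8];

// Hypothetical wrapper: the same 8 words, forced onto a 16-byte boundary.
struct alignas(16) Aligned256 {
    unsigned int w[8];
};

// True if address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

static_assert(alignof(Plain256) == alignof(unsigned int),
              "an array keeps the alignment of its element type");
static_assert(alignof(Aligned256) == 16,
              "the wrapper guarantees 16-byte alignment");
```

With the wrapper, every object of that type starts on a 16-byte boundary no matter where it is declared; with the plain array, a 16-byte boundary can happen but is never promised.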
member
Activity: 131
Merit: 32
Friends, tell me the optimal parameters for the RTX 2060: -b, -t, -p?
-b 36 -t 512 -p 2800
-b 36 -t 258 -p 2500
newbie
Activity: 26
Merit: 2
Friends, tell me the optimal parameters for the RTX 2060: -b, -t, -p?
newbie
Activity: 2
Merit: 0
Last summer I was running bitcrack on some 2060 without any problem. I don't understand all the latest messages.

EDIT
oh, it's about driver version, ok

Yeah, the Ampere cards do not run on these drivers. Sad story.


Not right. It's bitcrack's kernel code that needs to be updated. Here is the solution (GPUMath), if some developer can apply it to bitcrack's kernel code

https://github.com/JeanLucPons/VanitySearch/commit/d3c1debb12233722f6ccc09ed3317769161a4773



jr. member
Activity: 36
Merit: 3
Last summer I was running bitcrack on some 2060 without any problem. I don't understand all the latest messages.

EDIT
oh, it's about driver version, ok

Yeah, the Ampere cards do not run on these drivers. Sad story.
sr. member
Activity: 661
Merit: 250
Last summer I was running bitcrack on some 2060 without any problem. I don't understand all the latest messages.

EDIT
oh, it's about driver version, ok
full member
Activity: 1232
Merit: 242
Shooters Shoot...
Was going back to look at the issues you all are trying to fix.  This may or may not help you all. If you download the latest cuBitCrack from github, it will run with the GTX 10 series just fine with any driver; however, the only way I could get it to work on RTX 20xx cards was to roll back the driver to 452.

Not sure if you can look at what changed between the 452 driver and newer drivers that caused it to stop working on 20xx cards. unless it's a mere cuda runtime thing.

Also, I have noticed in another program that when I try and generate random keys with the 20xx or 30xx, that is when I get the misaligned or memory error, but only when I have a large input list (of addresses). Got me thinking with BitCrack, when you have a large list or the program starts generating all those starting points, that's where error is coming into play. I wonder if you choked down the starting points to say 32, 32, 32, if same error would exist.

When I use straight key to key search, no generation of random points, I have no issues with any series of cards on the other program.

Also, the new Ampere has 4 SMs, 4 x 32. But from what I have seen, it seems like random points, creating many points does something to throw the misaligned address/memory errors. Crazy stuff to say the least.

I'll try the 1x1 grid later suggested above. Imho we need someone with more CUDA knowledge to crack this one.
Code breaks differ from run to run (last one was readIntLSW (https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuh#L110))..
Jean Luc is the CUDA master...I'll see if he has messed with the 30xx cards yet
jr. member
Activity: 36
Merit: 3
Was going back to look at the issues you all are trying to fix.  This may or may not help you all. If you download the latest cuBitCrack from github, it will run with the GTX 10 series just fine with any driver; however, the only way I could get it to work on RTX 20xx cards was to roll back the driver to 452.

Not sure if you can look at what changed between the 452 driver and newer drivers that caused it to stop working on 20xx cards. unless it's a mere cuda runtime thing.

Also, I have noticed in another program that when I try and generate random keys with the 20xx or 30xx, that is when I get the misaligned or memory error, but only when I have a large input list (of addresses). Got me thinking with BitCrack, when you have a large list or the program starts generating all those starting points, that's where error is coming into play. I wonder if you choked down the starting points to say 32, 32, 32, if same error would exist.

When I use straight key to key search, no generation of random points, I have no issues with any series of cards on the other program.

Also, the new Ampere has 4 SMs, 4 x 32. But from what I have seen, it seems like random points, creating many points does something to throw the misaligned address/memory errors. Crazy stuff to say the least.

I'll try the 1x1 grid later suggested above. Imho we need someone with more CUDA knowledge to crack this one.
Code breaks differ from run to run (last one was readIntLSW (https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuh#L110))..
full member
Activity: 1232
Merit: 242
Shooters Shoot...
Edit:
Attempts to log, gather more info or whatever, will slow down the program just enough that it *works*.
Prob. the reason it works on older hardware. It's really the speed killing it.

Maybe the SMs in the GPU have so many threads per block running that they don't have enough time to align all the pointers for Bitcrack. The GPU does a lot of its own things at runtime. I never saw bitcrack do any alignment in the cracking loop, but neither did I check the parts responsible for initialization.

Could the initialization stage be launching kernels with invalid block sizes in a grid? Maybe there's a way to make it launch a 1x1 sized grid so we can see what happens.
Was going back to look at the issues you all are trying to fix.  This may or may not help you all. If you download the latest cuBitCrack from github, it will run with the GTX 10 series just fine with any driver; however, the only way I could get it to work on RTX 20xx cards was to roll back the driver to 452.

Not sure if you can look at what changed between the 452 driver and newer drivers that caused it to stop working on 20xx cards. unless it's a mere cuda runtime thing.

Also, I have noticed in another program that when I try and generate random keys with the 20xx or 30xx, that is when I get the misaligned or memory error, but only when I have a large input list (of addresses). Got me thinking with BitCrack, when you have a large list or the program starts generating all those starting points, that's where error is coming into play. I wonder if you choked down the starting points to say 32, 32, 32, if same error would exist.

When I use straight key to key search, no generation of random points, I have no issues with any series of cards on the other program.

Also, the new Ampere has 4 SMs, 4 x 32. But from what I have seen, it seems like random points, creating many points does something to throw the misaligned address/memory errors. Crazy stuff to say the least.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Edit:
Attempts to log, gather more info or whatever, will slow down the program just enough that it *works*.
Prob. the reason it works on older hardware. It's really the speed killing it.

Maybe the SMs in the GPU have so many threads per block running that they don't have enough time to align all the pointers for Bitcrack. The GPU does a lot of its own things at runtime. I never saw bitcrack do any alignment in the cracking loop, but neither did I check the parts responsible for initialization.

Could the initialization stage be launching kernels with invalid block sizes in a grid? Maybe there's a way to make it launch a 1x1 sized grid so we can see what happens.
jr. member
Activity: 36
Merit: 3


Yeah, that's the warning I was talking about. I'm glad you were able to at least narrow it down to Turing-to-Ampere arch changes though.

Perhaps there is a way in VS to print the contents of the raw pointer address that each variable is at? That should surely trigger a misaligned access error if it's indeed the variable that's not aligned. Alternatively we can just view the addresses in hexadecimal and see which ones are not divisible by 4, 8, etc. I know gdb can do both of these things but I forgot the commands for them.

Docs say I should compile against c86,sm86, so will test what that does using legacy mode. Whatever that mode even does.

To my understanding, the difference between compute_* and sm_* is that the compute_ targets are for compiling the CUDA code into PTX, which is some kind of assembly code for GPUs. Basically there are different versions of assembly code specifications, and they're forward compatible with newer PTX versions (the so-called compute caps). In fact that's the reason brichard19 was able to leave the makefile at compute_35 all this time without bitcrack going to hell for everyone running a newer GPU family.

While sm_* is the specification of the binary, what we'd call in C/C++ land the "linking phase", and it absolutely has to match the compute cap for the specific GPU you're targeting in order to run on it. The GPU binaries are not forward-compatible, which means you can't run a CUDA binary built for one GPU family on any other family (unless its compute cap has the same major version and an equal or higher minor version).

For example, if you compile a CUDA program for Kepler's PTX architecture, which is what bitcrack was doing all this time, it's gonna work on Maxwell, Pascal, Volta, etc. All families after it too.

But the binary itself won't work unless your GPU family has the same major version as inside the sm_ the program was compiled against (e.g. an sm_70 Volta binary also runs on Turing 7.5, but not on Ampere 8.6 or Pascal 6.1).

If you wanted to distribute a CUDA binary that works on all of those families, you gotta pass a low compute_ version and then the sm_ versions for every family you want it to run on, e.g.

Code:
-gencode=arch=compute_35,code=[sm_35,sm_50,sm_52,sm_61,sm_75,sm_86]

to make it run on every desktop GPU starting with Kepler (This excludes compute caps for embedded Tegra GPUs in game consoles, a few datacenter Tesla GPUs, and the Titan V). Stuffing all those binaries inside a single program also turns out to make it very huge, perhaps the reason why video game installs are several GBs large.

At any rate, the blasted thing is supposed to work without passing any compute cap flags at all so maybe we're focusing on the wrong thing. I'll study the docs some more because attempting to debug without understanding CUDA won't get us anywhere.

You're totally right, was just reading about these. Thanks for adding, you're amazing in explaining stuff btw.
The reason for trying this was because of the legacy build thing. I noticed the docs stating the following:

Code:
1.4.3. Independent Thread Scheduling Compatibility
NVIDIA GPUs since Volta architecture have Independent Thread Scheduling among threads in a warp. If the developer made assumptions about warp-synchronicity, this feature can alter the set of threads participating in the executed code compared to previous architectures. Please see Compute Capability 7.0 in the Programming Guide for details and corrective actions. To aid migration to the NVIDIA Ampere GPU architecture, developers can opt-in to the Pascal scheduling model with the following combination of compiler options.

nvcc -gencode=arch=compute_60,code=sm_80 ...

And while debugging, I noticed my breakpoints change when using different specific versions.

My understanding of CUDA is minimal, but I'm fascinated by it. It's a shame the person with CUDA knowledge hasn't shared any insights yet.

Edit:
Attempts to log, gather more info or whatever, will slow down the program just enough that it *works*.
Prob. the reason it works on older hardware. It's really the speed killing it.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Edit2:
So, to view the memory you gotta enable device debugging, but when enabled, it 'works' (slow af ofc). Great.

Yeah, that's the warning I was talking about. I'm glad you were able to at least narrow it down to Turing-to-Ampere arch changes though.

Perhaps there is a way in VS to print the contents of the raw pointer address that each variable is at? That should surely trigger a misaligned access error if it's indeed the variable that's not aligned. Alternatively we can just view the addresses in hexadecimal and see which ones are not divisible by 4, 8, etc. I know gdb can do both of these things but I forgot the commands for them.
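The divisible-by-4/8/16 check is simple enough to do in code once you have the raw address. A host-side C++ sketch, with hypothetical helper names (misalignment, describe_address) that are not from BitCrack or any debugger:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Number of bytes by which address p overshoots the previous multiple of
// `alignment`; 0 means the address is aligned.
inline std::size_t misalignment(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return static_cast<std::size_t>(
        reinterpret_cast<std::uintptr_t>(p) % alignment);
}

// One log line per variable: its address plus the residues modulo 4, 8
// and 16, so unaligned ones stand out without doing hex arithmetic by eye.
inline std::string describe_address(const char* name, const void* p) {
    char buf[160];
    std::snprintf(buf, sizeof buf, "%s @ %p  mod4=%zu mod8=%zu mod16=%zu",
                  name, const_cast<void*>(p),
                  misalignment(p, 4), misalignment(p, 8), misalignment(p, 16));
    return std::string(buf);
}
```

Any variable whose line shows a non-zero residue for the access width in use is a suspect. For example, a pointer one unsigned int past a 16-byte-aligned base reports mod16=4 on platforms where unsigned int is 4 bytes.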

Docs say I should compile against c86,sm86, so will test what that does using legacy mode. Whatever that mode even does.

To my understanding, the difference between compute_* and sm_* is that the compute_ targets are for compiling the CUDA code into PTX, which is some kind of assembly code for GPUs. Basically there are different versions of assembly code specifications, and they're forward compatible with newer PTX versions (the so-called compute caps). In fact that's the reason brichard19 was able to leave the makefile at compute_35 all this time without bitcrack going to hell for everyone running a newer GPU family.

While sm_* is the specification of the binary, what we'd call in C/C++ land the "linking phase", and it absolutely has to match the compute cap for the specific GPU you're targeting in order to run on it. The GPU binaries are not forward-compatible, which means you can't run a CUDA binary built for one GPU family on any other family (unless its compute cap has the same major version and an equal or higher minor version).

For example, if you compile a CUDA program for Kepler's PTX architecture, which is what bitcrack was doing all this time, it's gonna work on Maxwell, Pascal, Volta, etc. All families after it too.

But the binary itself won't work unless your GPU family has the same major version as inside the sm_ the program was compiled against (e.g. an sm_70 Volta binary also runs on Turing 7.5, but not on Ampere 8.6 or Pascal 6.1).

If you wanted to distribute a CUDA binary that works on all of those families, you gotta pass a low compute_ version and then the sm_ versions for every family you want it to run on, e.g.

Code:
-gencode=arch=compute_35,code=[sm_35,sm_50,sm_52,sm_61,sm_75,sm_86]

to make it run on every desktop GPU starting with Kepler (This excludes compute caps for embedded Tegra GPUs in game consoles, a few datacenter Tesla GPUs, and the Titan V). Stuffing all those binaries inside a single program also turns out to make it very huge, perhaps the reason why video game installs are several GBs large.
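The compute_ versus sm_ rule described above can be written down as two small predicates. This is a host-side C++ sketch of the compatibility rule as I understand it from the CUDA docs, not an NVIDIA API; ComputeCap, cubin_runs_on and ptx_runs_on are made-up names, and it ignores driver support limits and architecture-specific targets:

```cpp
#include <cassert>

// Compute capability as major.minor, e.g. {7, 5} for Turing, {8, 6} for
// the RTX 30xx Ampere cards.
struct ComputeCap {
    int cc_major;
    int cc_minor;
};

// A binary built for sm_XY is expected to load only on devices with the
// same major version and an equal or higher minor version.
constexpr bool cubin_runs_on(ComputeCap built_for, ComputeCap device) {
    return built_for.cc_major == device.cc_major &&
           device.cc_minor >= built_for.cc_minor;
}

// PTX built for compute_XY can be JIT-compiled by the driver for any
// device of equal or higher compute capability.
constexpr bool ptx_runs_on(ComputeCap built_for, ComputeCap device) {
    return device.cc_major > built_for.cc_major ||
           (device.cc_major == built_for.cc_major &&
            device.cc_minor >= built_for.cc_minor);
}

// Examples matching the cases discussed in this thread.
inline void compat_examples() {
    assert(cubin_runs_on(ComputeCap{7, 0}, ComputeCap{7, 5}));   // Volta binary on Turing
    assert(!cubin_runs_on(ComputeCap{7, 5}, ComputeCap{8, 6}));  // Turing binary on Ampere
    assert(ptx_runs_on(ComputeCap{3, 5}, ComputeCap{8, 6}));     // Kepler PTX on Ampere
}
```

That is why a build carrying only compute_35 PTX can still start on newer cards: the driver compiles the PTX for the actual device at load time, whereas an sm_ binary is tied to its major version.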

At any rate, the blasted thing is supposed to work without passing any compute cap flags at all so maybe we're focusing on the wrong thing. I'll study the docs some more because attempting to debug without understanding CUDA won't get us anywhere.
jr. member
Activity: 36
Merit: 3

I have just installed my 3070 and giving it a go, I've compiled the CUDA version a few times but only for older cards.

I hear that I have to roll back my driver to get it working for 3070, 3080 or 3090 cards, but not sure which one.  I can't get it to start at all right now on the RTX 3070, using the driver that comes with CUDA development kit 11.2.

Aside, I think I know where to fix this, if I can just get it to work on my card so I can give it a whirl :/

You wouldn't need to roll back your drivers tho? It should run using the latest CUDA & drivers (and ofc crash due to the error), at least on windows.
It runs on legacy compatibility mode using _75 (with CUDA injector at ~500M/k/s). Yet nowhere is it documented how to compile against compatibility mode and what the effect of it even is.
https://docs.nvidia.com/cuda/ampere-compatibility-guide/

Docs say I should compile against c86,sm86, so will test what that does using legacy mode. Whatever that mode even does.

Edit:
c86,s86 makes legacy build crash. So it has something to do with Turing vs Ampere CUDA.

Edit2:
So, to view the memory you gotta enable device debugging, but when enabled, it 'works' (slow af ofc). Great.
full member
Activity: 1232
Merit: 242
Shooters Shoot...
Really wonder if someone was able to run this against compute_75 & what speed bitcrack would hit. I've been running a modified VanitySearch, doing 4.6GK/s on a single 3090. Sadly due to the 86k threads it is trying to fill, it goes out of bounds now & then (GPU/GPUCompute.h:54). Just cannot wrap my head around that funny one yet. But besides me trying to understand that & learning a lot, CUDA should be doing something near that speed on bitcrack too.

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.

Yeah, cuda on bitcrack has this interesting problem on the new drivers. Will try with line info later, was just doing a quick run of your repo.

Code:
[2021-01-19.17:31:52] [Info] Error: misaligned address
========= Misaligned Shared or Local Address
=========     at 0x0000e610 in keyFinderKernelWithDouble(int, int)
=========     by thread (160,0,0) in block (0,0,0)

Edit:
Most fascinating thing about this issue, is that it runs my full test keyspace in debug exe (400M)[ofc slow af], the release crashes on the error above.

I have just installed my 3070 and giving it a go, I've compiled the CUDA version a few times but only for older cards.

I hear that I have to roll back my driver to get it working for 3070, 3080 or 3090 cards, but not sure which one.  I can't get it to start at all right now on the RTX 3070, using the driver that comes with CUDA development kit 11.2.

Aside, I think I know where to fix this, if I can just get it to work on my card so I can give it a whirl :/
I used either 452 or 456, but I have other cards attached as well.
jr. member
Activity: 32
Merit: 4
Really wonder if someone was able to run this against compute_75 & what speed bitcrack would hit. I've been running a modified VanitySearch, doing 4.6GK/s on a single 3090. Sadly due to the 86k threads it is trying to fill, it goes out of bounds now & then (GPU/GPUCompute.h:54). Just cannot wrap my head around that funny one yet. But besides me trying to understand that & learning a lot, CUDA should be doing something near that speed on bitcrack too.

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.

Yeah, cuda on bitcrack has this interesting problem on the new drivers. Will try with line info later, was just doing a quick run of your repo.

Code:
[2021-01-19.17:31:52] [Info] Error: misaligned address
========= Misaligned Shared or Local Address
=========     at 0x0000e610 in keyFinderKernelWithDouble(int, int)
=========     by thread (160,0,0) in block (0,0,0)

Edit:
Most fascinating thing about this issue, is that it runs my full test keyspace in debug exe (400M)[ofc slow af], the release crashes on the error above.

I have just installed my 3070 and giving it a go, I've compiled the CUDA version a few times but only for older cards.

I hear that I have to roll back my driver to get it working for 3070, 3080 or 3090 cards, but not sure which one.  I can't get it to start at all right now on the RTX 3070, using the driver that comes with CUDA development kit 11.2.

Aside, I think I know where to fix this, if I can just get it to work on my card so I can give it a whirl :/
full member
Activity: 1232
Merit: 242
Shooters Shoot...
Quote
Just ends, it's not that complicated. The CUDA part is just a little too much above my understanding atm. The Bitcrack parts are easier to understand for me at least.

You are correct, not complicated at all.  Just ends...that's already been done.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Btw: when running in legacy mode (old hardware compatible), it was running fine using nsight. I’m not sure what flag that is on regular CUDA builds yet, just pressed the wrong button and was waiting for it to crash, totally didn’t. Will check tomorrow what speed that was on, could be interesting as fast-fix.

Make sure you track where the pointers that were passed to submodp were initialized from. Specifically, if you increment an array pointer by 1 or 2 or something like that in host code and then hand it over to CUDA then it will crap itself. It's too bad that CUDA doesn't have a native 256-bit unsigned type yet. Not only would that be faster but then we could avoid all this trickery to fix it.

Maybe the minimum memory alignment bytes increased for newer GPUs?

Is there a flag in nvcc that'll activate this legacy mode you're talking about? It's kind of frustrating that the code pretends to be fine when using debug flags.