Checking brainwallet | Bitcointalksearch.org

PrivatePerson

member

Activity: 174

Merit: 12

https://github.com/ryancdotorg/brainflayer
Doesn't look like it?

iceland2k14

jr. member

Activity: 37

Merit: 68

Quote from: LoyceV on May 08, 2022, 04:33:41 AM

Actually, it looks like I'm in over my head already Sad

To run in Python3, only 2 input files are needed.
BTC Address File, for collision purpose ... Line by line... Name = btc_alltypes_address.txt
The Password file which is be to checked for brainwallet association. Again line by line.... Name = BugeHugePassFile.txt
Each Password will check 4 BTC Address [Compressed, Uncompressed, P2SH, Bech32]

Quote from: ?? on ??

Looks like .dll file for Windows user, while .so file for Linux user.

Yes Exactly.

Did OP @zielar managed to reach his satisfactory Speed through some tool. Will be interesting to know.

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: iceland2k14 on May 07, 2022, 12:09:31 PM

It may look too simple/generic.

Actually, it looks like I'm in over my head already Sad

Quote from: ?? on ??

What i meant is only speed of SHA-256 hashing. AFAIK Hashcat doesn't have option to brute-force brainwallet.

So I should be able to do something like inputlist > hashcat > bitcointool > addy, and then compare the addy to the existing long list?
I need to find time for this

iceland2k14

jr. member

Activity: 37

Merit: 68

It may look too simple/generic. But it comes very handy for such kind of works....
The 3 Files required for this code can be obtained from this link https://github.com/iceland2k14/secp256k1

Minor Update: To handle Big Password File @LoyceV bigger than System RAM

Code:

import secp256k1 as ice
import time
import signal, sys


input_file_is_passw = True      # Otherwise consider input file is hexkeys
#==============================================================================
btc = [line.split()[0] for line in open("btc_alltypes_address.txt",'r')]
print(f'{"-"*60}\n Read complete for Address File. \n{"-"*60}')
btc = set(btc)

#==============================================================================
def handler(signal_received, frame):
    print('\nSIGINT or CTRL-C detected. Exiting gracefully. BYE')
    sys.exit(0)
    
def chk(line, k, P):
    ac = ice.pubkey_to_address(0, True, P)
    au = ice.pubkey_to_address(0, False, P)
    a3 = ice.pubkey_to_address(1, True, P)
    ab = ice.pubkey_to_address(2, True, P)
    if ac in btc or au in btc or a3 in btc or ab in btc:
        with open('FoundTreasure.txt', 'a') as f:
            f.write(f'PassWord = {line}    Privatekey = {hex(k)} \n')

#==============================================================================
m = 0
with open("BugeHugePassFile.txt",'r') as f:
    st = time.time()
    signal.signal(signal.SIGINT, handler)
    for line in f:
        passw = line.rstrip()
        if input_file_is_passw:
            hexkey = ice.get_sha256(bytes(passw, 'utf8')).hex()
        else:
            hexkey = passw
        kk = int(hexkey, 16)
        P = ice.scalar_multiplication(kk)
        chk(passw, kk, P)
        m += 1

        if m % 10000 == 0:
            print(f'Speed: [{m/(time.time()-st):.0f}]   {m} Line checked from File. Current line : {passw}', end='\r')
#==============================================================================
print(f'{" "*130}\r {m} Line checked from File. ', end='\r')
print(f'\n\n Work Done !!\n{"-"*60}')

Code:

> python p2.py
------------------------------------------------------------
 Read complete for Address File.
------------------------------------------------------------
Speed: [22503]   12000000 Line checked from File. Current line : 4f9ea3109cf4c4292265504599a40a27dc6a7689a149c4687848695855026393
SIGINT or CTRL-C detected. Exiting gracefully. BYE

LoyceV

legendary

Activity: 3290

Merit: 16489

Thick-Skinned Gang Leader and Golden Feather 2021

Quote from: ABCbits on May 02, 2022, 06:57:31 AM

As @vjudeu said, you used un-optimized tool. You should use optimized tool such as hashcat. Even on single core VPS, i managed to get ~5000 kH/s.

Can you walk me through this (ELI5 style)? Say I have pbies's list of 200 brainwallets, and I want to use hashcat to get the associated Bitcoin addresses. I'm feeling like a n00b here, hashcat has too many options.

Once I've figured out the above, I can start on this: the input-file with passwords I'd like to check is 444 GB and has 40 billion entries.

I have a (dedicated) server at my disposal with:

Code:

Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz
130 GB available diskspace (hdd)
16 GB RAM

CrunchyF

jr. member

Activity: 56

Merit: 26

Quote

Is it absolutely necessary to hash the inputs with SHA256, or can a different algorithm be used for the bloom filter, such that we can create a different hash type Z(inputs) = Y(SHA256(inputs)) and thus save computational time?

Of course, Z() and Y() do not need to be cryptographic, and it would be better if they weren't.

I'll need to check the academic archives if such transfusion of a hashed output has ever been explored (let alone discovered).

No absolutely not . SHA256 is far too slow for fast bloomfilter.
The best is to use murmurhash (non cryptographic hash function)

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: LoyceV on May 02, 2022, 02:05:35 AM

Quote from: PrimeNumber7 on May 01, 2022, 03:34:16 PM

you will need to use a hashing algorithm that can handle a sufficiently large number of items

I didn't expect SHA256 to be so slow. I just tested a simple bash loop, and get less than 1000 hashes per second. That won't scale well to 40 billion inputs.

Is it absolutely necessary to hash the inputs with SHA256, or can a different algorithm be used for the bloom filter, such that we can create a different hash type Z(inputs) = Y(SHA256(inputs)) and thus save computational time?

Of course, Z() and Y() do not need to be cryptographic, and it would be better if they weren't.

I'll need to check the academic archives if such transfusion of a hashed output has ever been explored (let alone discovered).

ymgve2

full member

Activity: 162

Merit: 230

Quote from: PawGo on May 04, 2022, 09:28:18 AM

Quote from: CrunchyF on May 04, 2022, 07:20:27 AM

Quote from: ymgve2 on May 04, 2022, 06:04:57 AM

Quote from: CrunchyF on May 04, 2022, 04:43:09 AM

I focused on small table, because i wanted to have big empty space for storing the biggest bloomfilter possible.

Bigger bloom filter is not necessarily better. A better approach is to have multiple smaller bloom filters, based on different parts of the hash lookup. If a 512MB bloom filter has a hit rate of 0.01, a single 1024MB bloom filter will have a hit rate of 0.005, but two independent 512MB bloom filters will have a hit rate of 0.01^2 = 0.0001

i have to experiment that. It's interesting

Indeed, I will try that too, in another project.
On the other hand, there is a number of false-positives which we must accept and test ourself.

Coming back to your previous question - I use (by default) 112 blocks (proc*4) and 384 threads, but I cannot guarantee that values are optimal.
Just right now I switched from static table (size declared on compile) to dynamically reserved, that gives possibility to create bigger tables - 24 bit is max for my 12GB card (it takes more than 60% of memory) - and 77% memory is taken by the whole program. Performance gained between 22 and 24 is not shocking, but still worth to consider.

How do you fit 24 bit tables into 8ish GB? From my calculations it would be almost 12GB (64 bytes per point, 11 different tables) - I assume you don't use compressed points, since recovering y would take too much time)

PawGo

legendary

Activity: 952

Merit: 1386

Quote from: CrunchyF on May 04, 2022, 07:20:27 AM

Quote from: ymgve2 on May 04, 2022, 06:04:57 AM

Quote from: CrunchyF on May 04, 2022, 04:43:09 AM

I focused on small table, because i wanted to have big empty space for storing the biggest bloomfilter possible.

Bigger bloom filter is not necessarily better. A better approach is to have multiple smaller bloom filters, based on different parts of the hash lookup. If a 512MB bloom filter has a hit rate of 0.01, a single 1024MB bloom filter will have a hit rate of 0.005, but two independent 512MB bloom filters will have a hit rate of 0.01^2 = 0.0001

i have to experiment that. It's interesting

Indeed, I will try that too, in another project.
On the other hand, there is a number of false-positives which we must accept and test ourself.

Coming back to your previous question - I use (by default) 112 blocks (proc*4) and 384 threads, but I cannot guarantee that values are optimal.
Just right now I switched from static table (size declared on compile) to dynamically reserved, that gives possibility to create bigger tables - 24 bit is max for my 12GB card (it takes more than 60% of memory) - and 77% memory is taken by the whole program. Performance gained between 22 and 24 is not shocking, but still worth to consider.

CrunchyF

jr. member

Activity: 56

Merit: 26

Quote from: ymgve2 on May 04, 2022, 06:04:57 AM

Quote from: CrunchyF on May 04, 2022, 04:43:09 AM

I focused on small table, because i wanted to have big empty space for storing the biggest bloomfilter possible.

Bigger bloom filter is not necessarily better. A better approach is to have multiple smaller bloom filters, based on different parts of the hash lookup. If a 512MB bloom filter has a hit rate of 0.01, a single 1024MB bloom filter will have a hit rate of 0.005, but two independent 512MB bloom filters will have a hit rate of 0.01^2 = 0.0001

i have to experiment that. It's interesting

but look here the probability of hit rate/number of hash function dont seems to follow a linear rule...
with this bloom filter calculator if you increase the number of hash function you can increase the probability of hit rate too.

https://hur.st/bloomfilter/?n=80M&p=1.0E-7&m=&k=40

So what's the difference beetween one bloomfilter of 1G with k=40 (for ex) and two bloomfilter of 512M with k=20 both?

I used this tool to find the minimum of hash function for the minimum of hit ratio at my desired # of entries and size of filter.
So i'm not sure if i cut my bf in two the probability of no-hint will be squared.

ymgve2

full member

Activity: 162

Merit: 230

Quote from: CrunchyF on May 04, 2022, 04:43:09 AM

I focused on small table, because i wanted to have big empty space for storing the biggest bloomfilter possible.

Bigger bloom filter is not necessarily better. A better approach is to have multiple smaller bloom filters, based on different parts of the hash lookup. If a 512MB bloom filter has a hit rate of 0.01, a single 1024MB bloom filter will have a hit rate of 0.005, but two independent 512MB bloom filters will have a hit rate of 0.01^2 = 0.0001

PawGo

legendary

Activity: 952

Merit: 1386

Quote from: CrunchyF on May 04, 2022, 04:43:09 AM

Yes it is my first Cuda project and i think that an optimisation is possible with the use of the different type of memory of the GPU (Global and shared memory) beacuse the time access are different.

To copy my tables in the GPU mem i use the standard function :

cudaMemcpy(b, a, ..., cudaMemcpyHostToDevice);
so i'm not quite sure in which type of GPU mem the tables are loaded.

It depends how you declare that in the device code.
Normally you write to global memory. If it is with __constant__ directive, it will be faster, but teh size must be smaller.
__shared__ is a available only for single block, so you must rewrite to that memory each time you launch kernel. It is also limited.

CrunchyF

jr. member

Activity: 56

Merit: 26

Quote from: ymgve2 on May 04, 2022, 03:47:09 AM

Interesting approach - it seems other solutions focus more on larger tables - does the speed advantage of having the tables in shared memory outweigh the disadvantage of doing 15 additions vs for example 11 additions with 22 bit tables?

Here's a free tip that might gain some speed: Instead of doing pure addition, you can do addition/subtraction from a "middle" value like 0x8000800080008000800080008000.... - you save one bit of table size, and negating a point is quite easy.

Yes it is my first Cuda project and i think that an optimisation is possible with the use of the different type of memory of the GPU (Global and shared memory) beacuse the time access are different.

To copy my tables in the GPU mem i use the standard function :

cudaMemcpy(b, a, ..., cudaMemcpyHostToDevice);
so i'm not quite sure in which type of GPU mem the tables are loaded.

I use a 16bit indexed table because it's easy to cast a 256 integer array in 16 parts with the (uint16_t *) operator and u d'ont have to code a specific function .

the cost of the split function of 256bits in 22bits chunk, is probably not negligeable.

My 2 table (x and y) have a size of 2*32MB so if u use a 22bits table u will be around (2^6)*2*32MB = 4096 MB.. half of the memory of my RTX3070!

But anyway, as u say the cost of finding a value in a 4GB table compared to a a 32MB is probably (i dont know really) not the same.

I focused on small table, because i wanted to have big empty space for storing the biggest bloomfilter possible.

the big optimisation would be in the coding of a well syncronised batch modular inversion because with the JeanLuc Pons code you are obliged to wait that evry thread of the batch finish the multiplication
But the algorithm not seems to be easy

Quote

Here's a free tip that might gain some speed: Instead of doing pure addition, you can do addition/subtraction from a "middle" value like 0x8000800080008000800080008000.... - you save one bit of table size, and negating a point is quite easy.

Interesting ... but only for big table no?

ymgve2

full member

Activity: 162

Merit: 230

Quote from: CrunchyF on May 03, 2022, 02:01:53 PM

I precompute a table of kG value of 16*65535 values with x and y indexed by

0001->FFFF
0001 0000->FFFF 0000
0001 0000 0000->FFFF 0000 0000
...

I load this 2 table one for x one for y in shared memory of the GPU at the start of the program

Interesting approach - it seems other solutions focus more on larger tables - does the speed advantage of having the tables in shared memory outweigh the disadvantage of doing 15 additions vs for example 11 additions with 22 bit tables?

Here's a free tip that might gain some speed: Instead of doing pure addition, you can do addition/subtraction from a "middle" value like 0x8000800080008000800080008000.... - you save one bit of table size, and negating a point is quite easy.

CrunchyF

jr. member

Activity: 56

Merit: 26

Quote from: PawGo on May 03, 2022, 08:04:55 AM

Now I have 67Mkeys on rtx3060 using 22bit table and 32bit version of library (164Mkeys on 3090). I plan migration to 64bit version, as since CUDA 11.6 __int128 is natively supported. The stupid problem - 128bit ints are still not supported by Visual Studio compiler, so I will try to compile & launch it in ubuntu at WSL (or just pure linux machine).

Very good performance. I think you will be faster than me on a RTX 3070

What is your grid/block config ?
what the size in GPU memory of your 22bits precomputed table?
my 16bits table have a size of 33MB
How do you split your 256bits k (in k.G) in 22bits chunks?
Do u do your final inversion in a batch or directly in each thread?

CrunchyF

jr. member

Activity: 56

Merit: 26

Quote from: ymgve2 on May 03, 2022, 08:00:32 AM

Quote from: CrunchyF on May 03, 2022, 04:12:34 AM

Quote from: PawGo on May 02, 2022, 11:50:27 AM

Quote from: ABCbits on May 02, 2022, 06:57:31 AM

Quote from: LoyceV on May 02, 2022, 02:05:35 AM

Quote from: PrimeNumber7 on May 01, 2022, 03:34:16 PM

you will need to use a hashing algorithm that can handle a sufficiently large number of items

I didn't expect SHA256 to be so slow. I just tested a simple bash loop, and get less than 1000 hashes per second. That won't scale well to 40 billion inputs.

As @vjudeu said, you used un-optimized tool. You should use optimized tool such as hashcat. Even on single core VPS, i managed to get ~5000 kH/s.

Currently I work on piece of software for someone who has lost access to his brainwallet. He knows the way how to produce the phrase, but as it contains some (unknown) number, we must check many possible phrases. The list of potential addresses is very small (just a few), so it is not a factor which slows us down - it is rather priv->pub generation. Initial sha256 and later Hash160 are very fast. The main problem would be to compare generated hash160 against huge database of addresses (& of course generation for both compressed and uncompressed keys). At this stage of project we have speed 47Mkey on RTX 3060 and a little more than 100Mkeys on 3090 - but we are progressing every day Wink

I have developped a tool like this with about the same performance than PawGo one (80 MKeys on a RTX 3070) .
The solution to check on a big list of hash160 is to create a bloomfilter.
For my experiment i build a bloomfilter of about 80Mega hash160 (typically the size of the utxos db).
the binary size is about 1.2 GB with an error rate of 1e-16. (that means that u will have a false positiv every 10e16 lookup in the bloomfilter).
You just have to load this bloomfilter in the GPU RAM to speed up the process.
The lookup in the filter is very fast compared to the SHA256("....)->EC multiplication->hash160 of the publickey process.
So with my tool u can check about 100M hash160 at the speed of 80M sha256(password candidate) on a RTX3070.

If tou well define the pattern of the brainwallet you want to find. I can help u

That's a pretty impressive speed - do you have any special tricks to make the EC multiplication fast?

UFFF ok it was a lot of work but im aware to share (open source philosophy):

this is the best cuda grid/block config i found for the RTX 3070 (48 SM cores)

#define GRID_SIZE 1024*4*5
#define BLOCK_SIZE 256

I precompute a table of kG value of 16*65535 values with x and y indexed by

0001->FFFF
0001 0000->FFFF 0000
0001 0000 0000->FFFF 0000 0000
...

I load this 2 table one for x one for y in shared memory of the GPU at the start of the program

so for a 256 scalar multiplication you just have to decompose your 256bits k (in k.G) in 16 parts of 16 bits and do 16 sequentials points addition
I do each the addition in jacobian coordinate given by the relation (x,y,1) + (x',y',z') (the first term is from the EC table and the second term is the intermediate result of the addition process)
By this way you avoid 16 slow inversion (i think this is the most speed up trick)

The final inversion is done with the Fast Modular Inversion (Delayed Right Shift 62 bits) like in the Jean-Luc Pons Kangaroo program.
https://github.com/JeanLucPons/Kangaroo

PawGo

legendary

Activity: 952

Merit: 1386

Now I have 67Mkeys on rtx3060 using 22bit table and 32bit version of library (164Mkeys on 3090). I plan migration to 64bit version, as since CUDA 11.6 __int128 is natively supported. The stupid problem - 128bit ints are still not supported by Visual Studio compiler, so I will try to compile & launch it in ubuntu at WSL (or just pure linux machine).

ymgve2

full member

Activity: 162

Merit: 230

Quote from: CrunchyF on May 03, 2022, 04:12:34 AM

Quote from: PawGo on May 02, 2022, 11:50:27 AM

Quote from: ABCbits on May 02, 2022, 06:57:31 AM

Quote from: LoyceV on May 02, 2022, 02:05:35 AM

Quote from: PrimeNumber7 on May 01, 2022, 03:34:16 PM

you will need to use a hashing algorithm that can handle a sufficiently large number of items

I didn't expect SHA256 to be so slow. I just tested a simple bash loop, and get less than 1000 hashes per second. That won't scale well to 40 billion inputs.

As @vjudeu said, you used un-optimized tool. You should use optimized tool such as hashcat. Even on single core VPS, i managed to get ~5000 kH/s.

Currently I work on piece of software for someone who has lost access to his brainwallet. He knows the way how to produce the phrase, but as it contains some (unknown) number, we must check many possible phrases. The list of potential addresses is very small (just a few), so it is not a factor which slows us down - it is rather priv->pub generation. Initial sha256 and later Hash160 are very fast. The main problem would be to compare generated hash160 against huge database of addresses (& of course generation for both compressed and uncompressed keys). At this stage of project we have speed 47Mkey on RTX 3060 and a little more than 100Mkeys on 3090 - but we are progressing every day Wink

I have developped a tool like this with about the same performance than PawGo one (80 MKeys on a RTX 3070) .
The solution to check on a big list of hash160 is to create a bloomfilter.
For my experiment i build a bloomfilter of about 80Mega hash160 (typically the size of the utxos db).
the binary size is about 1.2 GB with an error rate of 1e-16. (that means that u will have a false positiv every 10e16 lookup in the bloomfilter).
You just have to load this bloomfilter in the GPU RAM to speed up the process.
The lookup in the filter is very fast compared to the SHA256("....)->EC multiplication->hash160 of the publickey process.
So with my tool u can check about 100M hash160 at the speed of 80M sha256(password candidate) on a RTX3070.

If tou well define the pattern of the brainwallet you want to find. I can help u

That's a pretty impressive speed - do you have any special tricks to make the EC multiplication fast?

PrimeNumber7

copper member

Activity: 1666

Merit: 1901

Amazon Prime Member #7

Quote from: LoyceV on May 02, 2022, 02:05:35 AM

Quote from: PrimeNumber7 on May 01, 2022, 03:34:16 PM

you will need to use a hashing algorithm that can handle a sufficiently large number of items

I didn't expect SHA256 to be so slow. I just tested a simple bash loop, and get less than 1000 hashes per second. That won't scale well to 40 billion inputs.

When creating a set, you generally do not want to use a hashing algorithm that will result in zero collisions for any number of inputs.

A general solution is to use a linked-list when there are collisions. This will result in a worse big O notation, however, it should improve actual runtime.

CrunchyF

jr. member

Activity: 56

Merit: 26

Quote from: PawGo on May 02, 2022, 11:50:27 AM

Quote from: ABCbits on May 02, 2022, 06:57:31 AM

Quote from: LoyceV on May 02, 2022, 02:05:35 AM

Quote from: PrimeNumber7 on May 01, 2022, 03:34:16 PM

you will need to use a hashing algorithm that can handle a sufficiently large number of items

I didn't expect SHA256 to be so slow. I just tested a simple bash loop, and get less than 1000 hashes per second. That won't scale well to 40 billion inputs.

As @vjudeu said, you used un-optimized tool. You should use optimized tool such as hashcat. Even on single core VPS, i managed to get ~5000 kH/s.

Currently I work on piece of software for someone who has lost access to his brainwallet. He knows the way how to produce the phrase, but as it contains some (unknown) number, we must check many possible phrases. The list of potential addresses is very small (just a few), so it is not a factor which slows us down - it is rather priv->pub generation. Initial sha256 and later Hash160 are very fast. The main problem would be to compare generated hash160 against huge database of addresses (& of course generation for both compressed and uncompressed keys). At this stage of project we have speed 47Mkey on RTX 3060 and a little more than 100Mkeys on 3090 - but we are progressing every day Wink

I have developped a tool like this with about the same performance than PawGo one (80 MKeys on a RTX 3070) .
The solution to check on a big list of hash160 is to create a bloomfilter.
For my experiment i build a bloomfilter of about 80Mega hash160 (typically the size of the utxos db).
the binary size is about 1.2 GB with an error rate of 1e-16. (that means that u will have a false positiv every 10e16 lookup in the bloomfilter).
You just have to load this bloomfilter in the GPU RAM to speed up the process.
The lookup in the filter is very fast compared to the SHA256("....)->EC multiplication->hash160 of the publickey process.
So with my tool u can check about 100M hash160 at the speed of 80M sha256(password candidate) on a RTX3070.

If tou well define the pattern of the brainwallet you want to find. I can help u

Topic: Checking brainwallet (Read 904 times)