
Topic: VanitySearch (Yet another address prefix finder) - page 52. (Read 32072 times)

sr. member
Activity: 462
Merit: 701
Hi, I've just downloaded the VanitySearch Master, it works perfectly if I add "volatile" in this piece of code:

OK, which release of gcc are you using for compiling VanitySearch (not the CUDA code) ?
legendary
Activity: 1932
Merit: 2077
Hi, I've just downloaded the VanitySearch Master, it works perfectly if I add "volatile" in this piece of code:

Code:
void Int::ModSquareK1(Int *a) {

#ifndef WIN64
#if __GNUC__ <= 6
  #warning "GCC less than 7 detected, upgrade gcc to get best performance"
  volatile unsigned char c;   <--
#else
  volatile unsigned char c;  <--
#endif
#else
  unsigned char c;
#endif
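The `c` byte above receives the hardware carry flag from the inline-asm adc chain, which is why the affected gcc releases need it declared `volatile` (so reads and writes of the carry byte are not optimized away). As a portable sketch of the same add-with-carry pattern, with no asm (the `add256` helper and the 4x64-bit limb layout are assumptions for this illustration, not the project's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: a 4x64-bit (256-bit) add-with-carry chain in plain C.
   In the real code the carry travels through adc instructions and
   lands in the (volatile) unsigned char c; here unsigned __int128
   stands in for the hardware carry flag. */
static unsigned char add256(uint64_t r[4], const uint64_t a[4],
                            const uint64_t b[4]) {
  unsigned char c = 0;                 /* plays the role of the carry flag */
  for (int i = 0; i < 4; i++) {
    unsigned __int128 t = (unsigned __int128)a[i] + b[i] + c;
    r[i] = (uint64_t)t;                /* low 64 bits of this limb */
    c = (unsigned char)(t >> 64);      /* carry out into the next limb */
  }
  return c;                           /* carry out of the whole 256-bit add */
}
```

The asm version exists because compilers cannot always keep such a chain in the carry flag by themselves; `volatile` apparently stops the affected gcc versions from mis-optimizing the byte that holds it.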
sr. member
Activity: 462
Merit: 701
Yes, today the default is to free only one core when a GPU is enabled; I will change this to the number of GPUs.
donator
Activity: 4760
Merit: 4323
Leading Crypto Sports Betting & Casino Platform
Yes, this is because with -t 8 your CPU becomes a bottleneck and cannot handle the GPU/CPU exchange.
When the GPU key rate is good, it is generally better to free 1 CPU core per GPU.

I think most users with newer GPUs would benefit from the power efficiency gains of running with -t 0. I would even argue that should be the default when a GPU is detected instead of the other way around where you have to enable GPUs.

Edit:  I'd also like to see the version number shown with the startup information.
sr. member
Activity: 462
Merit: 701
Do you recognize this crash error?

No, I never experienced this crash. Thanks for the info ;)

Is this included in your roadmap?

Salut ;)
I'm not yet familiar with P2SH addresses; I have to study them in detail. Maybe for 1-of-1 multisig P2SH.

Nice work anyway!

Thanks ;)

It is very strange that the process is slower with it than without it.

Yes, this is because with -t 8 your CPU becomes a bottleneck and cannot handle the GPU/CPU exchange.
When the GPU key rate is good, it is generally better to free 1 CPU core per GPU.

Jean_Luc, thank you for your hard work. If execution is interrupted, does VanitySearch save its state so a search can be resumed?

If you are using a passphrase and want to restart a search, you have to change your passphrase (1 character is enough), otherwise you will recompute exactly the same thing. If you're using the default random seed, the seed will change, so you won't recompute the same thing; no need to save anything.
But I recommend using a passphrase in order to generate safe private keys.

newbie
Activity: 7
Merit: 1
Win10, Cuda 10
i7 3700k, 8 Gb RAM

Code:
vanitysearch -stop -t 0 -gpu -gpuId 0 -i input_addres.txt -o output_file.txt
Search: 1Testtttt [Compressed]
Start Sun Mar 24 17:22:35 2019
Base Key:E50C09A69B313FCC6480B3390C47BBD55D6FFFEEBBC36D3881E011AE0330275
Number of CPU thread: 0
GPU: GPU #0 GeForce GTX 1080 Ti (28x128 cores) Grid(224x128)
967.926 MK/s (GPU 967.926 MK/s) (2^32.44) [P 0.00%][50.00% in 24.9d][0]0]

Code:
vanitysearch -stop -t 8 -gpu -gpuId 0 -i input_addres.txt -o output_file.txt
Difficulty: 2988734397852221
Search: 1Testtttt [Compressed]
Start Sun Mar 24 17:26:34 2019
Base Key:912441F08928FCEF7B5D6F9A1232221AF9FF3F6E653586F9146625C436060099
Number of CPU thread: 8
GPU: GPU #0 GeForce GTX 1080 Ti (28x128 cores) Grid(224x128)
914.418 MK/s (GPU 896.216 MK/s) (2^33.38) [P 0.00%][50.00% in 26.3d][0]0]

It is very strange that the process is slower with it than without it.

Jean_Luc, thank you for your hard work. If execution is interrupted, does VanitySearch save its state so a search can be resumed?

legendary
Activity: 1484
Merit: 1491
I forgot more than you will ever know.
Salut Jean-Luc :)

Do you plan to add support for P2SH (segwit, starting with 3) addresses to your tool anytime soon? That would be nice to have.

For instance, this project by nullios implemented both P2SH and bech32 addresses.

Is this included in your roadmap?

Nice work anyway!
jr. member
Activity: 38
Merit: 18
Here is the code of oclvanitygen to perform an addition with carry:
Code:
#define bn_addc_word(r, a, b, t, c) do { \
t = a + b + c; \
c = (t < a) ? 1 : ((c & (t == a)) ? 1 : 0); \
r = t; \
} while (0)

This code may have a problem, look:

post moved to https://bitcointalksearch.org/topic/m.52110068
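For what it's worth, the macro's carry logic can be checked against 128-bit reference arithmetic (a sketch only; `addc_matches` is a name invented for this check, and it assumes the carry-in is 0 or 1, which is the macro's contract):

```c
#include <assert.h>
#include <stdint.h>

/* The oclvanitygen macro under test, verbatim. */
#define bn_addc_word(r, a, b, t, c) do { \
  t = a + b + c; \
  c = (t < a) ? 1 : ((c & (t == a)) ? 1 : 0); \
  r = t; \
} while (0)

/* Returns 1 when the macro agrees with a 128-bit add for one input triple. */
static int addc_matches(uint64_t a, uint64_t b, uint64_t cin) {
  uint64_t r, t, c = cin;
  bn_addc_word(r, a, b, t, c);
  unsigned __int128 ref = (unsigned __int128)a + b + cin;
  return r == (uint64_t)ref && c == (uint64_t)(ref >> 64);
}
```

On the edge cases tried here the macro agrees with the wide add: the `(c & (t == a))` term covers exactly the wrap-around case a + b + 1 == a (mod 2^64).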
sr. member
Activity: 462
Merit: 701
For the moment only on Linux, but it seems to me that jean_luc is trying, or will try, to adapt it for Windows as well... it is more difficult than for Linux, I think.

Yes, on Windows there is no way to set up the CUDA 8.0 SDK if a recent compiler (VC2017) is installed, even if the right one (VC2013) is also installed. The SDK setup fails. So the only solution is to start from a fresh install without VC2017.

Also, any chance of OpenCL, or is it going to be CUDA only?
Thanks,
Dave

The problem with OpenCL is that I don't know how to access the carry flag or how to perform a wide 64-bit multiplication (i64 x i64 => i128).

For instance:

Here is the code of oclvanitygen to perform an addition with carry:

Code:
#define bn_addc_word(r, a, b, t, c) do { \
t = a + b + c; \
c = (t < a) ? 1 : ((c & (t == a)) ? 1 : 0); \
r = t; \
} while (0)

This can be reduced to a single adc instruction with CUDA (and also with Visual C++, gcc, etc.)!
Some OpenCL driver compilers are smart enough to understand this code and reduce it to a single adc instruction, but not all!

For the wide 64-bit multiplication (i64 x i64 => i128), CUDA offers the needed instructions (mul.lo.u64 and mul.hi.u64), but with OpenCL it seems the only way is to use 32-bit integers and perform the multiplication with 64-bit intermediates (i32 x i32 => i64).

If an OpenCL expert knows how to perform this efficiently, it would be great.
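The 32-bit decomposition described above can be sketched in plain C (`mul_hi_u64` is a name chosen for this illustration; it computes what CUDA's mul.hi.u64 returns, using only 32x32->64 products):

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of a 64x64->128 multiply using only 32x32->64 products
   (what mul.hi.u64 yields directly in CUDA PTX). */
static uint64_t mul_hi_u64(uint64_t a, uint64_t b) {
  uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
  uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

  uint64_t p0 = a_lo * b_lo;                     /* bits   0..63  */
  uint64_t p1 = a_lo * b_hi;                     /* bits  32..95  */
  uint64_t p2 = a_hi * b_lo;                     /* bits  32..95  */
  uint64_t p3 = a_hi * b_hi;                     /* bits  64..127 */

  /* carry out of the middle 32-bit column */
  uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
  return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

OpenCL does define a mul_hi() built-in, but how well a given driver maps it to hardware varies; the portable fallback is exactly the four partial products above.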

legendary
Activity: 3500
Merit: 6320
Crypto Swap Exchange
Also, any chance of OpenCL, or is it going to be CUDA only?
Thanks,
Dave
member
Activity: 117
Merit: 32
Another report from a user using CUDA 8 and gcc 4.8 on a GeForce GTX 460: it works.


Does CUDA 8 work on Windows or only Linux?

-Dave
For the moment only on Linux, but it seems to me that jean_luc is trying, or will try, to adapt it for Windows as well... it is more difficult than for Linux, I think.
legendary
Activity: 3500
Merit: 6320
Crypto Swap Exchange
Another report from a user using CUDA 8 and gcc 4.8 on a GeForce GTX 460: it works.


Does CUDA 8 work on Windows or only Linux?

-Dave
legendary
Activity: 1932
Merit: 2077
How is it possible??

Found by the CPU? Try with -t 0...

Ok! Mystery solved! :)
sr. member
Activity: 462
Merit: 701
How is it possible??

Found by the CPU? Try with -t 0...
legendary
Activity: 1932
Merit: 2077
Very strange error.

If I modify the function __device__ void _ModMult(uint64_t *r, uint64_t *a, uint64_t *b) in any way, for example like this:

Code:
  // Reduce from 320 to 256 
  UADD1(t[4],0ULL);
  UMULLO(al,t[4], 0x1000003D1ULL);
  UMULHI(ah,t[4], 0x1000003D1ULL);
  UADDO(r[0],r512[0], al);
  UADDC(r[1],r512[1], ah);
  UADDC(r[2],r512[2], 0ULL);
  UADD(r[3],r512[3], 0ULL);

  UADD1(r[3],0x07ULL);  <-- error!!!

I get all errors, as expected, with the check option:

Code:
CPU found 1539 items
GPU: point   correct [0/271]
GPU: endo #1 correct [0/248]
GPU: endo #2 correct [0/260]
GPU: sym/point   correct [0/255]
GPU: sym/endo #1 correct [0/265]
GPU: sym/endo #2 correct [0/240]
GPU/CPU check Failed !

but I nevertheless get the correct result with the standard command:

Code:
~/VanitySearch$ ./VanitySearch -stop -t 7 -gpu 1111
Difficulty: 16777216
Search: 1111 [Compressed]
Start Sat Mar 23 18:39:22 2019
Base Key:12FF1E3D528DC8068438E8ED181E1F2505E877A7543869B0B38E500F5FA284F9
Number of CPU thread: 7
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(64x128)

Pub Addr: 1111Cf8ucVbgUtANTRGwQsWVpXVZvqFT6
Prv Addr: 5HxepgskWZ53AokCCvk8d1ZZGinupSX4Sm7tNQygZ9zQpkftRQJ
Prv Key : 0x12FF1E3D528DC8068438E8ED181E1F2505E877A7543869B5B38E500F5FA4D5D3
Check   : 1DFm6mzxxKqFo9bysKC9x1TxEz5Z9d9uAb
Check   : 1111Cf8ucVbgUtANTRGwQsWVpXVZvqFT6 (comp)

How is it possible??
sr. member
Activity: 462
Merit: 701
Then we have the classic security problem of using a pseudo-random seed. Alarm!
Fix it quickly: switch to /dev/urandom.

As written in the readme, for safe keys it is recommended to use a passphrase with the -s option (as for BIP38).
Concerning the default seed, pbkdf2_hmac_sha512(date + uptime in µs): here we search for prefixes, which means a seed-search attack might work only on very short prefixes and would require very competitive and expensive hardware.

YES! Moreover, I guarantee you that Montgomery multiplication is a source of slowness, especially on GPU.
...

As written in the readme, VanitySearch now uses a 2-step folding modular multiplication optimized for the SecpK1 prime.
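For readers unfamiliar with the folding trick: since the SecpK1 prime is 2^256 - 0x1000003D1, the part of a product that overflows 2^256 can be multiplied by 0x1000003D1 and added back in. A shrunken 64-bit analogue (illustration only: this p is not claimed to be prime, and the loop is written for clarity where real code bounds it to two folds plus a final correction):

```c
#include <assert.h>
#include <stdint.h>

/* Toy analogue of the 2-step folding reduction. For p = 2^64 - K,
   x = hi*2^64 + lo == hi*K + lo (mod p). secp256k1 applies the same
   idea with p = 2^256 - 0x1000003D1; here everything is shrunk to
   64 bits purely for illustration (this p is not claimed prime). */
#define K   0x1000003D1ULL
#define P64 (0ULL - K)                  /* 2^64 - K */

static uint64_t fold_mod(unsigned __int128 x) {
  while (x >> 64)                       /* fold the overflow back down */
    x = (unsigned __int128)(uint64_t)(x >> 64) * K + (uint64_t)x;
  uint64_t r = (uint64_t)x;
  while (r >= P64) r -= P64;            /* final conditional subtract */
  return r;
}
```

The CUDA snippet quoted earlier in the thread ("Reduce from 320 to 256") is the 256-bit version of the same step.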
jr. member
Activity: 38
Merit: 18
...If you don't specify the seed, the base key is generated using a timestamp (in µs) plus the date, passed into pbkdf2_hmac_sha512.
The result of pbkdf2_hmac_sha512 is then passed into SHA256, which is used as the base key.
Then we have the classic security problem of using a pseudo-random seed. Alarm!
Fix it quickly: switch to /dev/urandom.
And a human is not a source of truly random numbers.
(You can look up when the ability for the user to set the starting random seed was removed from Electrum.)
This is a useful option for your program, but only when the user understands what he is doing.

...
Edit:
You also may have noticed that I have an innovative implementation of modular inversion (DRS62) which is almost 2 times faster than the Montgomery one. Some benchmarks and comments are available in IntMod.cpp.
...2) the field multiplication a*b = c mod p; why do you use Montgomery, are you sure it is worth it?
YES! Moreover, I guarantee you that Montgomery multiplication is a source of slowness, especially on GPU.
Why? Because the loops in the algorithm necessarily contain conditional branches (IF clauses), which greatly hurts parallelism (warp divergence) on GPU.
(I would keep silent if only the CPU were used - no effect there.)
Why Montgomery at all? I see 2 variants:
 1) historically, the Montgomery algorithm is optimal thanks to the simplicity of its multiplication (and, most importantly, division!) by word shifts in base 2;
(and that is true! but not for GPU!)
 2) legacy from vanitygen:
when samr7 wrote vanitygen (2010-2012), he used OpenSSL, and OpenSSL used Montgomery always, or almost always, for generality.
For 4096-bit multiplication it is needed, but we have 256 bits! I don't have exact data on how much we gain by using Montgomery at 256 bits compared to the classic (schoolbook?) method, but I think about 15%.
However, each IF in the loop kills speed by HALF! In Montgomery there are 1 or 2 (depending on the variant of the algorithm).
(I would keep silent if we were counting points on a curve secp4096k1.)

When you tell me "I used Montgomery to multiply 256 bits!", it is the same as "I used Furer's algorithm / Schonhage-Strassen to multiply 256 bits!". What can I say, you're good! Cool, but why?)
I understand your mathematician's ego. Your vanity does not allow you to multiply by schoolbook or Karatsuba :D
Have you improved Montgomery multiplication? Well done! I am sure there are few people in the world who can do this.
But please make this program correct. With all sincere respect for you and your work!

Hmm.. a joke:
Hi, Santa!
We have VanityGen on universal but slooow OpenSSL, BitCrack on libsecp256k1 (plus a miserly comment about Fermat's theorem), and VanitySearch on libArulberosECC optimized with (fast?) Montgomery.
Please make it so that, for once, someone does it right! :D
(+ the raw, non-working ec_grind by sipa)

When will there be an OpenCL implementation? :)
I suppose never - I mean at normal speed.
You know what happened with clBitCrack?
brichard19 promised a release in a couple of weeks; after two months something finally appeared, but several times slower than the CUDA version.
And so it remains. What happened? brichard19 can't program? No, I don't think so - he is competent.
The CUDA compiler's optimizer has simply moved far ahead. NVIDIA wins, it's a fact.
Look for example: https://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf

On the other hand... the new John the Ripper 1.9 release:
https://www.openwall.com/lists/announce/2019/05/14/1
Quote
The release of the new version 1.9.0-jumbo-1 took place more than four years after the release of the previous 1.8.0-jumbo-1.

In the new version of John the Ripper, the developers abandoned the CUDA architecture due to decreased interest in it and focused on the more portable OpenCL framework, which works great on NVIDIA graphics cards.
CUDA vs OpenCL?.. unclear
sr. member
Activity: 462
Merit: 701
Another report from a user using CUDA 8 and gcc 4.8 on a GeForce GTX 460: it works.
sr. member
Activity: 462
Merit: 701
Don't worry, CUDA 8 needs g++ 4.9; that's the problem.

I use g++ 4.8/CUDA 8 with my old Quadro and it works.

About performance, I think most people use only compressed addresses.

If you do a specific ComputeKeys for only compressed keys (don't compute y at all!):

Yes you're right, I will make a second kernel optimized for compressed addresses only.
legendary
Activity: 1932
Merit: 2077
Yes, I already did it.

It will drive me crazy.
It works on my 2 configs, and a user on github just posted a report on a GeForce GTX 1080 Ti (ccap=6.1) running Ubuntu 18.04: it works fine (he uses CUDA 10).

Don't worry, CUDA 8 needs g++ 4.9; that's the problem.


About performance, I think most people use only compressed addresses.

If you do a specific ComputeKeys for only compressed keys (don't compute y at all!):

Code:
    for (uint32_t i = 0; i < HSIZE; i++) {

      // P = StartPoint + i*G
      Load256(px, sx);
      Load256(py, sy);
      ModSub256(dy, Gy[i], py);

      _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);

      ModSub256(px, _p2,px);
      ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
      /*
      ModSub256(py, Gx[i], px);
      _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      ModSub256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  
      */
      CHECK_PREFIX(GRP_SIZE / 2 + (i + 1));
      
      // P = StartPoint - i*G, if (x,y) = i*G then (x,-y) = -i*G
      Load256(px, sx);
      Load256(py, sy);
      //ModNeg256(dy,Gy[i]);
      //ModSub256(dy, py);
      ModSub256(dy, pyn, Gy[i]);

      _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);

      ModSub256(px, _p2, px);
      ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
      /*
      ModSub256(py, Gx[i], px);
      _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      
      ModAdd256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  

      //ModSub256(py, sx, px);
      //_ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      //ModSub256(py, sy);
      */
      CHECK_PREFIX(GRP_SIZE / 2 - (i + 1));

    }
    
    // First point (startP - (GRP_SIZE/2)*G)
    Load256(px, sx);
    Load256(py, sy);
    ModNeg256(dy, Gy[i]);
    ModSub256(dy, py);

    _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
    //_ModMult(_p2, _s, _s);        // _p = pow2(s)
    _ModSqr(_p2, _s);

    ModSub256(px, _p2, px);
    ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
    /*
    ModSub256(py, Gx[i], px);
    _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
    
    ModAdd256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  
    */
    CHECK_PREFIX(0);

    i++;

    // Next start point (startP + GRP_SIZE*G)
    Load256(px, sx);
    Load256(py, sy);
    ModSub256(dy, _2Gny, py);

    _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
    //_ModMult(_p2, _s, _s);        // _p = pow2(s)
    _ModSqr(_p2, _s);

    ModSub256(px, _p2, px);
    ModSub256(px, _2Gnx);         // px = pow2(s) - p1.x - p2.x;

    ModSub256(py, _2Gnx, px);
    _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
    //_ModSqr(py, _s);
    ModSub256(py, _2Gny);         // py = - p2.y - s*(ret.x-p2.x);  

    Load256(sx, px);
    Load256(sy, py);

  }

  // Update starting point
  __syncthreads();
  Store256A(startx, sx);

you can save time. Then: SHA256("02"+x) and SHA256("03"+x) (without worrying about the y value).

On my system I got about an 8% performance increase.

Obviously, at the end you have to do a double check to determine whether the correct private key for the found address is k or n-k, but only for the address actually found.
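The k / n-k ambiguity can be demonstrated on a toy curve (illustration only, NOT secp256k1: a tiny y^2 = x^3 + 7 group over F_97 with naive affine arithmetic; every name here is invented for the example). Because (n-k)*G = -(k*G) = (x, p-y), both private keys yield the same x and therefore the same compressed candidates 02|x and 03|x:

```c
#include <assert.h>
#include <stdint.h>

/* Toy curve y^2 = x^3 + 7 over F_97 (NOT secp256k1, illustration only).
   (0,0) is not on the curve, so it safely encodes the point at infinity. */
#define P 97

static long modp(long v) { v %= P; return v < 0 ? v + P : v; }

static long inv(long a) {               /* Fermat inverse: a^(P-2) mod P */
  long r = 1, e = P - 2;
  a = modp(a);
  while (e) { if (e & 1) r = r * a % P; a = a * a % P; e >>= 1; }
  return r;
}

/* Affine point addition, handling identity, negation and doubling. */
static void ecadd(long *rx, long *ry, long ax, long ay, long bx, long by) {
  if (ax == 0 && ay == 0) { *rx = bx; *ry = by; return; }
  if (bx == 0 && by == 0) { *rx = ax; *ry = ay; return; }
  if (ax == bx && modp(ay + by) == 0) { *rx = 0; *ry = 0; return; }
  long s = (ax == bx && ay == by)
    ? modp(3 * ax % P * ax % P * inv(2 * ay))  /* tangent slope */
    : modp((by - ay) * inv(bx - ax));          /* chord slope   */
  *rx = modp(s * s - ax - bx);
  *ry = modp(s * (ax - *rx) - ay);
}

/* Scalar multiplication by repeated addition (fine at this size). */
static void ecmul(long *rx, long *ry, long k, long gx, long gy) {
  *rx = 0; *ry = 0;                     /* start at the identity */
  for (long i = 0; i < k; i++) ecadd(rx, ry, *rx, *ry, gx, gy);
}
```

So after an x-only match, one extra scalar multiplication (or a y-parity check) decides between k and n-k, exactly as described above.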