VanitySearch (Yet another address prefix finder) - page 54.

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: Telariust on March 23, 2019, 11:13:11 AM

Then we have the classic security problem of using pseudo-random seed. Alarm!
fix it faster to /dev/urandom

As written is the readme, for safe keys it is recommenced to use a passphrase using -s option (as for BIP38).
Concerning the default seed pbkdf2_hmac_sha512(date + uptime in us) , here we search for prefix, which means that a seed search attack might work on very short prefix and would require a very competitive and expensive hardware.

Quote from: Telariust on March 23, 2019, 11:13:11 AM

YES! moreover, I guarantee you that the mult of montgomery is a source of slow, especially for GPU.
...

As written is the readme, VanitySearch use now a 2 step folding modular multiplication optimized for SecpK1 prime.

Telariust

jr. member

Activity: 38

Merit: 18

Quote from: Jean_Luc on March 04, 2019, 10:06:09 AM

...If you don't specifie the seed, the basekey is generated using timestamps (in us) plus the date and also passed into the pbkdf2_hmac_sha512.
The result of the pbkdf2_hmac_sha512 is then passed into a SHA256 wich is use as the base key.

Then we have the classic security problem of using pseudo-random seed. Alarm!
fix it faster to /dev/urandom
and human is not a source of truly random numbers.
(you can search when the possibility to set start rand seed by the user has been removed from the electrum)
This is a useful option for your program, but only when the user understands what he is doing.

Quote from: arulbero on February 28, 2019, 01:27:37 PM

Quote from: Jean_Luc on February 28, 2019, 03:10:41 AM

...
Edit:
You also may have noticed that I have an innovative implementation of modular inversion (DRS62) which is almost 2 times faster than the Montgomery one. Some benchmark and comments are available in IntMop.cpp.

...2) the field multiplication a*b = c mod p ; why do you use Montgomery, are you sure it is worth it?

YES! moreover, I guarantee you that the mult of montgomery is a source of slow, especially for GPU.
Why? because the cycles in the algorithm necessarily contain conditional transitions (IF/cause), which greatly affects the parallelism(WARP) on GPU.
(i would keep silent if only cpu were used - no affect)
Why f montgomery? I see 2variants:
1) historically - the Montgomery algorithm is optimal due to the simplicity of mult (and most importantly - division!) bit word shift on base2;
(and it - true! but not for gpu!)
2) legacy vanitygen
when samir7 wrote vanitigen (2010-2012), he used openssl. openssl used montgomery always or almost for universality.
for mult 4096 bits - is need. but we have 256 bits! I don’t have exact data, how much we win using montgomery at 256 compared to classic(column?), but I think about 15%.
However, each IF in the cycle kill speed by HALF! In Montgomery there is 1 or 2 (depending on modify algorithm).
(i would keep silent if or we counted points on a curve secp4096k1)

When you tell me, "I used Montgomery to mult 256bit!", this is the same as "I used the Furers algorithm/Schonhage-Strassen to mult 256bit!". What can I say, you're good! Cool

but why?)
I understand your ego math. Your vanity does not allow you to multiply by a Column or Karatsuba Grin

Have you improved the multiplication of montgomery? You are well done! I am sure that there are few people in the world who can do this.
But please make this program as correct. With all sincere respect for you and your work!

Hmm.. some joke:
Hi, Santa!
We have VanityGen on universal slooow openssl, Bitcrack on libsecp256k1 + miser comm about TheoremFermi, and VanitySearch on libArulberosECC optimize with (fast?)montgomery.
Please, do it so that once they do it right! Grin

(+and raw non-working ec_grind by sipa)

Quote from: arulbero on March 01, 2019, 12:06:37 PM

When an OpenCL implementation?

I suppose never. I mean normal speed.
You know what happened with version clBitCrack?
Brichard19 promised to release in a couple of weeks, after two months something finally appeared, but at times slower than cuda.
And so far it is. What happened? Brichard19 can not program? No I do not think so, he norm.
Just optimizing the compiler Cuda has gone a lot ahead. ~~Dire~~Nvidia win, its fact.
Look for example: https://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf

on the other hand... new release JohnTheRipper 1.9
https://www.openwall.com/lists/announce/2019/05/14/1

Quote

The release of the new version 1.9.0-jumbo-1 took place more than four years after the release of the previous 1.8.0-jumbo-1.

In the new version of John the Ripper, developers abandoned the CUDA architecture due to a decrease in interest in it and focused on the more portable OpenCL framework that works great on NVIDIA graphics cards.

cuda vs opencl?.. unclear

Jean_Luc

sr. member

Activity: 462

Merit: 701

An other report from a user using CUDA 8 and gcc 4.8 on a GeForce GTX 460. It works.

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: arulbero on March 22, 2019, 08:32:23 AM

Don't worry, cuda 8 needs g++ 4.9, that's the problem.

I use g++ 4.8/CUDA 8 with my old Quadro and it works.

Quote from: arulbero on March 22, 2019, 08:32:23 AM

About the performance, I think most of the people use only compressed addresses.

If you do a specific ComputeKeys for only compressed keys (don't compute y at all!):

Yes you're right, I will make a second kernel optimized for compressed addresses only.

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on March 22, 2019, 08:29:48 AM

Quote from: arulbero on March 22, 2019, 08:20:43 AM

Yes, I already did it.

It will make me crazy.
It works on my 2 configs and a user on github just post a report on a GeForce GTX 1080 Ti (ccap=6.1) running on Ubuntu 18.04 and it works fine (he uses CUDA10).

Don't worry, cuda 8 needs g++ 4.9, that's the problem.

About the performance, I think most of the people use only compressed addresses.

If you do a specific ComputeKeys for only compressed keys (don't compute y at all!):

Code:

    for (uint32_t i = 0; i < HSIZE; i++) {

      // P = StartPoint + i*G
      Load256(px, sx);
      Load256(py, sy);
      ModSub256(dy, Gy[i], py);

      _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);

      ModSub256(px, _p2,px);
      ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
      /*
      ModSub256(py, Gx[i], px);
      _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      ModSub256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  
      */
      CHECK_PREFIX(GRP_SIZE / 2 + (i + 1));
      
      // P = StartPoint - i*G, if (x,y) = i*G then (x,-y) = -i*G
      Load256(px, sx);
      Load256(py, sy);
      //ModNeg256(dy,Gy[i]);
      //ModSub256(dy, py);
      ModSub256(dy, pyn, Gy[i]);

      _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);

      ModSub256(px, _p2, px);
      ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
      /*
      ModSub256(py, Gx[i], px);
      _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      
      ModAdd256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  

      //ModSub256(py, sx, px);
      //_ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      //ModSub256(py, sy); 
      */
      CHECK_PREFIX(GRP_SIZE / 2 - (i + 1));

    }
    
    // First point (startP - (GRP_SZIE/2)*G)
    Load256(px, sx);
    Load256(py, sy);
    ModNeg256(dy, Gy[i]);
    ModSub256(dy, py);

    _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
    //_ModMult(_p2, _s, _s);        // _p = pow2(s)
    _ModSqr(_p2, _s);

    ModSub256(px, _p2, px);
    ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
    /*
    ModSub256(py, Gx[i], px);
    _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
    
    ModAdd256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  
    */
    CHECK_PREFIX(0);

    i++;

    // Next start point (startP + GRP_SIZE*G)
    Load256(px, sx);
    Load256(py, sy);
    ModSub256(dy, _2Gny, py);

    _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
    //_ModMult(_p2, _s, _s);        // _p = pow2(s)
    _ModSqr(_p2, _s);

    ModSub256(px, _p2, px);
    ModSub256(px, _2Gnx);         // px = pow2(s) - p1.x - p2.x;

    ModSub256(py, _2Gnx, px);
    _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
    //_ModSqr(py, _s);
    ModSub256(py, _2Gny);         // py = - p2.y - s*(ret.x-p2.x);  

    Load256(sx, px);
    Load256(sy, py);

  }

  // Update starting point
  __syncthreads();
  Store256A(startx, sx);

you can save time. Then: SHA256 ("02+x") and SHA256("03+x") (without thinking at y value)

On my system I got about a 8% increase of performance.

Obviously at the end you have to do a double check to know if the correct private key for the found address is k or n-k. But only for the address found.

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: arulbero on March 22, 2019, 08:20:43 AM

Yes, I already did it.

It will make me crazy.
It works on my 2 configs and a user on github just post a report on a GeForce GTX 1080 Ti (ccap=6.1) running on Ubuntu 18.04 and it works fine (he uses CUDA10).

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on March 22, 2019, 08:18:17 AM

You may have notice that I changed the makefile.
Now you should call it like this:

Code:

make gpu=1 ccap=50 all

And also set the good variable:

Code:

CUDA       = /usr/local/cuda-8.0
CXXCUDA    = /usr/bin/g++-4.8

The readme is up-to-date

Yes, I already did it.

Jean_Luc

sr. member

Activity: 462

Merit: 701

You may have notice that I changed the makefile.
Now you should call it like this:

Code:

make gpu=1 ccap=50 all

And also set the good variable:

Code:

CUDA       = /usr/local/cuda-8.0
CXXCUDA    = /usr/bin/g++-4.8

The readme is up-to-date

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on March 22, 2019, 08:15:53 AM

Quote from: arulbero on March 22, 2019, 07:56:49 AM

Unfortunately all wrong!!!

That's strange. May be I introduced an other bug.
If you restore the volatile it works ?

No.

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: arulbero on March 22, 2019, 07:56:49 AM

Unfortunately all wrong!!!

That's strange. May be I introduced an other bug.
If you restore the volatile it works ?

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on March 22, 2019, 03:09:08 AM

I removed again the volatile and added "memory" to clobber list of inline assembly. This should prevent the compiler to permute instruction (for pipelining optimization) and loose a carry or get a unexpected one.

Thanks to test the source on github and tell me if you still have the errors.

This is my last idea...

Unfortunately all wrong!!!

Code:

CPU found 1577 items
GPU: point   correct [0/243]
GPU: endo #1 correct [0/251]
GPU: endo #2 correct [0/268]
GPU: sym/point   correct [0/257]
GPU: sym/endo #1 correct [0/256]
GPU: sym/endo #2 correct [0/302]
GPU/CPU check Failed !

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: arulbero on March 21, 2019, 12:31:18 PM

You can delete:

and delete u0, u1, u2 ,u3, r0, r1, r2, r3

I committed your mods and I removed unused variable and changed a bit the squaring, I just replaced the reset of variable t1 and t2 by UADD(t1, 0x0ULL, 0x0ULL); . With this, it is no longer necessary to reset to 0 t1 or t2, t1 is set with carry flag.
I also added my reduction which use MADC instruction (multiply and add).

You can try both implementation by changing at GPUEngine.gu:665

Code:

#if 1

to

Code:

#if 0

I also ported your ModSqr to CPU release in IntMod.cpp.

On my hardware no significant performance increase, the square is ~10% faster the classic mult, so on the global process, no measurable performance increase.

I removed again the volatile and added "memory" to clobber list of inline assembly. This should prevent the compiler to permute instruction (for pipelining optimization) and loose a carry or get a unexpected one.

Thanks to test the source on github and tell me if you still have the errors.

This is my last idea...

Jean_Luc

sr. member

Activity: 462

Merit: 701

It can be due to a wrong optimization concerning a carry somewhere which could explain that it works from time to time.
I had a similar problem with the CPU release when I compiled with gcc 6, gcc 7 or Visual C++ work flawlessly.
The patch (a volatile also) is at IntMop.cpp:859 and IntMp.cpp:915

Jean_Luc

sr. member

Activity: 462

Merit: 701

You can also try VanitySearch -u -check
It will perform the check using uncompressed addresses and so use the CheckHashUncomp() function which is similar except that it calls GetHash160() instead of GetHash160Comp()

Jean_Luc

sr. member

Activity: 462

Merit: 701

endo and sym are computed in CheckHashComp() in GPUCompute.h.
I quoted my last post and I added few comments.
The point (px,py) is always OK so no errors before CHECK_POINT(h, incr, 0);
The errors randomly appear after this line.
It seems that nvcc generates (in your case) a wrong code.

Quote from: Jean_Luc on March 17, 2019, 04:48:34 AM

Code:

__device__ __noinline__ void CheckHashComp(prefix_t *prefix, uint64_t *px, uint64_t *py,
  int32_t incr, uint32_t tid, uint32_t *lookup32, uint32_t *out) {

  uint32_t   h[20];
  uint64_t   pe1x[4];
  uint64_t   pe2x[4];

  // Point
  _GetHash160Comp(px, py, (uint8_t *)h);
  CHECK_POINT(h, incr, 0);                         <-- 100% Ok up to here, means that (px,py) is good

  // Endo #1  if (x, y) = k * G, then (beta*x, y) = lambda*k*G
  _ModMult(pe1x, px, _beta); 
  _GetHash160Comp(pe1x, py, (uint8_t *)h);   <-- 50% Wrong from here
  CHECK_POINT(h, incr, 1);

  // Endo #2 if (x, y) = k * G, then (beta2*x, y) = lambda2*k*G
  _ModMult(pe2x, px, _beta2);
  _GetHash160Comp(pe2x, py, (uint8_t *)h);
  CHECK_POINT(h, incr, 2);

  ModNeg256(py);

  // Symetric points

  _GetHash160Comp(px, py, (uint8_t *)h);
  CHECK_POINT(h, -incr, 0);
  _GetHash160Comp(pe1x, py, (uint8_t *)h);
  CHECK_POINT(h, -incr, 1);
  _GetHash160Comp(pe2x, py, (uint8_t *)h);
  CHECK_POINT(h, -incr, 2);

}

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on March 21, 2019, 12:17:11 PM

OK thanks, it works

On my 645 GTX same performance. Sqr bring few spill moves more (there is more temp variables than in ModMult).
I didn't try yet on the OLD Quadro 600.
I will see If I can win few registers.

With Sqr
1> 33280 bytes stack frame, 128 bytes spill stores, 436 bytes spill loads
Without Sqr
1> 33280 bytes stack frame, 120 bytes spill stores, 424 bytes spill loads

You can delete:

Code:

  //uint64_t r0 = 0x0ULL;
  //uint64_t r1 = 0x0ULL;
  //uint64_t r3 = 0x0ULL;
  //uint64_t r4 = 0x0ULL;

  uint64_t r0;
  uint64_t r1;
  uint64_t r3 ;
  uint64_t r4 ;

and delete u0, u1, u2 ,u3, r0, r1, r2, r3

arulbero

legendary

Activity: 1968

Merit: 2130

Code:

CPU found 1559 items
GPU: point   correct [249/249]
GPU: endo #1 correct [203/281]
GPU: endo #2 correct [220/286]
GPU: sym/point   correct [102/246]
GPU: sym/endo #1 correct [180/248]
GPU: sym/endo #2 correct [179/249]
GPU/CPU check Failed !

Where you compute endo and sym?
Without "volatile" I always get these errors.

Jean_Luc

sr. member

Activity: 462

Merit: 701

OK thanks, it works

On my 645 GTX same performance. Sqr bring few spill moves more (there is more temp variables than in ModMult).
I didn't try yet on the OLD Quadro 600.
I will see If I can win few registers.

With Sqr
1> 33280 bytes stack frame, 128 bytes spill stores, 436 bytes spill loads
Without Sqr
1> 33280 bytes stack frame, 120 bytes spill stores, 424 bytes spill loads

arulbero

legendary

Activity: 1968

Merit: 2130

Quote from: Jean_Luc on March 21, 2019, 11:02:33 AM

Quote from: arulbero on March 21, 2019, 10:20:19 AM

From 153 MKeys/s to 160 MKeys/s

using a _ModSqr instead of _ModMult

Thanks, I tried but the -check failed.
I will have a look at it.
I committed the patch with few of your mods , i also review a bit the main loop.

There were errors. Now it should work:

Code:

__device__ void _ModSqr(uint64_t *rp, const uint64_t *up) {


  uint64_t u0 = up[0];
  uint64_t u1 = up[1];
  uint64_t u2 = up[2];
  uint64_t u3 = up[3];


  uint64_t u10, u11;


  uint64_t r0 = 0x0ULL;
  uint64_t r1 = 0x0ULL;
  uint64_t r3 = 0x0ULL;
  uint64_t r4 = 0x0ULL;

  uint64_t t1 = 0x0ULL;
  uint64_t t2 = 0x0ULL;

  uint64_t s0, s1, s2, s3, s4, s5, s6, s7;
  uint64_t z1, z2, z3, z4, z5, z6, z7, z8;

  //k=0
  UMULLO(s0, u0, u0); 
  UMULHI(r1, u0, u0);

  //k=1
  UMULLO(r3, u0, u1); 
  UMULHI(r4, u0, u1);
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, 0x0ULL);
  UADD1(t1, 0x0ULL);
  s1 = r3;



  //k=2
  UMULLO(r0, u0, u2);
  UMULHI(r1, u0, u2);
  UADDO1(r0, r0);
  UADDC1(r1, r1);
  UADD1(t2, 0x0ULL);
  UMULLO(u10, u1, u1);
  UMULHI(u11, u1, u1);
  UADDO1(r0, u10);
  UADDC1(r1, u11);
  UADD1(t2, 0x0ULL);
  UADDO1(r0, r4);
  UADDC1(r1, t1);
  UADD1(t2, 0x0ULL);


  s2 = r0;
  t1 = 0;
  
  //k=3
  UMULLO(r3, u0, u3);
  UMULHI(r4, u0, u3);
  UMULLO(u10, u1, u2);
  UMULHI(u11, u1, u2);
  UADDO1(r3, u10);
  UADDC1(r4, u11);
  UADD1(t1, 0x0ULL); 
  t1 += t1; 
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, t2);
  UADD1(t1, 0x0ULL); 

  s3 = r3;
  t2 = 0;

  //k=4

  UMULLO(r0, u1, u3);
  UMULHI(r1, u1, u3); 
  UADDO1(r0, r0);
  UADDC1(r1, r1);
  UADD1(t2, 0x0ULL);
  UMULLO(u10, u2, u2);
  UMULHI(u11, u2, u2);
  UADDO1(r0, u10);
  UADDC1(r1, u11);
  UADD1(t2, 0x0ULL); 
  UADDO1(r0, r4);
  UADDC1(r1, t1);
  UADD1(t2, 0x0ULL);

  s4 = r0; 
  t1 = 0;

  //k=5
  UMULLO(r3, u2, u3);
  UMULHI(r4, u2, u3); 
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, t2);
  UADD1(t1, 0x0ULL); 

  s5 = r3;



  //k=6
  UMULLO(r0, u3, u3);
  UMULHI(r1, u3, u3);
  UADDO1(r0, r4);
  UADD1(r1, t1);
  s6 = r0;
  
  //k=7
  s7 = r1;
  
  //Reduction
  UMULLO(z3, s5, 0x1000003d1ULL);
  UMULHI(z4, s5, 0x1000003d1ULL);
  UMULLO(z5, s6, 0x1000003d1ULL);
  UMULHI(z6, s6, 0x1000003d1ULL);
  UMULLO(z7, s7, 0x1000003d1ULL);
  UMULHI(z8, s7, 0x1000003d1ULL);
  UMULLO(z1, s4, 0x1000003d1ULL);
  UMULHI(z2, s4, 0x1000003d1ULL);
  UADDO1(z1, s0);
  UADD1(z2, 0x0ULL);


  UADDO1(z2, s1);
  UADDC1(z4, s2);
  UADDC1(z6, s3);
  UADD1(z8, 0x0ULL);

  
  //uint64_t c = 0;


  UADDO1(z3, z2);
  UADDC1(z5, z4);
  UADDC1(z7, z6);
  UADD1(z8, 0x0ULL);
 
  UMULLO(u10, z8, 0x1000003d1ULL);
  UMULHI(u11, z8, 0x1000003d1ULL);
  UADDO1(z1, u10);
  UADDC1(z3, u11);
  UADDC1(z5, 0x0ULL);
  UADD1(z7, 0x0ULL); 

  /*
  UADD1(c, 0x0ULL);   
      
  rp[0] = z1;
  rp[1] = z3;
  if(c == 1){

     UADDO1(z5, 0x1ULL);
     UADD1(z7, 0x0ULL); 

  }
  */

  rp[0] = z1;
  rp[1] = z3;
  rp[2] = z5;
  rp[3] = z7;
  
 
}

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: arulbero on March 21, 2019, 10:20:19 AM

From 153 MKeys/s to 160 MKeys/s

using a _ModSqr instead of _ModMult

Thanks, I tried but the -check failed.
I will have a look at it.
I committed the patch with few of your mods , i also review a bit the main loop.

Topic: VanitySearch (Yet another address prefix finder) - page 54. (Read 33086 times)