Topic: VanitySearch (Yet another address prefix finder) - page 62. (Read 32916 times)

jr. member
Activity: 82
Merit: 1
Code:
G:\vanitysearch>vanitysearch -stop -t 0 -gpu -gpuId 0,1,2,3,4,5,6 1Testtttt
Start Sun Mar  3 22:19:31 2019
Search: 1Testtttt
Difficulty: 2988734397852221
Base Key:F0407AC53C32B6FD85A9EE9AB912B9650426BC87C5FB4470B89FEF71A853CF
Number of CPU thread: 0
GPU: GPU #5 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #2 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #0 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #1 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #3 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #6 GeForce GTX 1060 3GB (9x128 cores) Grid(144x64)
GPU: GPU #4 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
581.006 MK/s (GPU 581.006 MK/s) (2^35.94) [P 0.00%][50.00% in 41.5d]
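
The progress line above can be roughly reproduced with a small sketch. It assumes each tried key is an independent trial with success chance 1/Difficulty; that is a common model, not something confirmed from the VanitySearch source, and the function names are made up for illustration.

```python
import math

def hit_probability(keys_tried, difficulty):
    """Chance of at least one match after keys_tried keys (assumed model)."""
    return 1.0 - math.exp(-keys_tried / difficulty)

def days_to_reach(prob, difficulty, keys_per_second):
    """Days of searching needed before hit_probability reaches prob."""
    keys_needed = -math.log(1.0 - prob) * difficulty
    return keys_needed / keys_per_second / 86400.0

# Figures taken from the 7-GPU log above
print(days_to_reach(0.5, 2988734397852221, 581.006e6))
```

With the numbers from the log this gives roughly 41 days, in line with the reported "50.00% in 41.5d" (the key rate fluctuates while running). The same model gives about 54% for the 2^27.61 keys of the 1Test run in this thread.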

sr. member
Activity: 462
Merit: 701
everything works perfectly  Smiley  windows-10x64

Thank you very much for testing. Amazing config Cheesy
Just curious, try with the -t 0 option. It will free the CPU cores. With such a config, the CPU may be a bottleneck (GPU/CPU transfers).

Must I use NVIDIA GPUs with this?

Yes, I'll try to develop an OpenCL version.

Edit:
The next step will be to increase performance following valuable advice from arulbero Wink
legendary
Activity: 4354
Merit: 9201
'The right to privacy matters'
Must I use NVIDIA GPUs with this?

I have a Threadripper 1920X CPU with 4 AMD Vega cards on hand.

I also have a Ryzen 1800X with 2 GTX 1050 Ti cards on hand.
jr. member
Activity: 82
Merit: 1
Multi-GPU support is ready (Release 1.5). I tested it on Linux only, so if a Windows user can test it, it would be great.

Example of usage (on an old PC here running Ubuntu 18.04, with 2 Quadro 600 inside):

Code:
$ ./VanitySearch -l
GPU #0 Quadro 600 (2x48 cores) (Cap 2.1) (963.3 MB) (Multiple host threads)
GPU #1 Quadro 600 (2x48 cores) (Cap 2.1) (964.5 MB) (Multiple host threads)

Code:
$ ./VanitySearch -stop -gpu -gpuId 0,1 1Test
Start Sun Mar  3 12:16:26 2019
Search: 1Test
Difficulty: 264104224
Base Key:593CB755EB63B403F247F9890BE2F0FEAB3E9023A779E18A6EA62FD6C3D1FDF5
Number of CPU thread: 1
GPU: GPU #1 Quadro 600 (2x48 cores) Grid(32x64)
GPU: GPU #0 Quadro 600 (2x48 cores) Grid(32x64)
11.009 MK/s (GPU 10.221 MK/s) (2^27.61) [P 53.96%][60.00% in 00:00:03]
Pub Addr: 1Test2JF73wznXjD3LYEfCw4kPqArkvAp
Prv Addr: 5JVb2RQC5APQXti4yaGyNwEyo4phmvm773YaxD6rG9jGyZZtP32
Prv Key : 0x593CB755EB63B403F247F9890BE2F0FEABBF9023A7FBE18A6EA62FD6C3D2BAEE
Check   : 1LZeyhprPQq64ctexwc4Bgo5h15ZSGRWkE
Check   : 1Test2JF73wznXjD3LYEfCw4kPqArkvAp (comp)

Thanks for testing Wink


everything works perfectly  Smiley  windows-10x64

Code:
G:\vanitysearch>vanitysearch -stop -gpu -gpuId 0,1,2,3,4,5,6 1Testtttt
Start Sun Mar  3 19:31:49 2019
Search: 1Testtttt
Difficulty: 2988734397852221
Base Key:3A7BB2F81F78F539A33498862C05256FA4BFE84B9550082788661B3A48F7DDD6
Number of CPU thread: 1
GPU: GPU #0 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #2 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #6 GeForce GTX 1060 3GB (9x128 cores) Grid(144x64)
GPU: GPU #5 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #3 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #1 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
GPU: GPU #4 GeForce GTX 1060 6GB (10x128 cores) Grid(160x64)
546.837 MK/s (GPU 545.374 MK/s) (2^35.49) [P 0.00%][50.00% in 44.0d]
sr. member
Activity: 462
Merit: 701
Multi-GPU support is ready (Release 1.5). I tested it on Linux only, so if a Windows user can test it, it would be great.

Example of usage (on an old PC here running Ubuntu 18.04, with 2 Quadro 600 inside):

Code:
$ ./VanitySearch -l
GPU #0 Quadro 600 (2x48 cores) (Cap 2.1) (963.3 MB) (Multiple host threads)
GPU #1 Quadro 600 (2x48 cores) (Cap 2.1) (964.5 MB) (Multiple host threads)

Code:
$ ./VanitySearch -stop -gpu -gpuId 0,1 1Test
Start Sun Mar  3 12:16:26 2019
Search: 1Test
Difficulty: 264104224
Base Key:593CB755EB63B403F247F9890BE2F0FEAB3E9023A779E18A6EA62FD6C3D1FDF5
Number of CPU thread: 1
GPU: GPU #1 Quadro 600 (2x48 cores) Grid(32x64)
GPU: GPU #0 Quadro 600 (2x48 cores) Grid(32x64)
11.009 MK/s (GPU 10.221 MK/s) (2^27.61) [P 53.96%][60.00% in 00:00:03]
Pub Addr: 1Test2JF73wznXjD3LYEfCw4kPqArkvAp
Prv Addr: 5JVb2RQC5APQXti4yaGyNwEyo4phmvm773YaxD6rG9jGyZZtP32
Prv Key : 0x593CB755EB63B403F247F9890BE2F0FEABBF9023A7FBE18A6EA62FD6C3D2BAEE
Check   : 1LZeyhprPQq64ctexwc4Bgo5h15ZSGRWkE
Check   : 1Test2JF73wznXjD3LYEfCw4kPqArkvAp (comp)

Thanks for testing Wink
jr. member
Activity: 82
Merit: 1
Hello,

CUDA support for Linux is OK. I added a few notes about compilation under Linux using CUDA in the README.
I tested it successfully on Ubuntu 18.04 with the CUDA SDK 8.0 (for my old Quadro 600 Wink).

https://github.com/JeanLucPons/VanitySearch/blob/master/README.md

I'm working now on multi-GPU support.


This is good news. I'm waiting for multi-GPU support; if you need help, I'm ready to help with testing  Smiley
sr. member
Activity: 462
Merit: 701
Hello,

CUDA support for Linux is OK. I added a few notes about compilation under Linux using CUDA in the README.
I tested it successfully on Ubuntu 18.04 with the CUDA SDK 8.0 (for my old Quadro 600 Wink).

https://github.com/JeanLucPons/VanitySearch/blob/master/README.md

I'm working now on multi-GPU support.
sr. member
Activity: 462
Merit: 701
Thanks for the link Smiley
On the GPU, I must say I don't have a clear idea. Nsight is not obvious and it's difficult to interpret the results. It's good for determining whether the GPU is well used (grid size, stream processor occupancy, memory transfers, ...) but I didn't manage to get a clear function-by-function profile. The GPU does not do Base58; it computes up to the hash160 and sends the results back to the CPU, which checks the full Base58 addresses.
Concerning the OpenCL version, I will see; I'm not familiar with it.
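
For readers unfamiliar with the hash160-to-address step mentioned above, here is a minimal sketch of standard Base58Check encoding for a legacy (version byte 0x00) address. It only illustrates the kind of check the CPU side has to do; it is not taken from VanitySearch, and the function names are hypothetical.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def hash160_to_address(h160: bytes, version: int = 0x00) -> str:
    """Base58Check-encode a 20-byte hash160 (illustrative helper)."""
    payload = bytes([version]) + h160
    # Checksum: first 4 bytes of double SHA-256 over version + hash160
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    n = int.from_bytes(data, "big")
    encoded = ""
    while n > 0:
        n, rem = divmod(n, 58)
        encoded = B58_ALPHABET[rem] + encoded
    # Each leading zero byte is written as a leading '1'
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded

def matches_prefix(h160: bytes, prefix: str) -> bool:
    """True if the address built from h160 starts with prefix."""
    return hash160_to_address(h160).startswith(prefix)
```

For example, `matches_prefix(h160, "1Test")` would be the final test applied to each hash160 coming back from the GPU in this simplified picture.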

legendary
Activity: 1948
Merit: 2097
Hello,

I would like to thank arulbero, who gave me by PM a great tip to improve speed using some symmetries Wink
I missed this, shame on me.
It will save a few modular mults. However, ~40% of the CPU time is used for modular mult and the other 60% mainly goes to SHA, RIPE, Base58, ModInv and byte swapping, so I don't know if I can reach 2.0 MKey/s (x 1.66).
For Linux (CPU side), I have to work on code generation optimization, but assembly using AT&T syntax drives me crazy.

As a reference for SHA and RIPE, you could look here: https://github.com/klynastor/supervanitygen

I don't use Base58 in my code, because I only need the address in hex format, not Base58.

When will there be an OpenCL implementation?  Smiley


EDIT: On the CPU, 40% is used for ECC arithmetic; what about on the GPU? I'm curious.
sr. member
Activity: 462
Merit: 701
Hello,

I would like to thank arulbero, who gave me by PM a great tip to improve speed using some symmetries Wink
I missed this, shame on me.
It will save a few modular mults. However, ~40% of the CPU time is used for modular mult and the other 60% mainly goes to SHA, RIPE, Base58, ModInv and byte swapping, so I don't know if I can reach 2.0 MKey/s (x 1.66).
For Linux (CPU side), I have to work on code generation optimization, but assembly using AT&T syntax drives me crazy.

Anyway, I managed to set up CUDA SDK 8.0 on the old Ubuntu PC. I had to patch the NVIDIA driver, a nightmare.
But now CUDA works: I managed to compile the sample code and make it work, so I will be able to develop the multi-GPU release of VanitySearch.
sr. member
Activity: 462
Merit: 701
I have to wait 1 hour to answer your last PM Sad
It's time for me to go to sleep.
See you Smiley
sr. member
Activity: 462
Merit: 701
b) a * b = c mod p: a*b --> 8 x 64-bit limbs, then upper 4 limbs * (2**256 - p) + lower 4 limbs.

I tried this. It gives about the same performance, because in the Montgomery mult the multiplication by P (for secp256k1) can be reduced to a single 64-bit mult. So I'm interested in c).
OK, on Linux, performance is still bad, I'm sorry. Some problem with intrinsics...
legendary
Activity: 1948
Merit: 2097
Linux or Windows?
Is it open source? Can I try it?
Linux. You have a PM.
sr. member
Activity: 462
Merit: 701
Linux or Windows?
Is it open source? Can I try it?
legendary
Activity: 1948
Merit: 2097
A group size of 512 does not bring a significant improvement (less than 1%). The DRS62 ModInv is fast and almost negligible with a group size of 256.
If you have a modular mult faster than the digit-serial Montgomery mult on a 256-bit field, I'm obviously fully open. Folding does not improve things on 256 bits when working with 64-bit digits. I'm not sure if Barrett could be faster; I must say I didn't try, and for a "medium size field" there can be traps.


On my pc:

VanitySearch -stop -u -t 1 1tryme --> 1.2 MKeys/s

my ecc library  --> 2.0 MKeys/s  (17 M Public keys/s)

EDIT:
I use:

a) groups of 4096 points
b) a * b = c mod p: a*b --> 8 x 64-bit limbs, then upper 4 limbs * (2**256 - p) + lower 4 limbs.
c) exploiting some properties of the secp256k1 curve
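
Point b) can be sketched in a few lines. This is only an illustration with Python big integers instead of 64-bit limbs, not arulbero's actual library; the names are hypothetical. It relies on 2**256 being congruent to 2**256 - p (mod p), which for secp256k1 is the small constant 2**32 + 977.

```python
P = 2**256 - 2**32 - 977      # secp256k1 field prime
C = 2**256 - P                # 2**32 + 977, the small folding constant
MASK = 2**256 - 1

def mulmod_fold(a, b):
    """Return a*b mod P by folding the upper 256 bits onto the lower ones."""
    t = a * b                             # up to 512 bits (8 limbs of 64 bits)
    while t >> 256:                       # at most a few passes are needed
        t = (t >> 256) * C + (t & MASK)   # upper half * (2**256 - P) + lower half
    if t >= P:                            # t < 2**256 < 2*P, one subtraction is enough
        t -= P
    return t
```

In a limb-based implementation the multiplication by C fits in a 64-bit multiply per limb, which is what makes this reduction cheap for this particular prime.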



sr. member
Activity: 462
Merit: 701
A group size of 512 does not bring a significant improvement (less than 1%). The DRS62 ModInv is fast and almost negligible with a group size of 256.
If you have a modular mult faster than the digit-serial Montgomery mult on a 256-bit field, I'm obviously fully open. Folding does not improve things on 256 bits when working with 64-bit digits. I'm not sure if Barrett could be faster; I must say I didn't try, and for a "medium size field" there can be traps.


legendary
Activity: 1948
Merit: 2097
Hello,

Affine coordinates for search (faster):
Each group performs p = startP + i*G, i in [1..group_size], where i*G comes from a precomputed table containing G, 2G, 3G, ... in affine coordinates. The inversion of the delta x values (dx1-dx2) is done once per group (1 ModInv and 256*3 mult). group_size is 256 keys.

Projective coordinates for EC multiplication (computation of starting keys). Normalization of the key is done after the multiplication for the starting key.

Edit:
You also may have noticed that I have an innovative implementation of modular inversion (DRS62) which is almost 2 times faster than the Montgomery one. Some benchmarks and comments are available in IntMod.cpp.


OK.
Two questions:

1) Why only 256 for the group size? Is there a memory problem? Fewer inversions are better.

2) The field multiplication a*b = c mod p: why do you use Montgomery? Are you sure it is worth it?
sr. member
Activity: 462
Merit: 701
Hello,

Some news:
I just published a new release (1.4) with a few fixes (especially for Linux), but the uninitialized memory bug may also affect Windows (I didn't manage to reproduce this bug on Windows, but it can be random).

I managed to get back an old PC from my company (~8 years old) with 2 Quadro 600 inside Smiley
Unfortunately the Quadro 600 (Fermi) has only compute capability 2.1 and I will have to set up CUDA SDK 8.0 (the last one which supports Fermi). I set up Ubuntu on this PC and I will try to develop the multi-GPU release under Linux.
I hope I will manage to get good drivers for the Quadro 600 and make it work.
sr. member
Activity: 462
Merit: 701
Hello,

Affine coordinates for search (faster):
Each group performs p = startP + i*G, i in [1..group_size], where i*G comes from a precomputed table containing G, 2G, 3G, ... in affine coordinates. The inversion of the delta x values (dx1-dx2) is done once per group (1 ModInv and 256*3 mult). group_size is 256 keys.

Projective coordinates for EC multiplication (computation of starting keys). Normalization of the key is done after the multiplication for the starting key.

Edit:
You also may have noticed that I have an innovative implementation of modular inversion (DRS62) which is almost 2 times faster than the Montgomery one. Some benchmarks and comments are available in IntMod.cpp.
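
The "1 ModInv and 256*3 mult" figure matches the usual simultaneous-inversion trick (often credited to Montgomery). The sketch below is a generic illustration of that trick, assuming this is what is meant; it is not VanitySearch code, and it uses Fermat exponentiation as a stand-in for the single modular inversion.

```python
def batch_inverse(values, p):
    """Invert every element of values modulo prime p with one inversion.

    Uses roughly 3 multiplications per element plus a single inversion.
    """
    n = len(values)
    prefix = [1] * n          # prefix[i] = product of values[0..i-1]
    acc = 1
    for i, v in enumerate(values):
        prefix[i] = acc
        acc = acc * v % p     # 1 mult per element
    inv = pow(acc, p - 2, p)  # the only inversion (stand-in for a real ModInv)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p   # inverse of values[i]
        inv = inv * values[i] % p      # drop values[i] from the running inverse
    return out
```

In the group addition described above, the inputs would be the 256 x-coordinate differences, and each resulting inverse feeds the slope of one affine point addition.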
legendary
Activity: 1948
Merit: 2097
Hello,

I would like to present a new bitcoin address prefix finder called VanitySearch. It is very similar to Vanitygen.
The main differences with Vanitygen are that VanitySearch does not use the heavy OpenSSL for CPU calculation and that the kernel is written in CUDA in order to take full advantage of inline PTX assembly.
On my Intel Core i7-4770, VanitySearch runs ~4 times faster than vanitygen64. (1.32 Mkey/s -> 5.27  MK/s)
On my  GeForce GTX 645, VanitySearch runs ~1.5 times faster than oclvanitygen. (9.26 Mkey/s -> 14.548 MK/s)
If you want to compare VanitySearch and Vanitygen results, use the -u option to search for uncompressed addresses.

There are still lots of improvements to make.
Feel free to test it and to submit issues.


Are you using affine or Jacobian coordinates for the points?