This post is reserved for the status of the previous one.
Status:
Old (superseded): At the moment, the correction in the previous post is not final and not fully verified; use it at your own risk.
Current: The correction in the previous post is final and fully tested.
This is an old, unresolved, fundamental bug in the OpenCL kernel, and I will help all of us fix it.
Fact: all known cases have one thing in common: Nvidia + other (no AMD). Yes, it's a "green bug".
Test progress:
[v] rigGPU (nvidia) runs an accelerated search for the crash key
[v] the crash key is found and the crash reproduces reliably
[v] after applying the fixed kernel, the crash stops
[v] rigGPU (nvidia) runs an accelerated stability test of the fixed kernel
[v] the crash key does not reproduce the crash if the vliw(amd) macros are used
[v] rigGPU (nvidia) runs an accelerated search for a crash key, using the vliw(amd) macros
[v] no crash key is found; the vliw(amd) macros do not have this bug
################
For those who are interested in understanding it, here is a detailed analysis of the problem: how and why it arose.
This bug comes from the implementation of the Bignum library, which relies on overflow of the 32-bit integer type.
The error occurs under boundary conditions, at max+1 / min-1.
Overflow mechanics form a cycle:
min=0x00000000
max=0xffffffff
max+1=min(overflow!)
min-1=max(overflow!)
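To make the cycle concrete, here is a minimal sketch in plain C (not OpenCL). The helper names wrap_inc and wrap_dec are hypothetical and only illustrate the wraparound; they are not part of the kernel.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of calc_addrs.cl: they only show the wraparound. */

/* max + 1 wraps around to min (0xffffffff + 1 == 0x00000000). */
static uint32_t wrap_inc(uint32_t x)
{
    return x + 1u;
}

/* min - 1 wraps around to max (0x00000000 - 1 == 0xffffffff). */
static uint32_t wrap_dec(uint32_t x)
{
    return x - 1u;
}
```

Unsigned 32-bit arithmetic is defined to work modulo 2^32, so no error is raised here; the code has to detect the wrap itself.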
Note for newbies:
We can't add/subtract/multiply bignumA and bignumB in one step, but we can do it in N steps.
A bignum X (e.g. 256 bits) is split into N=8 words (e.g. 32 bits each; 32x8 = 256 bits).
Here, bn_subb_word() is a macro that calculates (bignumA - bignumB) step by step: (n_wordA - n_wordB), carrying 1 into the next step (n+1) if an overflow happened.
Original code:
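As a rough reference for the step-by-step idea, here is a sketch in plain C that subtracts two 256-bit numbers word by word. It is not the kernel's code: the name bn_sub_ref, the BN_NWORDS constant and the least-significant-word-first order are assumptions for illustration, and the borrow is taken from a 64-bit intermediate instead of comparisons.

```c
#include <stdint.h>

#define BN_NWORDS 8  /* assumed: 8 words x 32 bits = 256 bits */

/*
 * r = a - b, word by word, least significant word first (assumed order).
 * Returns the final borrow: 1 if a < b as whole numbers, else 0.
 */
static uint32_t bn_sub_ref(uint32_t r[BN_NWORDS],
                           const uint32_t a[BN_NWORDS],
                           const uint32_t b[BN_NWORDS])
{
    uint32_t borrow = 0;
    for (int i = 0; i < BN_NWORDS; i++) {
        /* Done in 64 bits: a negative result wraps and sets the upper half. */
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        r[i] = (uint32_t)d;                 /* low 32 bits are the result word */
        borrow = (uint32_t)(d >> 32) & 1u;  /* 1 if this step went below zero */
    }
    return borrow;
}
```

A GPU kernel would normally avoid the 64-bit intermediate and derive the borrow from 32-bit comparisons, which is exactly where a boundary case can be missed.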
#define bn_subb_word(r, a, b, t, c) do { \
t = a - (b + c); \
c = (!(a) && c) ? 1 : 0; \
c |= (a < b) ? 1 : 0; \
r = t; \
} while (0)
r - result (a - b)
a - word n of bignumA
b - word n of bignumB
t - 32-bit temporary variable, subject to overflow
c - 1-bit carry flag; it transfers 1 to the next word if an overflow happened (c == 1)
Look at:
c = (!(a) && c) ? 1 : 0;
Hmm... how about this case:
if [a != 0, a == b, c == 1], then [a - (b + c) = max (overflow)]. The result is correct, but the overflow is not detected!
We can add a comparison to fix it:
c = (c && (a == b)) ? 1 : 0;
But adding one more line to a kernel loop that runs billions of iterations is a bad idea!
So we don't add the line; we replace the old one with it. We can do that because the new comparison also covers the old one.
Fact: after this replacement, the crash stopped!
Final fixed bn_subb_word():
#define bn_subb_word(r, a, b, t, c) do { \
t = a - (b + c); \
c = (c && (a == b)) ? 1 : 0; \
c |= (a < b) ? 1 : 0; \
r = t; \
} while (0)
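For anyone who wants to check this on a CPU, here is a small sketch in plain C. It copies both versions of the macro under the hypothetical names bn_subb_word_orig and bn_subb_word_fixed, and compares each against a 64-bit reference subtraction. The helper names orig_matches and fixed_matches are mine, not from the kernel.

```c
#include <stdint.h>

/* Original macro, renamed only so both versions can live side by side. */
#define bn_subb_word_orig(r, a, b, t, c) do { \
        t = a - (b + c); \
        c = (!(a) && c) ? 1 : 0; \
        c |= (a < b) ? 1 : 0; \
        r = t; \
    } while (0)

/* Fixed macro: only the first carry line differs. */
#define bn_subb_word_fixed(r, a, b, t, c) do { \
        t = a - (b + c); \
        c = (c && (a == b)) ? 1 : 0; \
        c |= (a < b) ? 1 : 0; \
        r = t; \
    } while (0)

/* Returns 1 if the original macro agrees with a 64-bit reference, else 0. */
static int orig_matches(uint32_t a, uint32_t b, uint32_t c_in)
{
    uint32_t r, t, c = c_in;
    bn_subb_word_orig(r, a, b, t, c);
    uint64_t d = (uint64_t)a - b - c_in;  /* reference: wraps in 64 bits */
    return r == (uint32_t)d && c == ((uint32_t)(d >> 32) & 1u);
}

/* Same check for the fixed macro. */
static int fixed_matches(uint32_t a, uint32_t b, uint32_t c_in)
{
    uint32_t r, t, c = c_in;
    bn_subb_word_fixed(r, a, b, t, c);
    uint64_t d = (uint64_t)a - b - c_in;
    return r == (uint32_t)d && c == ((uint32_t)(d >> 32) & 1u);
}
```

The result word is the same in both versions; only the carry differs, and only in the case a == b, a != 0, c == 1. For a == b == 0 with c == 1, both versions set the carry, which is why the new comparison can replace the old one.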
Alternative fix: you can force the use of the vliw(amd) macros on Nvidia.
Just add this line at the top of calc_addrs.cl:
#define DEEP_VLIW 1
################
And a little more about possible optimizations.
1) The main problem is the slow OpenSSL library; moving to the secp256k1 library would already improve performance.
2) When calculating a compressed key, vanitygen computes the Y coordinate, although this is not necessary.
In fact, the program always computes an uncompressed key and compresses it at the end if requested.
3) Squaring requires fewer multiplications than multiplying two different numbers.
My attempts to create bn_sqr_mont() failed, as adding IF() inevitably breaks the parallelism of the GPU.
IF can be bypassed, but at the cost of more register spilling. Montgomery and the GPU are a very problematic couple.
4) Symmetric rather than sequential calculation of the inversion batch would also give a speedup.
5) Symmetric Y and the endomorphism (lambda/beta) of ecdsa would also give a speedup.
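To illustrate point 3, here is a toy sketch in plain C of why squaring needs fewer multiplications. It is not bn_sqr_mont(): there is no Montgomery reduction, the limbs are only 8 bits wide so everything fits in 64 bits, and all names (LIMBS, mul_count, limb_mul, schoolbook_mul, schoolbook_sqr) are hypothetical.

```c
#include <stdint.h>

#define LIMBS 4  /* toy size: a 32-bit value as 4 limbs of 8 bits */

static int mul_count;  /* number of limb multiplications performed so far */

/* One limb-by-limb multiplication, counted. */
static uint64_t limb_mul(uint32_t a, uint32_t b)
{
    mul_count++;
    return (uint64_t)a * b;
}

/* General schoolbook product: every limb of x times every limb of y. */
static uint64_t schoolbook_mul(uint32_t x, uint32_t y)
{
    uint64_t acc = 0;
    for (int i = 0; i < LIMBS; i++)
        for (int j = 0; j < LIMBS; j++)
            acc += limb_mul((x >> (8 * i)) & 0xff, (y >> (8 * j)) & 0xff)
                   << (8 * (i + j));
    return acc;
}

/* Squaring: each cross product x[i]*x[j] (i < j) is computed once and doubled. */
static uint64_t schoolbook_sqr(uint32_t x)
{
    uint64_t acc = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint32_t xi = (x >> (8 * i)) & 0xff;
        acc += limb_mul(xi, xi) << (16 * i);               /* diagonal term */
        for (int j = i + 1; j < LIMBS; j++) {
            uint32_t xj = (x >> (8 * j)) & 0xff;
            acc += limb_mul(xi, xj) << (8 * (i + j) + 1);  /* doubled by the +1 shift */
        }
    }
    return acc;
}
```

With N limbs the general product uses N*N limb multiplications and the square uses N*(N+1)/2. The inner loop bound depends on i here, which is harmless on a CPU; how to express that on a GPU without branching or extra registers is the hard part described in point 3.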
All these optimizations are possible, but they would be a waste of time,
because VanitySearch has come out, and it has none of the disadvantages listed above.
https://bitcointalksearch.org/topic/vanitysearch-yet-another-address-prefix-finder-5112311
Now the only advantages of VanityGen are OpenCL and the expanded altcoin support in exploitagency/vanitygen-plus.
################
Final edit: Aug 15 2019