Pollard's kangaroo ECDLP solver - page 95.

arulbero

legendary

Activity: 1948

Merit: 2097

Quote from: Jean_Luc on June 18, 2020, 09:14:46 PM

Quote from: arulbero on June 18, 2020, 05:23:26 PM

@JeanLuc
How much constant memory do you use for the multiplication and for the addition?
32 jumps are 16 kbit (2 kB) for the x- and y-coordinates + 8 kbit (1 kB) for their private keys (32 * 256 bit = 8 kbit) + what else?

I use the following setting to prefer L1 cache as shared mem is not used.
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

In constant mem:

Code:

__device__ __constant__ uint64_t _0[] = { 0ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _1[] = { 1ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _P[] = { 0xFFFFFFFEFFFFFC2F,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0ULL };
__device__ __constant__ uint64_t MM64 = 0xD838091DD2253531; // 64bits lsb negative inverse of P (mod 2^64)
__device__ __constant__ uint64_t _O[] = { 0xBFD25E8CD0364141ULL,0xBAAEDCE6AF48A03BULL,0xFFFFFFFFFFFFFFFEULL,0xFFFFFFFFFFFFFFFFULL };
__device__ __constant__ uint64_t jD[NB_JUMP][4];
__device__ __constant__ uint64_t jPx[NB_JUMP][4];
__device__ __constant__ uint64_t jPy[NB_JUMP][4];

I will definitely reduce jD to 128 bits in the next release. The less constant memory used the better: there is 64 KB available, but for the L1 cache the lower the usage, the better.

128 bit * 32 = 4 kbit (512 bytes) saved, good.

If you accept breaking compatibility with the #115 search, you can save another 1 kbit (128 bytes) by picking as jump points those whose x-coordinate has its first 32 bits equal to 0; there are many of them in the file of old DPs.

Jean_Luc

sr. member

Activity: 462

Merit: 701

Quote from: arulbero on June 18, 2020, 05:23:26 PM

@JeanLuc
How much constant memory do you use for the multiplication and for the addition?
32 jumps are 16 kbit (2 kB) for the x- and y-coordinates + 8 kbit (1 kB) for their private keys (32 * 256 bit = 8 kbit) + what else?

I use the following setting to prefer L1 cache as shared mem is not used.
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

In constant mem:

Code:

__device__ __constant__ uint64_t _0[] = { 0ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _1[] = { 1ULL,0ULL,0ULL,0ULL,0ULL };
__device__ __constant__ uint64_t _P[] = { 0xFFFFFFFEFFFFFC2F,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF,0ULL };
__device__ __constant__ uint64_t MM64 = 0xD838091DD2253531; // 64bits lsb negative inverse of P (mod 2^64)
__device__ __constant__ uint64_t _O[] = { 0xBFD25E8CD0364141ULL,0xBAAEDCE6AF48A03BULL,0xFFFFFFFFFFFFFFFEULL,0xFFFFFFFFFFFFFFFFULL };
__device__ __constant__ uint64_t jD[NB_JUMP][4];
__device__ __constant__ uint64_t jPx[NB_JUMP][4];
__device__ __constant__ uint64_t jPy[NB_JUMP][4];

I will definitely reduce jD to 128 bits in the next release. The less constant memory used the better: there is 64 KB available, but for the L1 cache the lower the usage, the better.
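A rough byte count of the declarations above, as a minimal sketch. It assumes NB_JUMP = 32 (taken from the "32 jumps" in the question), takes _O as the four limbs shown, and ignores any alignment padding the compiler may add.

```python
# Byte footprint of the __constant__ declarations quoted above.
# Assumptions: NB_JUMP = 32, _O has 4 limbs, no alignment padding.
NB_JUMP = 32
LIMB = 8  # sizeof(uint64_t) in bytes


def footprint(jd_limbs=4):
    """Return sizes in bytes; jd_limbs=2 models the planned 128-bit jD."""
    sizes = {
        "_0": 5 * LIMB,
        "_1": 5 * LIMB,
        "_P": 5 * LIMB,
        "MM64": LIMB,
        "_O": 4 * LIMB,
        "jD": NB_JUMP * jd_limbs * LIMB,
        "jPx": NB_JUMP * 4 * LIMB,
        "jPy": NB_JUMP * 4 * LIMB,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


current = footprint(4)
planned = footprint(2)
print(current["total"], planned["total"], current["total"] - planned["total"])
# prints: 3232 2720 512
```

Under these assumptions the three jump tables dominate (3 * 1024 bytes), and the 128-bit jD saves 512 bytes.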

Quote from: Etar on June 18, 2020, 01:22:40 PM

In that case the best choice for solving keys is using CPUs :D

Yes, this is true for small range (as written in the README).

Quote from: arulbero on June 18, 2020, 01:41:51 PM

Great work!

Thanks

Quote from: arulbero on June 18, 2020, 01:41:51 PM

For the #120, if you and Zielar use 2^30 kangaroos, you need to use a DP < 28.

Yes, we haven't launched the run yet; we will make our choice in the days to come. A small DP also increases the memory needed.

Quote from: arulbero on June 18, 2020, 01:41:51 PM

If you reduce k, you reduce the speed, then you have to reduce theta (DP).

Right.

Quote from: arulbero on June 18, 2020, 01:41:51 PM

How many kangaroos run in parallel on a single V100? At which speed?

~2^20 for the last 2 runs, we will see for #120.

arulbero

legendary

Activity: 1948

Merit: 2097

Quote from: Etar on June 18, 2020, 04:20:39 PM

I also ran a test on 1000 pubkeys in the same range, but with normal solving without tricks.
Here is the result:
Total OP: 273125509453.87 = 2^37.99
Average OP: 2^28.04

Unfortunately the difference is very small.

--------------------------------------------------------------------------------------------------------

I read this article:

https://medium.com/@johncantrell97/how-i-checked-over-1-trillion-mnemonics-in-30-hours-to-win-a-bitcoin-635fe051a752

this is the puzzle https://twitter.com/alistairmilne/status/1266037520715915267

I think that zielar could have won that prize easily too.

About this part of the article:

Quote

In a GPU you have four main types of memory available to you (Global, Constant, Local, and Private). Global memory is shared across all GPU cores and is very slow to access, you want to minimize its use as much as possible. Constant and Private memory are extremely fast but limited in space. I believe most devices only support 64kB of constant memory. Local memory is shared by a “group” of workers and its speed is somewhere between Global and Constant.

My goal was to fit everything I needed into the 64kB of constant memory and never need to read from global or local memory to maximize the speed of the program. This proved to be a bit tricky because the standard precomputed secp256k1 multiplication table took up exactly 64kB by itself.

@JeanLuc

How much constant memory do you use for the multiplication and for the addition?

32 jumps are 16 kbit (2 kB) for the x- and y-coordinates + 8 kbit (1 kB) for their private keys (32 * 256 bit = 8 kbit) + what else?

Etar

sr. member

Activity: 653

Merit: 316

I also ran a test on 1000 pubkeys in the same range, but with normal solving without tricks.
Here is the result:
Total OP: 273125509453.87 = 2^37.99
Average OP: 2^28.04

Etar

sr. member

Activity: 653

Merit: 316

Quote from: brainless on June 18, 2020, 02:05:21 PM

-snip-
Etar, I will give you an experiment in another, lower bit range by PM; then you will see the difference.

ok

brainless

member

Activity: 348

Merit: 34

Quote from: Etar on June 18, 2020, 01:52:52 PM

Quote from: brainless on June 18, 2020, 01:45:25 PM

-snip-
for you 1000 pubkeys in 54 bit range total time for check ?

I don't know how much time the test took, but by calculation the average is 2^3.8 s per pubkey at 20 Mop/s on CPU.

in short i can say for check 1000 keys from 54 bit, your actual work close to 74/75/76 bit range
and it should be 54 bit total keys x 1000, and get your bit range for real work should be,
btw you all are working in opposite direction, you are trying to change resource like dp, G etc, and in my view most fast way is reverse corresponding pubkey, leave all dp files of 115 as its, and just try my 80 pubkeys for 115 bit already generated dp file, and it will save 80% work and time instead to shift dp/G into 120 bit
Etar, I will give you an experiment in another, lower bit range by PM; then you will see the difference.

Etar

sr. member

Activity: 653

Merit: 316

Quote from: brainless on June 18, 2020, 01:45:25 PM

-snip-
for you 1000 pubkeys in 54 bit range total time for check ?

I don't know how much time the test took, but by calculation the average is 2^3.8 s per pubkey at 20 Mop/s on CPU.

brainless

member

Activity: 348

Merit: 34

Quote from: brainless on June 18, 2020, 01:39:10 PM

Quote from: Etar on June 18, 2020, 01:35:37 PM

Quote from: brainless on June 18, 2020, 01:30:51 PM

-snip-
Etar
if i am not wrong
you have 49 bit 1000 pubkeys
you shift dp file to 54 bit
and you have this info
" Expected operations 2^28.06, Total op was 261530636052 = 2^37.93 "
if i am not wrong you total op 2^38 work = to search 1 pubkey in 2^76 bitrange
76 bit = 75557863725914323419135
54 bit = 18014398509481983
distance = 4194303 for 1000 pub key
and for 1 pubkey = 419.4303
and for 32*g = 134217.72
in short you used tooo much time as compare to equal 76 bit work
hope u understand

2^37.93 is for ALL pubkeys;
for 1 pubkey the average op is 2^27.98.
And I didn't have 1000 pubkeys in the 49-bit range. I solved 1 key and converted the work file to the 54-bit range.
In 54 bit I solved 1000 pubkeys.

run any random pubkey for 76 bit, and see you need these 2^37.93 for finish

for you 1000 pubkeys in 54 bit range total time for check ?

Etar

sr. member

Activity: 653

Merit: 316

Quote from: brainless on June 18, 2020, 01:39:10 PM

-snip-
run any random pubkey for 76 bit, and see you need these 2^37.93 for finish

Range 2^ 54.0
Expected OP 2^ 28.054231864896956
Range 2^ 76.0
Expected OP 2^ 39.05419477968964

COBRAS

member

Activity: 873

Merit: 22


Quote from: COBRAS on June 17, 2020, 04:27:41 AM

@JeanLuc, could you please add an option to save all dead kangaroos to a .txt file in a readable format? This information is needed to understand mistakes with ranges and the -d param.

@JeanLuc, and please also count +/- dead kangaroos: [Dead 10] -> [Dead total: 10 ["+" 8, "-" 2 ]]

BR.

@JeanLuc

Buddy, will you implement this?

BR

arulbero

legendary

Activity: 1948

Merit: 2097

Quote from: Jean_Luc on June 18, 2020, 01:07:55 PM

The best time complexity with the birthday paradox and kangaroos having starting positions uniformly distributed in [0..N] is obtained when drawing alternately one tame and one wild. When using DP0, you get ~2.sqrt(N). When using DPx, it is like drawing alternately 2^x TAME and then 2^x WILD, so you have a DP overhead.
I managed to get the formula for the parallel version, which is ~2.CubicRoot( N (k.theta + sqrt(N)) ), where theta=2^dpbit and k is the number of kangaroo-2 running in parallel.
For theta=1 (DP0) and k=0 (or when k.theta << sqrt(N)) we do get ~2.sqrt(N); when k.theta >> sqrt(N), the time complexity tends to ~2.CubicRoot(k.N.theta).
So it is important to choose a DP such that k.theta << sqrt(N).

Great work!

The number of kangaroos running in parallel turns DP (which means less RAM) into an overhead.

For the #120, if you and Zielar use 2^30 kangaroos, you need to use a DP < 28.

The problem is: how fast is a single kangaroo? You can't 'concentrate' the speed of several kangaroo in more speed for one. And it takes a lot for a single kangaroo to walk through a path of length = 2^28.

If you reduce k, you reduce the speed, then you have to reduce theta (DP).

How many kangaroos run in parallel on a single V100? At which speed?
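The DP bound above can be checked with a small sketch. It computes the dpbit value at which k.theta equals sqrt(N); a practical DP has to sit below that. The 119-bit range width for #120 is an assumption (key taken to lie in [2^119, 2^120)), and the function name is hypothetical.

```python
import math


def dp_threshold_bits(range_bits, kangaroos):
    """dpbit at which k*theta == sqrt(N), i.e. log2(sqrt(N) / k)."""
    return range_bits / 2.0 - math.log2(kangaroos)


# Assumed figures: 2^119-wide range for #120, 2^30 kangaroos (from the post).
print(dp_threshold_bits(119, 2**30))  # 29.5
```

With these figures, DP 28 puts k.theta a factor of 2^1.5 (about 2.8) below sqrt(N), so "DP < 28" leaves at least that much margin.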

brainless

member

Activity: 348

Merit: 34

Quote from: Etar on June 18, 2020, 01:35:37 PM

Quote from: brainless on June 18, 2020, 01:30:51 PM

-snip-
Etar
if i am not wrong
you have 49 bit 1000 pubkeys
you shift dp file to 54 bit
and you have this info
" Expected operations 2^28.06, Total op was 261530636052 = 2^37.93 "
if i am not wrong you total op 2^38 work = to search 1 pubkey in 2^76 bitrange
76 bit = 75557863725914323419135
54 bit = 18014398509481983
distance = 4194303 for 1000 pub key
and for 1 pubkey = 419.4303
and for 32*g = 134217.72
in short you used tooo much time as compare to equal 76 bit work
hope u understand

2^37.93 is for ALL pubkeys;
for 1 pubkey the average op is 2^27.98.
And I didn't have 1000 pubkeys in the 49-bit range. I solved 1 key and converted the work file to the 54-bit range.
In 54 bit I solved 1000 pubkeys.

run any random pubkey for 76 bit, and see you need these 2^37.93 for finish

Etar

sr. member

Activity: 653

Merit: 316

Quote from: brainless on June 18, 2020, 01:30:51 PM

-snip-
Etar
if i am not wrong
you have 49 bit 1000 pubkeys
you shift dp file to 54 bit
and you have this info
" Expected operations 2^28.06, Total op was 261530636052 = 2^37.93 "
if i am not wrong you total op 2^38 work = to search 1 pubkey in 2^76 bitrange
76 bit = 75557863725914323419135
54 bit = 18014398509481983
distance = 4194303 for 1000 pub key
and for 1 pubkey = 419.4303
and for 32*g = 134217.72
in short you used tooo much time as compare to equal 76 bit work
hope u understand

2^37.93 is for ALL pubkeys;
for 1 pubkey the average op is 2^27.98.
And I didn't have 1000 pubkeys in the 49-bit range. I solved 1 key and converted the work file to the 54-bit range.
In 54 bit I solved 1000 pubkeys. And the work file was the same for each pubkey: I didn't append DPs to the file while solving in the 54-bit range.

brainless

member

Activity: 348

Merit: 34

Quote from: Etar on June 18, 2020, 12:10:07 PM

Quote from: arulbero on June 18, 2020, 11:16:08 AM

-snip-
The gain (- 5,4%) is little, how many DPs have you generated for the key in 49 bit?
-snip-

DP Count : 240568 2^17.876 in 49bit workfile. I have the source file and the converted file, and I can share them if you want to verify the DPs. Each DP was verified against G' after the multiplication, and the x-coordinate is correct.

Code:

DP bits : 8
Start : 40000000000000
Stop : 7FFFFFFFFFFFFF
Key : 025C396BA4347253BBAAFFAC6D4F9BA092847B27F2599EB2EB225DDA54F9964190
Count : 0 2^-inf
Time : 00s
DP Size : 9.3/30.6MB
DP Count : 240568 2^17.876
HT Max : 8 [@ 01D30E]
HT Min : 0 [@ 000001]
HT Avg : 0.92
HT SDev : 0.95

Quote from: COBRAS on June 18, 2020, 11:11:15 AM

-snip-
"Total op was 261530636052 = 2^37.93" this for 1000 pubkey or only one ?

For 1000 pubkeys

Quote from: COBRAS on June 18, 2020, 11:17:19 AM

-snip-

Etar, this is a CPU or GPU version ?

Test done on CPU

Etar
if i am not wrong
you have 49 bit 1000 pubkeys
you shift dp file to 54 bit
and you have this info
" Expected operations 2^28.06, Total op was 261530636052 = 2^37.93 "
if i am not wrong you total op 2^38 work = to search 1 pubkey in 2^76 bitrange
76 bit = 75557863725914323419135
54 bit = 18014398509481983
distance = 4194303 for 1000 pub key
and for 1 pubkey = 419.4303
and for 32*g = 134217.72
in short you used tooo much time as compare to equal 76 bit work
hope u understand

Etar

sr. member

Activity: 653

Merit: 316

Quote from: Jean_Luc on June 18, 2020, 01:07:55 PM

The best time complexity with the birthday paradox and kangaroo having a starting position uniformly distributed in [0..N] is obtained when drawing alternatively one tame and one wild. When using DP0, you get ~2.sqrt(N). When using DPx, it is like drawing alternatively 2^x TAME and then 2^x WILD, you have then a DP overhead.
I managed to get the formula for parallel version which is: ~2.CubicRoot( N (k.theta + sqrt(N)) ) where theta=2^dpbit and k the number of kangaroo-2 running in parallel.
For theta=1 (DP0) and k=0 (or when k.theta << sqrt(N)) we well get ~2.sqrt(N), when k.theta >> sqrt(N), the time complexity tend to ~2.CubicRoot(k.N.theta).
So it is important to choose a DP such as k.theta << sqrt(N).

In that case the best choice for solving keys is using CPUs :D

i7 7 thread 20Mop/s and 7168 kangaroos
2080ti 1.4Gop/s and 4464292 kangaroos
1.4 Gop/s / 20 Mop/s = 70 CPUs give the same speed, but with only 70 * 7168 = 501760 kangaroos: 8.9 times fewer kangaroos, hence less DP overhead. ;)
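The comparison above is plain arithmetic on the figures from the post; a minimal sketch that reproduces it (no independent measurement, variable names are mine):

```python
# Figures quoted in the post above; nothing here is measured independently.
CPU_OPS, CPU_KANGAROOS = 20_000_000, 7168        # i7, 7 threads
GPU_OPS, GPU_KANGAROOS = 1_400_000_000, 4464292  # 2080 Ti

cpus_needed = GPU_OPS // CPU_OPS        # CPUs matching one GPU's speed
cpu_herd = cpus_needed * CPU_KANGAROOS  # kangaroos across those CPUs
herd_ratio = GPU_KANGAROOS / cpu_herd   # how much larger the GPU herd is

print(cpus_needed, cpu_herd, round(herd_ratio, 1))  # 70 501760 8.9
```

Since the DP overhead term is k.theta, a herd 8.9 times smaller at the same total speed allows a larger DP before the overhead starts to matter.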

Jean_Luc

sr. member

Activity: 462

Merit: 701

The best time complexity with the birthday paradox and kangaroos having starting positions uniformly distributed in [0..N] is obtained when drawing alternately one tame and one wild. When using DP0, you get ~2.sqrt(N). When using DPx, it is like drawing alternately 2^x TAME and then 2^x WILD, so you have a DP overhead.
I managed to get the formula for the parallel version, which is ~2.CubicRoot( N (k.theta + sqrt(N)) ), where theta=2^dpbit and k is the number of kangaroo-2 running in parallel.
For theta=1 (DP0) and k=0 (or when k.theta << sqrt(N)) we do get ~2.sqrt(N); when k.theta >> sqrt(N), the time complexity tends to ~2.CubicRoot(k.N.theta).
So it is important to choose a DP such that k.theta << sqrt(N).
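A minimal sketch of the formula above, working in log2 so large ranges stay readable. The function name and parameters are hypothetical; only the formula itself comes from the post.

```python
import math


def expected_ops_log2(range_bits, k, dp_bits):
    """log2 of ~2 * cbrt(N * (k*theta + sqrt(N))), N = 2^range_bits, theta = 2^dp_bits."""
    n = 2.0 ** range_bits
    theta = 2.0 ** dp_bits
    return 1.0 + math.log2(n * (k * theta + math.sqrt(n))) / 3.0


# Limit k.theta << sqrt(N): reduces to 2*sqrt(N), i.e. 1 + range_bits/2.
print(expected_ops_log2(54, 0, 0))

# Limit k.theta >> sqrt(N): tends to 2*cbrt(k*N*theta).
print(expected_ops_log2(54, 2**30, 40), 1.0 + (30 + 54 + 40) / 3.0)
```

Calling it with a fixed range and k while sweeping dp_bits shows where the DP overhead starts to dominate.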

arulbero

legendary

Activity: 1948

Merit: 2097

Quote from: Etar on June 18, 2020, 12:41:23 PM

Quote from: arulbero on June 18, 2020, 12:31:12 PM

-snip-

To get 2^27.98, it is like 2^25.876 was worth about 2^21.8, the effect of reuse of old DPs is reduced by a factor of 16.

I think that for the birthday paradox there is a big difference between having 16 tame DPs and 2 wild DPs, or 9 tame and 9 wild.
Maybe if the Kangaroo app first launched only wild kangaroos (as you said somewhere above) until they had gained the same number of DPs as the tame ones, we would get a faster result.

Yes, but in that case there is no big difference:

(2^x + 2^25.876) tame * (2^x + 2^25.876) wild = 2^54 couples

2^x + 2^25.876 = 2^27

x = log2(2^27-2^25.876) = 26.1

--> only 2^26.1 tame + 2^26.1 wild + 2^25.876 wild steps are needed -> 2^27.61
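The balanced case above can be reproduced with a short sketch. It assumes, as in the post, that old and new points are spread evenly so that (tame) * (wild) = 2^54 pairs is the target; the function name is hypothetical.

```python
import math


def symmetric_reuse(range_bits, old_bits):
    """Solve (2^x + 2^old) * (2^x + 2^old) = 2^range_bits.

    Returns (x, log2 of the new steps: 2^x tame + 2^x wild + 2^old wild).
    """
    side = 2.0 ** (range_bits / 2.0)  # each herd must reach sqrt(2^range_bits)
    old = 2.0 ** old_bits
    new = side - old                  # 2^x
    return math.log2(new), math.log2(2.0 * new + old)


x, total = symmetric_reuse(54, 25.876)
print(round(x, 2), round(total, 2))
```

Without rounding x first, the total comes out near 2^27.62; the 2^27.61 in the post results from rounding x to 26.1 before adding.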

Etar

sr. member

Activity: 653

Merit: 316

Quote from: arulbero on June 18, 2020, 12:31:12 PM

-snip-

To get 2^27.98, it is like 2^25.876 was worth about 2^21.8, the effect of reuse of old DPs is reduced by a factor of 16.

I think that for the birthday paradox there is a big difference between having 16 tame DPs and 2 wild DPs, or 9 tame and 9 wild.
Maybe if the Kangaroo app first launched only wild kangaroos (as you said somewhere above) until they had gained the same number of DPs as the tame ones, we would get a faster result.

arulbero

legendary

Activity: 1948

Merit: 2097

Quote from: Etar on June 18, 2020, 12:10:07 PM

Quote from: arulbero on June 18, 2020, 11:16:08 AM

-snip-
The gain (- 5,4%) is little, how many DPs have you generated for the key in 49 bit?
-snip-

DP Count : 240568 2^17.876 in 49bit workfile

Then you have computed 2^17.876 * 2^8 = 2^25.876 points,

(2^x + 2^25.876) tame * (2^x) wild = 2^54 couples

--> x = 26.67

In theory 2^27.67 steps would be enough instead of 2^27.98.

To get 2^27.98, it is as if the 2^25.876 points were worth only about 2^21.8: the effect of reusing the old DPs is reduced by a factor of 16.
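Both numbers in this estimate can be reproduced with a short sketch. It solves the quadratic (2^x + 2^old) * 2^x = 2^54 for x, and then inverts the same relation to ask how many old points would explain the measured 2^27.98 steps. Function names are hypothetical.

```python
import math


def asymmetric_reuse(range_bits, old_bits):
    """Solve (2^x + 2^old) * 2^x = 2^range_bits for x (positive root)."""
    old = 2.0 ** old_bits
    n = 2.0 ** range_bits
    y = (-old + math.sqrt(old * old + 4.0 * n)) / 2.0
    return math.log2(y)


def effective_old_bits(range_bits, total_steps_bits):
    """Old points e such that (2^x + e) * 2^x = 2^range_bits, with 2 * 2^x = total steps."""
    y = 2.0 ** (total_steps_bits - 1.0)
    return math.log2(2.0 ** range_bits / y - y)


x = asymmetric_reuse(54, 25.876)
print(round(x, 2), round(x + 1.0, 2))            # x and total steps (tame + wild)
print(round(effective_old_bits(54, 27.98), 1))   # what the old DPs were "worth"
```

The gap between 25.876 and the effective value is about 4 bits, which is the factor of 16 mentioned above.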

Etar

sr. member

Activity: 653

Merit: 316

Quote from: COBRAS on June 18, 2020, 12:22:22 PM

"For 1000 pubkeys"

I think I don't understand something, but for 1000 pubkeys 2^37.93 is very fast.

Expected op for 1 pubkey = 2^28.06
1000 pubkeys ~ 2^9.965
2^28.06 * 2^9.965 = 2^(28.06 + 9.965) = 2^38.025
So 2^37.93 is a little bit faster than expected.
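The same comparison as a minimal sketch, using only the numbers quoted in the thread (expected 2^28.06 per key, 1000 keys, measured total of 261530636052 operations):

```python
import math

EXPECTED_ONE_BITS = 28.06   # expected ops per pubkey, from the solver output
N_KEYS = 1000
MEASURED_TOTAL = 261530636052

expected_all_bits = EXPECTED_ONE_BITS + math.log2(N_KEYS)
measured_all_bits = math.log2(MEASURED_TOTAL)

print(round(expected_all_bits, 3), round(measured_all_bits, 2))
```

Adding exponents is the same as multiplying the operation counts, which is why the 1000 keys contribute log2(1000), about 9.97 bits.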

Topic: Pollard's kangaroo ECDLP solver - page 95. (Read 60381 times)