I understand that -t is the number of threads.
The question is why the jump rate increases while the solve time does not decrease; often it even increases.
1) Pollard's algorithm is probabilistic in nature (birthday paradox, Kruskal's card-trick principle, etc).
The solution time is not stable; several attempts (e.g. a timeit loop in a python script) are needed to obtain the average runtime and the average number of jumps.
The runtime is heuristic rather than deterministic.
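To illustrate the point above, here is a minimal sketch of such a measurement loop. The "solver" is a hypothetical stand-in (random guessing over a toy search space of size P, which I chose for illustration), but it shows the same behavior: run-to-run times jump around, so only an average over several runs is meaningful.

```python
import random
import statistics
import time

P = 4096  # toy search-space size (an assumption, just for illustration)

def solve_toy(target: int) -> int:
    """Stand-in for a probabilistic solver: guess at random until the
    target is hit, so the runtime varies run to run, like Pollard's method."""
    tries = 0
    while True:
        tries += 1
        if random.randrange(P) == target:
            return tries

def mean_runtime(runs: int = 5):
    """Repeat the solve several times and average, as the post suggests."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        solve_toy(random.randrange(P))
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)
```

The standard deviation returned alongside the mean is what makes the "heuristic, not deterministic" runtime visible in numbers.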
2) Threads should run at equal speed.
If -t N is set higher than the number of real cores, the threads will compete for them; some will run faster and some will fall behind.
3) Maybe my code has a bug; more tests are needed.
#################
post dedicated to cuda/opencl implementations
There are several ready-made GPU implementations to start from.
Good candidates to rewrite:
1) cuda, c++, VanitySearch, by Jean_Luc
https://github.com/JeanLucPons/VanitySearch
2) cuda/opencl, c++, BitCrack, by brichard19
https://github.com/brichard19/BitCrack
3) cuda, python, pollard-rho, by brichard19
https://github.com/brichard19/ecdl
4) cuda, ?, pollard-rho
github.com/beranm14/pollard_rho/
5) opencl, C#, pollard-rho
github.com/twjvdhorst/Parallel-Pollard-Rho/
I intend to evaluate their performance, and to rewrite one or more of them to the kangaroo method.
#################
multicore, CPU only, c++, based on the VanitySearch 1.15 engine
https://github.com/Telariust/vs-kangaroo/releases
The bignum lib by Jean_Luc is good, only ~15% slower than the bitcoin-core/secp256k1 lib (with raw asm mult).
C:\Users\User\source\repos\VanitySearch-1.15_kangaroo\x64\Rel_SM52>vs-kangaroo -v 1 -t 4 -bits 42
[###########################################################]
[# Pollard-kangaroo PrivKey Recovery Tool #]
[# (based on engine of VanitySearch 1.15) #]
[# bitcoin ecdsa secp256k1 #]
[# ver0.01 #]
[###########################################################]
[DATE(utc)] 08 Oct 2019 23:07:47
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[pow2bits] 42
[rangeW] 2^41..2^42 ; W = U - L = 2^41
[DPsize] 1024 (hashtable size)
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[pubkey#42] loaded
[Xcoordinate] EEC88385BE9DA803A0D6579798D977A5D0C7F80917DAB49CB73C9E3927142CB6
[Ycoordinate] 28AFEA598588EA50A6B11E552F8574E0B93ABD5595F5AA17EA3BE5304103D255
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[+] Sp-table of pow2 points - ready
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[+] recalc Sp-table of multiply UV
[UV] U*V=1*3=3 (0x03)
[optimal_mean_jumpsize] 2097152
[meanjumpsize#24] 2097151(now) <= 2097152(optimal) <= 4026531(next)
[i] Sp[24]|J------------------------------------------------------------|Sp[25]
[JmaxofSp] Sp[24]=2097151 nearer to optimal mean jumpsize of Sp set
[DPmodule] 2^18 = 262144 (0x0000000000040000)
[+] 1T+3W kangaroos - ready
[CPU] threads: 4
[th][tame#1] run..
[th][wild#1] run..
[th][wild#2] run..
[th][wild#3] run..
[|][ 00:00:04 ; 1.1M j/s; 4.0Mj 107.4%; dp/kgr=10.0; 00:00:00 ]
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[prvkey#42] 0x000000000000000000000000000000000000000000000000000002A221C58D8F
[i] 1.0M j/s; 5.0Mj of 4.0Mj 123.6%; DP 1T+3W=8+14=22; dp/kgr=11.0;
[runtime] 00:00:05
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[DATE(utc)] 08 Oct 2019 23:07:53
[~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]
[x] EXIT
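The `[DPmodule] 2^18` line in the log above refers to the distinguished-point filter. As a sketch of the standard technique (an assumption about this tool's internals, not its actual code): a point is "distinguished" when the low bits of its x-coordinate are all zero, and only those points are stored in the hashtable, which keeps memory small.

```python
DP_BITS = 18                  # from the log above: [DPmodule] 2^18 = 262144
DP_MASK = (1 << DP_BITS) - 1  # 0x3FFFF

def is_distinguished(x: int) -> bool:
    """A point is 'distinguished' if the low DP_BITS bits of its
    x-coordinate are zero; only such points go into the hashtable."""
    return (x & DP_MASK) == 0

# On average one x value in 2^18 = 262144 passes the filter:
count = sum(is_distinguished(x) for x in range(1_000_000))  # -> 4
```

With this filter, two kangaroos that land on the same trail are still detected, because both eventually walk onto the same distinguished point.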
#################
BitCrack runs at about 715 MKeys/s on a Tesla V100 -
look here.
If we remove hash160, the rate would be about 1430 Mj/s, where Mj stands for million kangaroo jumps (not keys).
BitCrack uses batch packet inversion (an x8 speed-up).
We can't (OR CAN we?..) use it in the kangaroo method, so 1430/8 ≈ 179 Mj/s.
But the screenshots show 4xV100 = 6515 Mj/s, i.e. about 1600 Mj/s each.
Unclear for now..
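For reference, "batch packet inversion" almost certainly refers to Montgomery's simultaneous-inversion trick (my reading, not confirmed by the BitCrack source here): n field elements are inverted with a single modular inversion plus about 3n multiplications, amortizing the one expensive operation in affine point addition across a whole batch. A minimal sketch:

```python
def batch_inverse(values, p):
    """Montgomery's trick: invert n nonzero field elements mod prime p
    using one modular inversion plus ~3n multiplications."""
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = (prefix[i] * v) % p   # running products v0*v1*...*vi
    inv_all = pow(prefix[n], p - 2, p)        # single inversion (Fermat)
    invs = [0] * n
    for i in range(n - 1, -1, -1):
        invs[i] = (prefix[i] * inv_all) % p   # peel off one factor
        inv_all = (inv_all * values[i]) % p
    return invs
```

The open question in the post is whether kangaroo walks, which are sequential per kangaroo, can still be batched this way across many kangaroos stepping in lockstep; if they can, the /8 penalty above would not apply.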
#################
https://bitcointalksearch.org/topic/m.48224432
(November 25, 2018) This man speaks as if he already has a finished implementation.
O great master, j2002ba2, I urge you!
Please come and explain to us miserable mortals what s2, s1, m3, m4 mean in your program!
#################
https://docs.nvidia.com/cuda/volta-tuning-guide/index.html
The high-priority recommendations from those guides are as follows:
- Find ways to parallelize sequential code;
- Minimize data transfers between the host and the device;
- Adjust kernel launch configuration to maximize device utilization;
- Ensure global memory accesses are coalesced;
- Minimize redundant accesses to global memory whenever possible;
- Avoid long sequences of diverged execution by threads within the same warp;
About the last one, "Avoid long sequences of diverged execution by threads within the same warp":
Pollard's (U+V) parallelization escapes the problem of collisions between kangaroos of the same herd;
this allows us to drop the collision-correction block entirely, because adding an if () inevitably breaks GPU parallelism.
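To make the branch-free point concrete, here is a toy sketch (additive walk over plain integers, no EC math, all names mine) of how kangaroo stepping avoids data-dependent branches: the jump is chosen by indexing a precomputed jump table (the Sp-table in the log above) with the low bits of the current position, so every thread in a warp executes exactly the same instruction sequence.

```python
# Toy model of a branchless kangaroo step: the jump size is looked up
# by the low bits of x, so there is no if/else in the inner loop and
# no warp divergence when many threads run this in lockstep.
JUMP_TABLE_BITS = 5                                     # 32 precomputed jumps
JUMPS = [1 << i for i in range(1 << JUMP_TABLE_BITS)]   # toy jump sizes 2^i
INDEX_MASK = (1 << JUMP_TABLE_BITS) - 1

def step(x: int) -> int:
    """One kangaroo jump: table lookup instead of branching on x."""
    return x + JUMPS[x & INDEX_MASK]
```

On a GPU the same idea keeps all kangaroos of a warp on one code path; only the table index differs per thread, which is a coalesced memory access rather than a branch.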
#################
to be continued..