BSGS solver for cuda - page 9. | Bitcointalksearch.org

studyroom1

jr. member

Activity: 40

Merit: 7

In reply to Etar, October 15, 2021, 08:51:19 AM:

Awesome bro, thanks for sharing all this. Will test it.

Etar

sr. member

Activity: 653

Merit: 316

Seems like I fixed the app.

I replaced most commands with the unofficial _v2 versions.

Code:

GPU #0 launched
GPU #0 TotalBuff: 5168.000Mb
GPU#0 Cnt:0000000000000000000000000000000000000000000000000000000000000001
GPU#0 Cnt:00000000000000000000000000000000000000000000000015ea000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:0000000000000000000000000000000000000000000000002bd4000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:00000000000000000000000000000000000000000000000041be000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:00000000000000000000000000000000000000000000000057a8000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:0000000000000000000000000000000000000000000000006d92000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:000000000000000000000000000000000000000000000000835a000000000001 696MKey/s x536870912 2^29.44 x2^30=2^59.44
GPU#0 Cnt:0000000000000000000000000000000000000000000000009922000000000001 696MKey/s x536870912 2^29.44 x2^30=2^59.44
***********GPU#0************
Total solutions: 1
KEY!!>000000000000000000000000000000000000000000000001a838b13505b26867
Pub: 30210c23b1a047bc9bdbb13448e67deddc108946de6de639bcc75d47c0216b1be383c4a8ed4fac77c0d2ad737d8499a362f483f8fe39d1e86aaed578a9455dfc
****************************
Found in 17 seconds

The result above is with -w 29.
Speed also increased a little.
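
The figures at the end of each log line can be reproduced with a little arithmetic. This is a minimal sketch, assuming (from the numbers in the log only, not from the program's source) that 1 MKey means 2^20 keys and that the multiplier is 2^(w+1); the function name is made up for illustration.

```python
import math

def effective_rate_log2(mkeys_per_s: float, w: int) -> float:
    """log2 of the effective keys/s figure shown at the end of a log line.

    Assumptions read off the log above, not taken from the source code:
    1 MKey = 2**20 keys, and the raw rate is multiplied by 2**(w + 1).
    """
    raw_log2 = math.log2(mkeys_per_s) + 20   # raw hashrate as a power of two
    return raw_log2 + (w + 1)                # with -w 29 this adds 30

print(f"2^{effective_rate_log2(697, 29):.2f}")
print(f"2^{effective_rate_log2(696, 29):.2f}")
```

Under those assumptions the two printed values should line up with the 2^59.45 and 2^59.44 figures in the log.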

studyroom1

jr. member

Activity: 40

Merit: 7

In reply to Etar, October 15, 2021, 06:06:56 AM:

Awesome, big thanks.

Etar

sr. member

Activity: 653

Merit: 316

In reply to studyroom1, October 15, 2021, 05:54:05 AM:

-t use 512 for your 3080.
-b use 68; it should be a multiple of your card's SM count (the 3080 has 68 SMs).
-p use 256; this value is how many x-points each thread computes in the kernel.
-w is the number of baby steps: -w 26 means creating an array of size 2^26. The larger this array, the bigger the giant step. But you should check your hashrate when you increase -w; it shouldn't drop by more than 1.5 times. For example, if your hashrate with -w 26 is 1500 Mkeys and with -w 27 it is more than 1000 Mkeys, then it makes sense to increase -w.

-htsz use the default 25; it is the size of the hash table. You should change -htsz only if your baby array (-w) is smaller than the hash table size.
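
The rule of thumb for -w can be written down as a quick check. This is a minimal sketch, assuming you have measured the hashrate yourself at two consecutive -w values; the function name and the 1.5 default only restate the advice above and are not part of the program.

```python
def worth_increasing_w(rate_before: float, rate_after: float,
                       max_drop: float = 1.5) -> bool:
    """Return True if raising -w by one still pays off under the 1.5x rule.

    rate_before and rate_after are measured hashrates (same unit) at
    -w N and -w N+1. The 1.5 threshold is the rule of thumb quoted above.
    """
    if rate_before <= 0 or rate_after <= 0:
        raise ValueError("hashrates must be positive")
    return rate_before / rate_after <= max_drop

# Example from the post: 1500 Mkeys at -w 26, a bit over 1000 at -w 27.
print(worth_increasing_w(1500, 1001))
# The 2080 Ti case mentioned in this thread: 570 Mkeys dropping to 81.
print(worth_increasing_w(570, 81))
```

The first call passes the check and the second does not, which matches the advice to stop raising -w once the hashrate collapses.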

studyroom1

jr. member

Activity: 40

Merit: 7

Please take a look at these two URLs; they fixed this issue there.

https://github.com/BOINC/boinc/issues/1773
https://github.com/BOINC/boinc/pull/2707
Perhaps you will get some clue.

studyroom1

jr. member

Activity: 40

Merit: 7

In reply to Etar, October 15, 2021, 05:43:34 AM:

Some questions I have, for my understanding:

Does memory allocation on the GPU make a difference in speed?
How do I find the optimal -t, -p and -b values for my card (3080)?
What are the roles of -w and -htsz?
And what is item size?
Can I use more of the computer's RAM to get a speed boost, since I have 128GB of memory? If yes, how?

Etar

sr. member

Activity: 653

Merit: 316

In reply to studyroom1, October 15, 2021, 05:36:10 AM:

Usually people use the CUDA runtime API; it is a different library, incompatible with the CUDA driver API.
I tried to solve the 32-bit limitation a few years ago, as soon as the first cards with more than 4GB of memory appeared.
But unfortunately this limit could not be overcome.
And do you need to utilize all the memory?
On my 2080 Ti, already at -w 27 the hashrate drops from 570 Mkeys to 81, while on a 3070 everything is fine.
So you first need to check how your hashrate decreases as you increase the -w parameter.
Here is the output with cuDeviceTotalMem_v2:

Code:

APP VERSION: 1.2.1
Found 1 Cuda device.
Cuda device:GeForce RTX 2080 Ti(11264Mb)
Device have: MP:68 Cores+4352
Shared memory total:49152
Constant memory total:65536

It returns correct 64-bit values, but that is only informational; it didn't help to remove the limitation in the other CUDA commands.

studyroom1

jr. member

Activity: 40

Merit: 7

In reply to Etar, October 15, 2021, 05:31:56 AM:

Ahan, I am not good at CUDA or programming, but if I use -i in Kangaroo, it returns the correct memory parameters.

Is it possible to mix in some code from the Kangaroo side? Or is there any way to hardcode the memory?

Etar

sr. member

Activity: 653

Merit: 316

In reply to studyroom1, October 15, 2021, 05:27:15 AM:

I can only add code to get the correct number of Ampere cores.
I can't fix the memory issue (it is really a matter of 32-bit values being returned instead of 64-bit ones) because I can't use the unofficial _v2 commands together with the official commands in the same app.

studyroom1

jr. member

Activity: 40

Merit: 7

In reply to Etar, October 15, 2021, 05:24:17 AM:

Thanks man for the information. Can you please fix the memory and Ampere issues? Is it possible? And please recompile it, as I am unable to compile it with PureBasic; the free version has limitations.

Etar

sr. member

Activity: 653

Merit: 316

In reply to studyroom1, October 15, 2021, 03:29:17 AM:

The program uses the CUDA driver API (not the runtime API that is usually used), and the GPU code is written in PTX.
cuda.lib, which is used to call the CUDA driver API, always returns 32-bit values, even in the x64 version.
In that case you can't use/allocate more than 2**32 bytes of GPU memory.
Also, cuDeviceTotalMem() returns a 32-bit memory value; that is why you see 4095Mb.
I wrote to NVIDIA about this issue a few times, but according to them there is no problem.
If you look into cuda.lib you will find unofficial commands like cuDeviceTotalMem_v2 and others.
All these commands have the suffix _v2, and they return correct 64-bit values.
But NVIDIA says that they do not have commands with the suffix _v2.
That covers the limitation of 2**32 bytes of GPU memory.
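
For reference, the 4095Mb figure is what you get when a byte count cannot exceed 32 bits. The sketch below only illustrates that arithmetic; it does not call CUDA, and the names in it are made up.

```python
MIB = 1 << 20

def max_reportable_mib(bits: int) -> int:
    """Largest memory size, in whole MiB, that fits in an unsigned integer
    of the given width when the size is stored in bytes."""
    return ((1 << bits) - 1) // MIB

print(max_reportable_mib(32))          # 4095
card_bytes = 10 * 1024 * MIB           # a 10GB card such as the RTX 3080
print(card_bytes > (1 << 32) - 1)      # True: does not fit in 32 bits
```

As far as I know, NVIDIA's cuda.h header maps the name cuDeviceTotalMem to cuDeviceTotalMem_v2 with a macro, which may be why the _v2 names show up in cuda.lib without being documented separately.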
About "Device have: MP:68 Cores+0": the 0 is there because I didn't add Ampere to the program:

Code:

          Case 2 ; Fermi
            Debug "Fermi"
            If minor=1
              cores = mp * 48
            Else
              cores = mp * 32
            EndIf
          Case 3 ; Kepler
            Debug "Kepler"
            cores = mp * 192

          Case 5 ; Maxwell
            Debug "Maxwell"
            cores = mp * 128

          Case 6 ; Pascal
            Debug "Pascal"
            cores = mp * 64

          Case 7 ; Volta / Turing
            Debug "Turing RTX"
            cores = mp * 64
          Default
            Debug "Unknown device type"
        EndSelect

By the way, this is needed only for information and nothing more.
To get the correct number of cores, only this needs to be added:

Code:

          Case 8; Ampere 
            Debug "Ampere RTX"
            cores = mp * 128
          Default
            Debug "Unknown device type"
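
The same lookup can be written as a table for anyone who wants to check the numbers outside PureBasic. This is a sketch in Python; the per-SM counts for the individual minor versions come from NVIDIA's published figures as I understand them, not from this program, and the function name is hypothetical. Note that it differs from the snippet above for compute capability 6.1 and 8.0.

```python
# (major, minor) compute capability -> CUDA cores per multiprocessor (SM).
CORES_PER_SM = {
    (2, 0): 32, (2, 1): 48,                 # Fermi
    (3, 0): 192, (3, 5): 192, (3, 7): 192,  # Kepler
    (5, 0): 128, (5, 2): 128,               # Maxwell
    (6, 0): 64, (6, 1): 128,                # Pascal
    (7, 0): 64, (7, 5): 64,                 # Volta, Turing
    (8, 0): 64, (8, 6): 128,                # Ampere
}

def total_cores(mp: int, major: int, minor: int) -> int:
    """Return mp * cores-per-SM, or 0 for a device type not in the table
    (the same 'Cores+0' seen in the program output)."""
    return mp * CORES_PER_SM.get((major, minor), 0)

print(total_cores(68, 7, 5))  # RTX 2080 Ti: 4352
print(total_cores(68, 8, 6))  # RTX 3080: 8704
```

The 2080 Ti result matches the "MP:68 Cores+4352" line in the program output, and the 3080 result is the "8k+" figure mentioned in the question.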

studyroom1

jr. member

Activity: 40

Merit: 7

I agree with you, but the free PureBasic version can only compile small programs, so that's why I need help from @Etar.

And the program sets the memory automatically but calculates it wrong.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

In reply to studyroom1, October 15, 2021, 03:29:17 AM:

There is no need to wait for a patch; you can independently get these stats on an NVIDIA card using their sample deviceQuery program: https://github.com/NVIDIA/cuda-samples/blob/master/Samples/deviceQuery/deviceQuery.cpp. It needs to be compiled from source, but that is extremely easy to do since it's only a single file.

studyroom1

jr. member

Activity: 40

Merit: 7

I think I found the problem: the information the program pulls from the device is wrong, or these are max values intentionally hardcoded in the program. Etar, can you please make it all dynamic? I mean the device should report all parameters.

Found 1 Cuda device.
Cuda device:GeForce RTX 3080(4095Mb) <- wrong
Device have: MP:68 Cores+0 <- wrong
Shared memory total:49152 <- I guess this is system memory, but 128GB is available
Constant memory total:65536 <- not sure how this one is calculated

I am not sure, but I think MP is a unit for AMD cards and CUDA cores are for NVIDIA. The 3080 has 8k+ CUDA cores, so I am not sure what the 68 here means.
So many confusions.

studyroom1

jr. member

Activity: 40

Merit: 7

In reply to NotATether, October 15, 2021, 02:52:45 AM:

The source code is available, I guess, here: https://github.com/Etayson/BSGS-cuda/blob/main/bsgscudaussualHTchangeble1_2.pb. Can you check please?

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

In reply to studyroom1, October 15, 2021, 02:37:15 AM:

Possibly this is due to "memory fragmentation": when the program allocates GPU memory for one struct, it may land in the middle of GPU memory, and that limits the maximum contiguous allocation available on the GPU for other structs.

The resolution is to allocate the largest structure first (in this case the TotalBuff) and the smaller ones last. It requires a code modification though, which is impossible to do without the source code.

studyroom1

jr. member

Activity: 40

Merit: 7

GPU #0 launched
GPU #0 TotalBuff: 8112.000Mb
error cuMemAlloc-2
Press Enter to exit

I guess you hardcoded 4096 for GPU memory, as I tried everything but I am unable to utilize the full GPU memory. My GPU is a 3080 with 10GB.

This is the max I can use:

GPU #0 launched
GPU #0 TotalBuff: 3216.000Mb

Speed is also slower than Kangaroo; I am getting around 1200M. But I want to tweak it to utilize max GPU memory and max RAM with max power. Increasing the item size will slow down the speed and take longer to solve.

Any idea how to tweak it?

Minase

member

Activity: 72

Merit: 43

I don't have a 3xxx series card available, but based on specs I can calculate the average speed.
With one 3090 or 3080 Ti, 2^51 operations should be done in 6-7 days.
//edit
Based on your previous post, your Tesla K80 will find (if lucky) a private key, if it's in the 100-bit range, in ~25-26 days.

hamnaz

newbie

Activity: 9

Merit: 0

In reply to Minase, October 14, 2021, 04:25:35 AM:

With the RTX 3xxx series, could it maybe be done in hours?
Above are 2 randomly generated keys, one from the first half and the second from the second half of the 100-bit range. I want to know how fast the RTX 3xxx series can find them, because I need to calculate times. If you have an RTX card and some time to find the above pubkeys in the 100-bit range, it will help me estimate the 3xxx series' power.
Thanks.

Minase

member

Activity: 72

Merit: 43

Quote from: hamnaz on October 13, 2021, 01:48:29 PM

Quote from: PrivatePerson on October 13, 2021, 12:50:50 PM

Is it really possible to find a 100-bit key on one video card? How long does it take for this?

As I see it, the 100-bit puzzle was picked by telaurist, who wrote the first Kangaroo version for CPU, and he used 1 GPU to find it.
Maybe the latest cards do it fast.

It's quite possible to solve the 100-bit puzzle with a single video card, and not even the most powerful one (Kangaroo method).
On a single RTX 2060 you can find such a key in 34-35 days (2^51 operations). Sometimes you don't even need the full 2^51; you can find the key when you reach 2^50 (which means half the time, ~17 days).
For an RTX 2080 the speed is almost 50% higher than the 2060, which gives ~23 days for the full 2^51 operations.
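
These estimates are just operations divided by speed. Here is a minimal sketch of the conversion, assuming a constant rate in keys per second; the function names are made up, and 34.5 days is simply the midpoint of the 34-35 day figure above.

```python
SECONDS_PER_DAY = 86_400

def days_for_ops(log2_ops: float, keys_per_s: float) -> float:
    """Days needed to perform 2**log2_ops operations at a constant rate."""
    return 2.0 ** log2_ops / keys_per_s / SECONDS_PER_DAY

def implied_rate(log2_ops: float, days: float) -> float:
    """Keys per second implied by finishing 2**log2_ops operations in `days`."""
    return 2.0 ** log2_ops / (days * SECONDS_PER_DAY)

# Rate implied by "2^51 operations in 34-35 days" (taking 34.5 days).
rate = implied_rate(51, 34.5)
print(f"{rate / 1e6:.0f} Mkeys/s")
# Stopping at 2^50 instead of 2^51 halves the time at the same rate.
print(f"{days_for_ops(50, rate):.2f} days")
```

The implied speed comes out at roughly 755 Mkeys/s for the RTX 2060 figure, and the 2^50 case lands on about 17 days, as stated in the post.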

Topic: BSGS solver for cuda - page 9. (Read 3980 times)