Topic: BSGS solver for cuda - page 9. (Read 3827 times)

jr. member
Activity: 40
Merit: 7
October 15, 2021, 01:33:39 PM
#47
Awesome, bro, thanks for sharing all this. I will test it.
sr. member
Activity: 652
Merit: 316
October 15, 2021, 08:51:19 AM
#46
Seems like I fixed the app :D I replaced most commands with the unofficial _v2 versions.
Code:
GPU #0 launched
GPU #0 TotalBuff: 5168.000Mb
GPU#0 Cnt:0000000000000000000000000000000000000000000000000000000000000001
GPU#0 Cnt:00000000000000000000000000000000000000000000000015ea000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:0000000000000000000000000000000000000000000000002bd4000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:00000000000000000000000000000000000000000000000041be000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:00000000000000000000000000000000000000000000000057a8000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:0000000000000000000000000000000000000000000000006d92000000000001 697MKey/s x536870912 2^29.45 x2^30=2^59.45
GPU#0 Cnt:000000000000000000000000000000000000000000000000835a000000000001 696MKey/s x536870912 2^29.44 x2^30=2^59.44
GPU#0 Cnt:0000000000000000000000000000000000000000000000009922000000000001 696MKey/s x536870912 2^29.44 x2^30=2^59.44
***********GPU#0************
Total solutions: 1
KEY!!>000000000000000000000000000000000000000000000001a838b13505b26867
Pub: 30210c23b1a047bc9bdbb13448e67deddc108946de6de639bcc75d47c0216b1be383c4a8ed4fac77c0d2ad737d8499a362f483f8fe39d1e86aaed578a9455dfc
****************************
Found in 17 seconds
The result above was obtained with -w 29.
The speed has also increased a little.
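For anyone checking the numbers in the log above, here is a small Python sketch of the arithmetic. It assumes (the thread does not say so) that "MKey" in the log means 2^20 keys rather than 10^6. Under that assumption 697 MKey/s works out to about 2^29.45 keys per second, and the x2^30 multiplier printed by the program gives the 2^59.45 figure. The function names are made up for illustration.

```python
import math

MKEY = 2 ** 20  # assumption: the log's "MKey" is 2^20 keys, not 10^6


def log2_rate(mkeys_per_s: float) -> float:
    """Return log2 of the raw rate in keys per second."""
    return math.log2(mkeys_per_s * MKEY)


def log2_effective(mkeys_per_s: float, factor_log2: int = 30) -> float:
    """Return log2 of the rate after the 'x2^30' multiplier shown in the log."""
    return log2_rate(mkeys_per_s) + factor_log2


if __name__ == "__main__":
    for rate in (697, 696):
        print(f"{rate} MKey/s -> 2^{log2_rate(rate):.2f}, "
              f"x2^30 = 2^{log2_effective(rate):.2f}")
```

With the 2^20 assumption the two rates in the log (697 and 696 MKey/s) reproduce the printed exponents, which is the only reason for picking it.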
jr. member
Activity: 40
Merit: 7
October 15, 2021, 06:09:36 AM
#45

Awesome, big thanks!
sr. member
Activity: 652
Merit: 316
October 15, 2021, 06:06:56 AM
#44

Does memory allocation on the GPU make a difference in speed?
How do I find the optimal -t, -p and -b values for my card (3080)?
What are the roles of -w and -htsz?
And what is the item size?
Can I use more system RAM to get a speed boost, since I have 128 GB of memory? If yes, how?

-t: use 512 for your 3080.
-b: use 68; it should be a multiple of your card's SM count (the 3080 has 68 SMs).
-p: use 256; this value is how many x-points each thread computes in the kernel.
-w: the number of baby steps. -w 26 means an array of size 2^26 is created; the larger this array, the bigger the giant step. But check your hashrate when you increase -w: it shouldn't drop by more than 1.5 times. For example, if your hashrate with -w 26 is 1500 MKeys/s and with -w 27 it is more than 1000 MKeys/s, then it makes sense to increase -w.

-htsz: use the default, 25; it is the size of the hash table. Change -htsz only if your baby-step array (-w) is smaller than the hash table.
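The -w rule of thumb above can be written down as a tiny check. This is only a sketch of that rule in Python; the function name and the `max_drop` parameter are made up here, and the 1.5 threshold is simply the figure from the post.

```python
def worth_increasing_w(rate_before: float, rate_after: float,
                       max_drop: float = 1.5) -> bool:
    """Return True if going from -w N to -w N+1 costs at most max_drop x hashrate.

    rate_before: hashrate measured with the current -w (any unit)
    rate_after:  hashrate measured with -w increased by one (same unit)
    """
    if rate_before <= 0 or rate_after <= 0:
        raise ValueError("hashrates must be positive")
    return rate_before / rate_after <= max_drop


if __name__ == "__main__":
    # 1500 -> 1100 MKeys/s: the drop is below 1.5x, so the larger -w pays off
    print(worth_increasing_w(1500, 1100))
    # 570 -> 81 MKeys/s (the 2080 Ti figures from this thread): drop is about 7x
    print(worth_increasing_w(570, 81))
```

So measure the hashrate at two neighbouring -w values first, and only keep the larger one if the check passes.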
jr. member
Activity: 40
Merit: 7
October 15, 2021, 06:06:52 AM
#43
Please take a look at these two URLs; they fixed this issue there.

https://github.com/BOINC/boinc/issues/1773
https://github.com/BOINC/boinc/pull/2707
Perhaps they will give you a clue.
jr. member
Activity: 40
Merit: 7
October 15, 2021, 05:54:05 AM
#42

I have some questions, for my own understanding:

Does memory allocation on the GPU make a difference in speed?
How do I find the optimal -t, -p and -b values for my card (3080)?
What are the roles of -w and -htsz?
And what is the item size?
Can I use more system RAM to get a speed boost, since I have 128 GB of memory? If yes, how?
sr. member
Activity: 652
Merit: 316
October 15, 2021, 05:43:34 AM
#41

Ahan :( I am not good at CUDA or programming, but if I use -i in Kangaroo, it returns the correct memory parameters.

Is it possible to borrow some code from Kangaroo, or is there any way to hardcode the memory size?
Usually people use the CUDA runtime API; it is a different library and is incompatible with the CUDA driver API.
I tried to solve the 32-bit limitation a few years ago, as soon as the first cards with more than 4 GB of memory appeared.
Unfortunately, I could not overcome this limit.
And do you need to use all of the memory?
On my 2080 Ti, already at -w 27 the hashrate drops from 570 MKeys/s to 81, while on a 3070 everything is fine.
So you first need to check how your hashrate decreases as you increase the -w parameter.
Here is the output with cuDeviceTotalMem_v2:
Code:
APP VERSION: 1.2.1
Found 1 Cuda device.
Cuda device:GeForce RTX 2080 Ti(11264Mb)
Device have: MP:68 Cores+4352
Shared memory total:49152
Constant memory total:65536
It returns correct 64-bit values, but that is only informational; it doesn't help to solve all the limitations in the CUDA commands.
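To see where the 4095Mb figure and the failing allocations discussed in this thread come from, here is a minimal Python sketch of the 32-bit ceiling. It only does the arithmetic; it does not call CUDA, and whether the old entry point clamps or truncates the value is not established here. The helper names are made up. As far as I know, NVIDIA's cuda.h maps the plain names such as cuDeviceTotalMem to the _v2 entry points with #define, which would explain why a binding that links the plain names directly gets the old 32-bit behaviour, but treat that as an assumption about this particular program.

```python
MIB = 2 ** 20
U32_MAX = 2 ** 32 - 1  # largest byte count a 32-bit unsigned value can hold


def max_mb_in_32bit() -> int:
    """Largest whole number of Mb that a 32-bit byte counter can report."""
    return U32_MAX // MIB


def fits_in_32bit(size_mb: float) -> bool:
    """True if a buffer of size_mb Mb can be described by a 32-bit byte count."""
    return size_mb * MIB <= U32_MAX


if __name__ == "__main__":
    print(max_mb_in_32bit())  # ceiling for a 32-bit total-memory report
    # TotalBuff sizes quoted in this thread
    for size_mb in (3216.0, 5168.0, 8112.0):
        print(size_mb, fits_in_32bit(size_mb))
```

The 3216Mb buffer that worked is under the ceiling, while the 8112Mb one that failed with cuMemAlloc and the 5168Mb one from the fixed build are above it, which is consistent with the 64-bit _v2 calls being needed for the larger buffers.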
jr. member
Activity: 40
Merit: 7
October 15, 2021, 05:36:10 AM
#40


Ahan :( I am not good at CUDA or programming, but if I use -i in Kangaroo, it returns the correct memory parameters.

Is it possible to borrow some code from Kangaroo, or is there any way to hardcode the memory size?
sr. member
Activity: 652
Merit: 316
October 15, 2021, 05:31:56 AM
#39


Thanks, man, for the information. Can you please fix the memory and Ampere issues? Is it possible? And could you recompile it? I am unable to compile it with PureBasic, since the free version has limitations.
I can only add code to get the correct number of Ampere cores.
I can't fix the memory issue (more precisely, the 32-bit values returned instead of 64-bit ones), because I can't use the unofficial _v2 commands together with the official commands in the same app.
jr. member
Activity: 40
Merit: 7
October 15, 2021, 05:27:15 AM
#38
Thanks, man, for the information. Can you please fix the memory and Ampere issues? Is it possible? And could you recompile it? I am unable to compile it with PureBasic, since the free version has limitations.
sr. member
Activity: 652
Merit: 316
October 15, 2021, 05:24:17 AM
#37
I think I found the problem: the information the program pulls from the device is wrong, or these are maximum values intentionally hardcoded in the program. Etar, can you please make it all dynamic? I mean the device should report all its parameters.

Found 1 Cuda device.
Cuda device:GeForce RTX 3080(4095Mb)    wrong
Device have: MP:68 Cores+0                    wrong
Shared memory total:49152                      I guess this is system memory, but 128 GB is available
Constant memory total:65536                    not sure how this one is calculated

I am not sure, but I think MP is a unit for AMD cards and CUDA is for Nvidia. The 3080 has 8k+ CUDA cores, so I am not sure what the 68 cores here are.
So much confusion.
The program uses the CUDA driver API (not the runtime API that is usually used), and the GPU code is written in PTX.
The cuda.lib used to call the CUDA driver API always returns 32-bit values, even in the x64 version.
Because of that you can't use or allocate more than 2**32 bytes of GPU memory.
cuDeviceTotalMem() also returns a 32-bit memory value; that is why you see 4095Mb.
I wrote to Nvidia about these issues a few times, but according to them there is no problem )
If you look into cuda.lib you will find unofficial commands like cuDeviceTotalMem_v2 and others.
All of these commands have the suffix _v2, and they return correct 64-bit values.
But Nvidia says they do not have commands with the suffix _v2 ))
That covers the 2**32-byte GPU memory limitation.
As for Device have: MP:68 Cores+0, the 0 is there because I didn't add Ampere to the program:
Code:
          Case 2 ; Fermi
            Debug "Fermi"
            If minor=1
              cores = mp * 48
            Else
              cores = mp * 32
            EndIf
          Case 3 ; Kepler
            Debug "Kepler"
            cores = mp * 192

          Case 5 ; Maxwell
            Debug "Maxwell"
            cores = mp * 128

          Case 6 ; Pascal
            Debug "Pascal"
            cores = mp * 64

          Case 7 ; Volta / Turing (RTX 20xx)
            Debug "Volta/Turing"
            cores = mp * 64
          Default
            Debug "Unknown device type"
        EndSelect
By the way, this is needed only for information and nothing more.
To get the correct number of cores, only this needs to be added:
Code:
          Case 8; Ampere 
            Debug "Ampere RTX"
            cores = mp * 128
          Default
            Debug "Unknown device type"
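For readers who don't use PureBasic, the same lookup can be sketched in Python. This simply mirrors the table from the post above, including the added Ampere case; the function name is made up, and the table is a simplification (the per-SM core count can differ between minor revisions within a generation, which this sketch ignores except for Fermi).

```python
def cuda_cores(major: int, minor: int, mp: int) -> int:
    """Estimate CUDA cores as SM count (mp) times cores per SM.

    major/minor: compute capability of the device
    mp:          number of multiprocessors (SMs) reported by the driver
    Returns 0 for an unknown generation, like the "Cores+0" output in the thread.
    """
    if major == 2:            # Fermi
        per_sm = 48 if minor == 1 else 32
    elif major == 3:          # Kepler
        per_sm = 192
    elif major == 5:          # Maxwell
        per_sm = 128
    elif major in (6, 7):     # Pascal, Volta/Turing (as in the thread's table)
        per_sm = 64
    elif major == 8:          # Ampere (the case added in the post above)
        per_sm = 128
    else:
        return 0
    return mp * per_sm


if __name__ == "__main__":
    print(cuda_cores(8, 6, 68))  # RTX 3080: 68 SMs
    print(cuda_cores(7, 5, 68))  # RTX 2080 Ti: 68 SMs
```

For the two cards in this thread this gives 8704 cores for the 3080 (the "8k+" mentioned above) and 4352 for the 2080 Ti, matching the "Cores+4352" line in the output posted above. The 68 in "MP:68" is the number of SMs, not the number of cores.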
jr. member
Activity: 40
Merit: 7
October 15, 2021, 05:21:44 AM
#36
I agree with you, but the free PureBasic version can only compile small programs, so that's why I need help from @Etar.

And the program sets the memory automatically, but calculates it wrong.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 15, 2021, 05:11:54 AM
#35
There is no need to wait for a patch; you can get these stats independently on an NVIDIA card using their sample deviceQuery program: https://github.com/NVIDIA/cuda-samples/blob/master/Samples/deviceQuery/deviceQuery.cpp - It needs to be compiled from source, but that is extremely easy to do since it's only a single file.
jr. member
Activity: 40
Merit: 7
October 15, 2021, 03:29:17 AM
#34
I think I found the problem: the information the program pulls from the device is wrong, or these are maximum values intentionally hardcoded in the program. Etar, can you please make it all dynamic? I mean the device should report all its parameters.

Found 1 Cuda device.
Cuda device:GeForce RTX 3080(4095Mb)    wrong
Device have: MP:68 Cores+0                     wrong
Shared memory total:49152                      I guess this is system memory, but 128 GB is available
Constant memory total:65536                    not sure how this one is calculated

I am not sure, but I think MP is a unit for AMD cards and CUDA is for Nvidia. The 3080 has 8k+ CUDA cores, so I am not sure what the 68 cores here are.
So much confusion.
jr. member
Activity: 40
Merit: 7
October 15, 2021, 03:24:12 AM
#33
The source code is available, I guess, here: https://github.com/Etayson/BSGS-cuda/blob/main/bsgscudaussualHTchangeble1_2.pb Can you check, please?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
October 15, 2021, 02:52:45 AM
#32
The speed is also slower than Kangaroo; I am getting around 1200M. I want to tweak it to use the maximum GPU memory and RAM at full power, but increasing the item size slows it down and makes it take longer to solve.

Any idea how to tweak it?

Possibly due to "memory fragmentation" that happens when the program allocates GPU memory for one struct: it's allocated in the middle of GPU memory, and that limits the maximum contiguous memory allocation allowed on the GPU for other structs.

The resolution for it is to allocate the largest structure first (in this case the TotalBuff) and then the smaller ones last. It requires a code modification though, which is impossible to do without the source code.
jr. member
Activity: 40
Merit: 7
October 15, 2021, 02:37:15 AM
#31
GPU #0 launched
GPU #0 TotalBuff: 8112.000Mb
error cuMemAlloc-2
Press Enter to exit

I guess you hardcoded 4096 MB of GPU memory, because I tried everything but I am unable to use the full GPU memory. My GPU is a 3080 with 10 GB.

This is the maximum I can use:

GPU #0 launched
GPU #0 TotalBuff: 3216.000Mb

      

The speed is also slower than Kangaroo; I am getting around 1200M. I want to tweak it to use the maximum GPU memory and RAM at full power, but increasing the item size slows it down and makes it take longer to solve.

Any idea how to tweak it?
member
Activity: 72
Merit: 43
October 14, 2021, 05:08:14 AM
#30
I don't have a 3xxx series card available, but based on the specs I can estimate the average speed.
With one 3090 or 3080 Ti, 2^51 operations should be done in 6-7 days.
//edit
Based on your previous post, your Tesla K80 will find (if lucky) a private key in a 100-bit range in ~25-26 days.
newbie
Activity: 9
Merit: 0
October 14, 2021, 04:45:59 AM
#29
With the RTX 3xxx series, could it maybe be done in hours?
Above are 2 randomly generated keys, one from the first half and the other from the second half of the 100-bit range. I want to know how fast an RTX 3xxx series card can find them, because I need to calculate times. If you have an RTX card and some time to search for the above pubkeys in the 100-bit range, it will help me gauge the 3xxx performance.
Thanks.
member
Activity: 72
Merit: 43
October 14, 2021, 04:25:35 AM
#28
Is it really possible to find a 100-bit key on one video card? How long does it take for this?
As I see it, the 100-bit puzzle was solved by telaurist, who wrote the first Kangaroo version for CPU, and he used 1 GPU to find it.
Maybe the latest cards do it fast.

It's quite possible to solve the 100-bit puzzle with a single video card, and not even the most powerful one (Kangaroo method).
On a single RTX 2060 you can find such a key in 34-35 days (2^51 operations). Sometimes you don't even need the full 2^51; you can find the key when you reach 2^50 (which means half the time, ~17 days).
As for the RTX 2080, it is almost 50% faster than the 2060, which leads to ~23 days for the full 2^51 range.
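The day counts above follow from simple division, sketched here in Python. The function names are made up, and the rate used for the RTX 2060 is not a measured number: it is only what the "34-35 days for 2^51 operations" figure implies.

```python
SECONDS_PER_DAY = 86_400


def days_for(ops_log2: int, ops_per_second: float) -> float:
    """Days needed to perform 2^ops_log2 operations at a constant rate."""
    return 2 ** ops_log2 / ops_per_second / SECONDS_PER_DAY


def implied_rate(ops_log2: int, days: float) -> float:
    """Operations per second implied by finishing 2^ops_log2 operations in `days`."""
    return 2 ** ops_log2 / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    rate_2060 = implied_rate(51, 34.5)            # taken from the 34-35 day estimate
    print(f"implied 2060 rate: {rate_2060 / 1e6:.0f} M ops/s")
    print(f"2^50 at that rate: {days_for(50, rate_2060):.2f} days")
    print(f"2^51 at 1.5x rate: {days_for(51, 1.5 * rate_2060):.1f} days")
```

So the 2060 estimate corresponds to roughly 755 million operations per second, halving the work to 2^50 halves the time, and a card that is 50% faster needs about 23 days, in line with the figures in the post.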