Topic: Solving ECDLP with Kangaroos: Part 1 + 2 + RCKangaroo - page 4

newbie
Activity: 7
Merit: 0
That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware :)
I'll be honest, your kangaroo finds the key faster than mine or JLP's. Yes, the displayed speed is lower, but in the end it finds the key much faster.
Works even on a 1660 Super (~600 MKeys/s).
Thanks for sharing.

You can improve it in many ways.
For example, since L2 is useless for old cards, disable setting the persistent part of L2 and set
#define PNT_GROUP_CNT      48
and change these lines in KernelB:

Code:
//calc original kang_ind
u32 tind = (THREAD_X + gr_ind2 * BLOCK_SIZE); //0..3071
u32 warp_ind = tind / (32 * PNT_GROUP_CNT / 2); // 0..7
u32 thr_ind = (tind / 4) % 32; //index in warp 0..31
u32 g8_ind = (tind % (32 * PNT_GROUP_CNT / 2)) / 128; // 0..2
u32 gr_ind = 2 * (tind % 4); // 0, 2, 4, 6

May I ask why my 4060 Ti graphics card only reaches a speed of just over 2000 MKeys/s?
CUDA devices: 1, CUDA driver/runtime: 12.6/12.5
GPU 0: NVIDIA GeForce RTX 4060 Ti, 16.00 GB, 34 CUs, cap 8.9, PCI 1, L2 size: 32768 KB
Total GPUs for work: 1
Solving point: Range 76 bits, DP 16, start...
SOTA method, estimated ops: 2^38.202, RAM for DPs: 0.367 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 23.090.
GPU 0: allocated 1187 MB, 208896 kangaroos.
GPUs started...
MAIN: Speed: 2332 MKeys/s, Err: 0, DPs: 345K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
MAIN: Speed: 2320 MKeys/s, Err: 0, DPs: 704K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m

Do you expect better speed? Why? A 4090 has 128 CUs, a 4060 Ti only 34.

Hello, can you tell me in which file the L2 setup can be found and what has to be deactivated?
Thank you

I found the following in GpuKang.cpp.
Is this the right place?
Quote
//allocate gpu mem
   //L2   
   int L2size = KangCnt * (3 * 32);
   total_mem += L2size;
   err = cudaMalloc((void**)&Kparams.L2, L2size);
   if (err != cudaSuccess)
   {
      printf("GPU %d, Allocate L2 memory failed: %s\n", CudaIndex, cudaGetErrorString(err));
      return false;
   }
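As far as I can tell from the CUDA documentation, the "persistent part of L2" is normally configured with cudaDeviceSetLimit and a stream access-policy window rather than with this cudaMalloc (which just allocates a device buffer that happens to be named L2), so I guess the place to deactivate is wherever calls like the following appear. This is only a generic sketch of the CUDA pattern, not RCKangaroo's actual code:
Code:
// Generic CUDA pattern for reserving a persistent slice of L2 cache
// (illustrative only; search the RCKangaroo sources for these calls).
// L2 persistence requires compute capability 8.0+ (Ampere/Ada), so on
// older cards it should simply be skipped.
#include <cuda_runtime.h>
#include <cstddef>

void set_persistent_l2(void* buf, size_t bytes, cudaStream_t stream)
{
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buf;    // buffer to keep resident in L2
    attr.accessPolicyWindow.num_bytes = bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}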
?
Activity: -
Merit: -
That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware :)
I'll be honest, your kangaroo finds the key faster than mine or JLP's. Yes, the displayed speed is lower, but in the end it finds the key much faster.
Works even on a 1660 Super (~600 MKeys/s).
Thanks for sharing.

You can improve it in many ways.
For example, since L2 is useless for old cards, disable setting the persistent part of L2 and set
#define PNT_GROUP_CNT      32
and change these lines in KernelB:

Code:
//calc original kang_ind
u32 tind = (THREAD_X + gr_ind2 * BLOCK_SIZE); //0..3071
u32 warp_ind = tind / (32 * PNT_GROUP_CNT / 2); // 0..7
u32 thr_ind = (tind / 4) % 32; //index in warp 0..31
u32 g8_ind = (tind % (32 * PNT_GROUP_CNT / 2)) / 128; // 0..2
u32 gr_ind = 2 * (tind % 4); // 0, 2, 4, 6

May I ask why my 4060 Ti graphics card only reaches a speed of just over 2000 MKeys/s?
CUDA devices: 1, CUDA driver/runtime: 12.6/12.5
GPU 0: NVIDIA GeForce RTX 4060 Ti, 16.00 GB, 34 CUs, cap 8.9, PCI 1, L2 size: 32768 KB
Total GPUs for work: 1
Solving point: Range 76 bits, DP 16, start...
SOTA method, estimated ops: 2^38.202, RAM for DPs: 0.367 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 23.090.
GPU 0: allocated 1187 MB, 208896 kangaroos.
GPUs started...
MAIN: Speed: 2332 MKeys/s, Err: 0, DPs: 345K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
MAIN: Speed: 2320 MKeys/s, Err: 0, DPs: 704K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m

Do you expect better speed? Why? A 4090 has 128 CUs, a 4060 Ti only 34.
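For a rough sanity check, if the speed scaled linearly with the number of CUs (which ignores clock and memory differences, so treat it only as an estimate): 2332 MKeys/s x 128 / 34 is about 8.8 GKeys/s, so ~2.3 GKeys/s on a 4060 Ti is roughly what the CU count predicts.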
sr. member
Activity: 652
Merit: 316
Did you make a mistake?
I use Visual Studio. Oh, sorry, it is in the file RCKangaroo.cpp, line 99 (I changed the previous post).
Code:
if (deviceProp.major < 6)
{
printf("GPU %d - not supported, skip\r\n", i);
continue;
}
I also made other changes, which is probably why the line numbers don't match. But you can easily find the lines you need, because I didn't add anything new, I only changed existing lines.
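For reference, the deviceProp in that check is the host-side cudaDeviceProp struct filled by cudaGetDeviceProperties() during device enumeration, which is why the same line cannot be pasted into RCGpuCore.cu (see the "identifier deviceProp is undefined" error further down). A minimal sketch of the surrounding host-side pattern, with illustrative names, not the exact RCKangaroo code:
Code:
// Minimal host-side enumeration sketch (illustrative, not the exact
// RCKangaroo code): deviceProp only exists where cudaGetDeviceProperties
// is called, i.e. in the .cpp host code, not inside the CUDA kernels.
#include <cuda_runtime.h>
#include <cstdio>

int count_usable_gpus()
{
    int cnt = 0, usable = 0;
    cudaGetDeviceCount(&cnt);
    for (int i = 0; i < cnt; i++)
    {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);
        if (deviceProp.major < 6)   // the relaxed check discussed above
        {
            printf("GPU %d - not supported, skip\r\n", i);
            continue;
        }
        usable++;
    }
    return usable;
}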
newbie
Activity: 7
Merit: 0
@Etar, how did you get your GTX 1660 working? Can you list all the changes and show them? Many thanks.
file: RCGpuCore.cu
line 285: u64* table = LDS + 8 * JMP_CNT + 16 * THREAD_X;
Line 99: if (deviceProp.major < 6)
file: defs.h
#define LDS_SIZE_A         (64 * 1024)
#define LDS_SIZE_B         (64 * 1024)
#define LDS_SIZE_C         (64 * 1024)
#define JMP_CNT            512
file: RCKangaroo.vcxproj
line 118: compute_75,sm_75;compute_75,sm_75
line 141: compute_75,sm_75;compute_75,sm_75

Code:
CUDA devices: 1, CUDA driver/runtime: 12.6/12.1
GPU 0: NVIDIA GeForce GTX 1660 SUPER, 6.00 GB, 22 CUs, cap 7.5, PCI 1, L2 size: 1536 KB
Total GPUs for work: 1

MAIN MODE

Solving public key
X: 29C4574A4FD8C810B7E42A4B398882B381BCD85E40C6883712912D167C83E73A
Y: 0E02C3AFD79913AB0961C95F12498F36A72FFA35C93AF27CEE30010FA6B51C53
Offset: 0000000000000000000000000000000000000000001000000000000000000000

Solving point: Range 84 bits, DP 16, start...
SOTA method, estimated ops: 2^42.202, RAM for DPs: 3.062 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 570.958.
GPU 0: allocated 772 MB, 135168 kangaroos.
GPUs started...
MAIN: Speed: 599 MKeys/s, Err: 0, DPs: 88K/77175K, Time: 0d:00h:00m:10s, Est: 0d:02h:20m:43s

file: RCGpuCore.cu
Line 99 looks like this
SubModP(tmp, x, jmp_x);

If you replace it with
if (deviceProp.major < 6)

This error occurs when building:
RCGpuCore.cu(99): error: identifier "deviceProp" is undefined
     if (deviceProp.major < 6)
         ^

1 error detected in the compilation of "RCGpuCore.cu".
make: *** [Makefile:26: RCGpuCore.o] Error 2

Did you make a mistake?

Thank you for your effort
newbie
Activity: 2
Merit: 0
How can I use it with an RTX 2070 Super, please? I only have a 2070 and I am very interested in testing your work. I tried modifying the parameter settings but it failed.

As far as I remember these cards have only 64 KB of shared memory.
Set JMP_CNT to 512 and change 17 to 16 in this line in KernelB:
u64* table = LDS + 8 * JMP_CNT + 17 * THREAD_X;
and recalculate the LDS_SIZE_ constants.
I think that's enough, though maybe I forgot something...
The main issue is not compiling but optimization: my code is not written for 20xx and 30xx cards, so it won't reach good speed there.
That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware :)
Thank you very much. Following your suggestion, I modified the LDS_SIZE_ constants and it now runs perfectly on my 2070. Thanks again for your excellent work. In my comparison tests it finds the private key faster than JLP's code. Next, I will try to understand your code and optimize it with the help of Claude 3.5 to increase the speed. Maybe you have more good suggestions.
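For anyone repeating this on another 64 KB card: judging from the offsets in that KernelB line, the shared-memory layout is 8 u64 per jump entry times JMP_CNT, plus 16 u64 of per-thread scratch, so "recalculate the LDS_SIZE_ constants" is simple arithmetic. A minimal sketch of the check, assuming u64 entries and a 256-thread block (the block size is my assumption, check the sources for the real value):
Code:
// Back-of-the-envelope check of the 64 KB shared-memory budget.
// Assumptions (not taken from the sources): u64 table entries, 256 threads/block.
#include <cstddef>
#include <cstdint>

constexpr std::size_t JMP_CNT   = 512;  // reduced jump table
constexpr std::size_t BLOCK_THR = 256;  // assumed threads per block
constexpr std::size_t PER_THR   = 16;   // the "17 -> 16" change

constexpr std::size_t LDS_BYTES =
    (8 * JMP_CNT + PER_THR * BLOCK_THR) * sizeof(std::uint64_t);

static_assert(LDS_BYTES == 64 * 1024, "fits exactly into 64 KB of shared memory");

int main() { return 0; }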
newbie
Activity: 6
Merit: 0
How can I use it with an RTX 2070 Super, please? I only have a 2070 and I am very interested in testing your work. I tried modifying the parameter settings but it failed.

As far as I remember these cards have only 64 KB of shared memory.
Set JMP_CNT to 512 and change 17 to 16 in this line in KernelB:
u64* table = LDS + 8 * JMP_CNT + 17 * THREAD_X;
and recalculate the LDS_SIZE_ constants.
I think that's enough, though maybe I forgot something...
The main issue is not compiling but optimization: my code is not written for 20xx and 30xx cards, so it won't reach good speed there.
That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware :)


May I ask why my 4060 Ti graphics card only reaches a speed of just over 2000 MKeys/s?



CUDA devices: 1, CUDA driver/runtime: 12.6/12.5
GPU 0: NVIDIA GeForce RTX 4060 Ti, 16.00 GB, 34 CUs, cap 8.9, PCI 1, L2 size: 32768 KB
Total GPUs for work: 1
Solving point: Range 76 bits, DP 16, start...
SOTA method, estimated ops: 2^38.202, RAM for DPs: 0.367 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 23.090.
GPU 0: allocated 1187 MB, 208896 kangaroos.
GPUs started...
MAIN: Speed: 2332 MKeys/s, Err: 0, DPs: 345K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
MAIN: Speed: 2320 MKeys/s, Err: 0, DPs: 704K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
sr. member
Activity: 652
Merit: 316
@Etar, how did you get your GTX 1660 working? Can you list all the changes and show them? Many thanks.
file: RCGpuCore.cu
line 285: u64* table = LDS + 8 * JMP_CNT + 16 * THREAD_X;
file: RCKangaroo.cpp
Line 99: if (deviceProp.major < 6)
file: defs.h
#define LDS_SIZE_A         (64 * 1024)
#define LDS_SIZE_B         (64 * 1024)
#define LDS_SIZE_C         (64 * 1024)
#define JMP_CNT            512
file: RCKangaroo.vcxproj
line 118: compute_75,sm_75;compute_75,sm_75
line 141: compute_75,sm_75;compute_75,sm_75

Code:
CUDA devices: 1, CUDA driver/runtime: 12.6/12.1
GPU 0: NVIDIA GeForce GTX 1660 SUPER, 6.00 GB, 22 CUs, cap 7.5, PCI 1, L2 size: 1536 KB
Total GPUs for work: 1

MAIN MODE

Solving public key
X: 29C4574A4FD8C810B7E42A4B398882B381BCD85E40C6883712912D167C83E73A
Y: 0E02C3AFD79913AB0961C95F12498F36A72FFA35C93AF27CEE30010FA6B51C53
Offset: 0000000000000000000000000000000000000000001000000000000000000000

Solving point: Range 84 bits, DP 16, start...
SOTA method, estimated ops: 2^42.202, RAM for DPs: 3.062 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 570.958.
GPU 0: allocated 772 MB, 135168 kangaroos.
GPUs started...
MAIN: Speed: 599 MKeys/s, Err: 0, DPs: 88K/77175K, Time: 0d:00h:00m:10s, Est: 0d:02h:20m:43s
newbie
Activity: 7
Merit: 0
I'll be honest, your kangaroo finds the key faster than mine or JLP's. Yes, the displayed speed is lower, but in the end it finds the key much faster.
Works even on a 1660 Super (~600 MKeys/s).
Thanks for sharing.

@Etar, how did you get your GTX 1660 working? Can you list all the changes and show them? Many thanks.
sr. member
Activity: 652
Merit: 316
That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware :)
I'll be honest, your kangaroo finds the key faster than mine or JLP's. Yes, the displayed speed is lower, but in the end it finds the key much faster.
Works even on a 1660 Super (~600 MKeys/s).
Thanks for sharing.
?
Activity: -
Merit: -
How can I use it with an RTX 2070 Super, please? I only have a 2070 and I am very interested in testing your work. I tried modifying the parameter settings but it failed.

As far as I remember these cards have only 64 KB of shared memory.
Set JMP_CNT to 512 and change 17 to 16 in this line in KernelB:
u64* table = LDS + 8 * JMP_CNT + 17 * THREAD_X;
and recalculate the LDS_SIZE_ constants.
I think that's enough, though maybe I forgot something...
The main issue is not compiling but optimization: my code is not written for 20xx and 30xx cards, so it won't reach good speed there.
That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware :)
newbie
Activity: 2
Merit: 0
How can I use it with an RTX 2070 Super, please? I only have a 2070 and I am very interested in testing your work. I tried modifying the parameter settings but it failed.
copper member
Activity: 5
Merit: 0
This is my version of the Makefile:

Code:
PROJECT = RCKangaroo
CC = g++
NVCC = nvcc
CFLAGS = -Wall -O2
INCLUDES = -I/usr/local/cuda/include 
LIBDIRS = -L/usr/local/cuda/lib64   
LIBS_CUDA = -lcudart               

# Source files
CPP_FILES = $(wildcard *.cpp)
CU_FILES = $(wildcard *.cu)
OBJECTS = $(CPP_FILES:.cpp=.o) $(CU_FILES:.cu=.o)

all: $(PROJECT)

$(PROJECT): $(OBJECTS)
	$(NVCC) $(OBJECTS) -o $(PROJECT) $(LIBDIRS) $(LIBS_CUDA)

%.o: %.cpp
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

%.o: %.cu
	$(NVCC) -O2 $(INCLUDES) -c $< -o $@

clean:
	rm -f $(PROJECT) $(OBJECTS)
newbie
Activity: 22
Merit: 1
Ok, it's better to support 30xx cards even if RCKangaroo is not optimized for them, so I released v1.1.
3090 shows only about 3 GKeys/s; it could really be faster.

Thank you so much, any advice for Linux users?

A Linux exe is included as well, or you can compile it yourself.

Really? "3090 shows about 3GKeys/sec only" this is a record man, I can believe! I just can say Thanks!


Do you have a Makefile compatible with Linux?


Sure

Code:
CC = g++
NVCC = nvcc
CFLAGS = -c -O3 -g  # Added optimization and debugging
LDFLAGS = -L/usr/local/cuda/lib64 -lcudart -I/usr/local/cuda/include
ARCH_FLAGS = -arch=sm_86  # Explicitly specify the architecture for RTX 3060
OBJ = RCGpuCore.o Ec.o GpuKang.o RCKangaroo.o utils.o
TARGET = RCKangaroo

all: $(TARGET)

$(TARGET): $(OBJ)
	$(CC) -o $(TARGET) $(OBJ) $(LDFLAGS)

RCGpuCore.o: RCGpuCore.cu
	$(NVCC) $(CFLAGS) $(ARCH_FLAGS) RCGpuCore.cu

Ec.o: Ec.cpp
	$(CC) $(CFLAGS) Ec.cpp

GpuKang.o: GpuKang.cpp
	$(CC) $(CFLAGS) GpuKang.cpp

RCKangaroo.o: RCKangaroo.cpp
	$(CC) $(CFLAGS) RCKangaroo.cpp

utils.o: utils.cpp
	$(CC) $(CFLAGS) utils.cpp

clean:
	rm -f *.o $(TARGET)
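Note that -arch=sm_86 targets Ampere (RTX 30xx); for other generations the flag changes accordingly, e.g. sm_89 for RTX 40xx or sm_75 for GTX 16xx / RTX 20xx, which is the Linux counterpart of the compute_75,sm_75 change in the .vcxproj mentioned above.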
copper member
Activity: 5
Merit: 0
Ok, it's better to support 30xx cards even if RCKangaroo is not optimized for them, so I released v1.1.
3090 shows only about 3 GKeys/s; it could really be faster.

Thank you so much, any advice for Linux users?

A Linux exe is included as well, or you can compile it yourself.

Really? "3090 shows about 3GKeys/sec only" this is a record man, I can believe! I just can say Thanks!


Do you have a Makefile compatible with Linux?
newbie
Activity: 22
Merit: 1
Ok, it's better to support 30xx cards even if RCKangaroo is not optimized for them, so I released v1.1.
3090 shows only about 3 GKeys/s; it could really be faster.

Thank you so much, any advice for Linux users?

A Linux exe is included as well, or you can compile it yourself.

Really? "3090 shows about 3GKeys/sec only" this is a record man, I can believe! I just can say Thanks!
?
Activity: -
Merit: -
Ok, it's better to support 30xx cards even if RCKangaroo is not optimized for them, so I released v1.1.
3090 shows only about 3 GKeys/s; it could really be faster.

Thank you so much, any advice for Linux users?

A Linux exe is included as well, or you can compile it yourself.
newbie
Activity: 22
Merit: 1
As promised, here is the third and final part: RCKangaroo, Windows/Linux, open source:
https://github.com/RetiredC/RCKangaroo
This software demonstrates a fast implementation of the SOTA method and advanced loop handling on RTX 40xx cards.
Note that I have not included all possible optimizations because it's public code and I want to keep it as simple/readable as possible. Anyway, it's fast enough to demonstrate the advantage and you can improve it further if you have enough skills.

Thank you so much, any advice for Linux users?
newbie
Activity: 1
Merit: 0
Running puzzle #85 on an RTX 4060:

Quote
CUDA devices: 1, CUDA driver/runtime: 12.7/12.6
GPU 0: NVIDIA GeForce RTX 4060 Laptop GPU, 8.00 GB, 24 CUs, cap 8.9, PCI 1, L2 size: 32768 KB
Total GPUs for work: 1

MAIN MODE

Solving public key
X: 29C4574A4FD8C810B7E42A4B398882B381BCD85E40C6883712912D167C83E73A
Y: 0E02C3AFD79913AB0961C95F12498F36A72FFA35C93AF27CEE30010FA6B51C53
Offset: 0000000000000000000000000000000000000000001000000000000000000000

Solving point: Range 84 bits, DP 16, start...
SOTA method, estimated ops: 2^42.202, RAM for DPs: 3.062 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 523.378.
GPU 0: allocated 841 MB, 147456 kangaroos.
GPUs started...
MAIN: Speed: 1180 MKeys/s, Err: 0, DPs: 174K/77175K, Time: 0d:00h:00m, Est: 0d:01h:11m
...
MAIN: Speed: 1132 MKeys/s, Err: 0, DPs: 154839K/77175K, Time: 0d:02h:28m, Est: 0d:01h:14m
Stopping work ...
Point solved, K: 2.310 (with DP and GPU overheads)


PRIVATE KEY: 00000000000000000000000000000000000000000011720C4F018D51B8CEBBA8

Comparing the number of operations with JLP's Kangaroo:

Quote
Kangaroo v2.1
Start:1000000000000000000000
Stop :1FFFFFFFFFFFFFFFFFFFFF
Keys :1
Number of CPU thread: 0
Range width: 2^84
Jump Avg distance: 2^42.03
Number of kangaroos: 2^23.32
Suggested DP: 16
Expected operations: 2^43.12
Expected RAM: 6347.6MB
DP size: 16 [0xFFFF000000000000]

Quick comparison: with the same DP it uses half the RAM and fewer operations.
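For a compact comparison, both estimates can be converted to the K factor (expected operations divided by the square root of the range size). A minimal sketch using only the numbers printed in the two logs above:
Code:
// K = expected ops / sqrt(range size), computed from the two estimates above.
#include <cmath>
#include <cstdio>

int main()
{
    const double range_bits    = 84.0;
    const double sota_ops_log2 = 42.202;  // RCKangaroo estimate
    const double jlp_ops_log2  = 43.12;   // JLP Kangaroo estimate

    const double k_sota = std::pow(2.0, sota_ops_log2 - range_bits / 2.0);
    const double k_jlp  = std::pow(2.0, jlp_ops_log2  - range_bits / 2.0);

    std::printf("K: SOTA ~%.2f, JLP ~%.2f\n", k_sota, k_jlp);  // roughly 1.15 vs 2.17
    return 0;
}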
?
Activity: -
Merit: -
Some explanations about other GPUs support:
1. I have zero interest in old cards (same for AMD cards) so I don't have them for development/tests and don't support them.
2. You can easily enable support for older NVIDIA cards and it will work, but my code is designed for the latest generation; for previous generations it's not optimal and the speed is not the best, which is why I disabled them.
newbie
Activity: 1
Merit: 0
Thanks for sharing your repositories and source code.

Great job!