But feel free to modify/optimize the sources for your hardware.
It works even on a 1660 Super (~600 MKeys/s).
Thanks for sharing.
There are many ways to improve it. For example, since persistent L2 is useless for older cards, disable setting the persistent part of L2, set
#define PNT_GROUP_CNT 48
and change these lines in KernelB:
u32 tind = (THREAD_X + gr_ind2 * BLOCK_SIZE); //0..3071
u32 warp_ind = tind / (32 * PNT_GROUP_CNT / 2); // 0..7
u32 thr_ind = (tind / 4) % 32; //index in warp 0..31
u32 g8_ind = (tind % (32 * PNT_GROUP_CNT / 2)) / 128; // 0..2
u32 gr_ind = 2 * (tind % 4); // 0, 2, 4, 6
CUDA devices: 1, CUDA driver/runtime: 12.6/12.5
GPU 0: NVIDIA GeForce RTX 4060 Ti, 16.00 GB, 34 CUs, cap 8.9, PCI 1, L2 size: 32768 KB
Total GPUs for work: 1
Solving point: Range 76 bits, DP 16, start...
SOTA method, estimated ops: 2^38.202, RAM for DPs: 0.367 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 23.090.
GPU 0: allocated 1187 MB, 208896 kangaroos.
GPUs started...
MAIN: Speed: 2332 MKeys/s, Err: 0, DPs: 345K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
MAIN: Speed: 2320 MKeys/s, Err: 0, DPs: 704K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
Do you expect better speed? Why? The 4090 has 128 CUs; the 4060 Ti has only 34.
Hello, can you tell me which file the L2 code is in and what has to be deactivated?
Thank you
I found this in GpuKang.cpp. Is this the right place?
//L2
int L2size = KangCnt * (3 * 32);
total_mem += L2size;
err = cudaMalloc((void**)&Kparams.L2, L2size);
if (err != cudaSuccess)
{
	printf("GPU %d, Allocate L2 memory failed: %s\n", CudaIndex, cudaGetErrorString(err));
	return false;
}
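Note that the snippet above only allocates a device buffer that happens to be named L2; it is not the persistent L2 cache carve-out itself. On architectures that support persistent L2 (compute capability 8.0+), the carve-out is configured through the CUDA runtime, so disabling it would look roughly like the sketch below. This is a generic CUDA illustration with a hypothetical helper name, not the exact code from this project:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: release any persistent L2 carve-out so the whole
// L2 behaves as a normal cache (useful on cards where persistence is
// unsupported or brings no benefit).
static bool DisablePersistentL2(int device)
{
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess)
        return false;

    // Shrink the persisting-L2 window to zero bytes...
    err = cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
    if (err != cudaSuccess)
    {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    // ...and drop any cache lines already marked as persisting.
    cudaCtxResetPersistingL2Cache();
    return true;
}
```

This is device configuration that only runs on a machine with a CUDA-capable GPU, so treat it as a starting point and check the return codes on your hardware.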