The lyra2v2 kernel has been split into three passes instead of one, I guess to reduce register pressure.
I just ran a comparison on the 750 Ti against sp-mod private #6 (-i 18).
My cubehash, blakeKeccak256 and bmw kernels are all faster and skein is about even, but my lyra2v2 kernel is slower: the three split passes together take about 44% less time than my single kernel.
sp-mod private: 4675
Lyra2v2-Nicehash: 6100 (30% faster)
Not bad. I think the work was done by djm34.
Lyra2v2-Nicehash:
==3632== Profiling application: ccminer_sp_lyra2v2.exe -a lyra2v2 -i 18
==3632== Profiling result:
Time(%) Time Calls Avg Min Max Name
46.87% 2.75600s 143 19.273ms 19.181ms 21.679ms lyra2v2_gpu_hash_32_2(unsigned int, unsigned int, __int64*)
35.42% 2.08255s 286 7.2817ms 7.2457ms 8.2117ms cubehash256_gpu_hash_32(unsigned int, unsigned int, uint2*)
6.10% 358.73ms 143 2.5086ms 2.4916ms 2.8360ms blakeKeccak256_gpu_hash_80(unsigned int, unsigned int, unsigned int*)
4.33% 254.58ms 143 1.7803ms 1.7564ms 2.0057ms skein256_gpu_hash_32(unsigned int, unsigned int, __int64*)
3.66% 215.44ms 143 1.5066ms 1.4988ms 1.6974ms lyra2v2_gpu_hash_32_1(unsigned int, unsigned int, uint2*)
1.89% 111.38ms 143 778.90us 773.77us 877.74us lyra2v2_gpu_hash_32_3(unsigned int, unsigned int, uint2*)
1.71% 100.41ms 143 702.18us 698.09us 792.84us bmw256_gpu_hash_32(unsigned int, unsigned int, uint2*, unsigned int*, unsigned int)
0.01% 703.33us 143 4.9180us 4.5440us 7.1050us [CUDA memcpy DtoH]
0.01% 470.55us 143 3.2900us 3.1040us 7.6490us [CUDA memset]
0.00% 11.744us 7 1.6770us 1.3120us 3.7440us [CUDA memcpy HtoD]
sp-mod private #6:
==820== Profiling application: ccminer.exe -a lyra2v2 --benchmark -i 18
==820== Profiling result:
Time(%) Time Calls Avg Min Max Name
68.70% 5.65170s 147 38.447ms 38.297ms 43.242ms lyra2v2_gpu_hash_32(unsigned int, unsigned int, uint2*)
22.94% 1.88714s 294 6.4188ms 6.4047ms 7.2165ms cubehash256_gpu_hash_32(unsigned int, unsigned int, uint2*)
3.96% 325.75ms 147 2.2160ms 2.2060ms 2.4968ms blakeKeccak256_gpu_hash_80(unsigned int, unsigned int, unsigned int*)
3.21% 263.95ms 147 1.7956ms 1.7774ms 2.0206ms skein256_gpu_hash_32(unsigned int, unsigned int, __int64*)
1.18% 97.455ms 147 662.96us 659.58us 745.59us bmw256_gpu_hash_32(unsigned int, unsigned int, uint2*, unsigned int*, unsigned int)
0.00% 385.64us 147 2.6230us 2.5270us 3.7110us [CUDA memset]
0.00% 260.56us 147 1.7720us 1.5680us 3.1360us [CUDA memcpy DtoH]
0.00% 6.2390us 7 891ns 704ns 1.3440us [CUDA memcpy HtoD]
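For anyone who wants to check the percentages, here is a rough sketch that redoes the arithmetic from the Avg columns of the two profiles and the hashrates posted above. The variable names are mine, and I am assuming each of the three split passes runs once per lyra2v2 round (both profiles show one call per launch for the lyra kernels), so their averages can simply be added.

```python
# Average time per launch in ms, copied from the Avg column of the nvprof runs.
# Assumption: the three split passes together do the work of the single kernel.
nicehash_lyra_passes = [1.5066, 19.273, 0.77890]  # lyra2v2_gpu_hash_32_1, _2, _3
spmod_lyra = 38.447                                # lyra2v2_gpu_hash_32

nicehash_lyra = sum(nicehash_lyra_passes)
extra_time = spmod_lyra / nicehash_lyra - 1   # how much longer the single kernel runs
time_saved = 1 - nicehash_lyra / spmod_lyra   # how much time the split version saves

# Hashrates as posted (the post does not state the unit).
spmod_rate, nicehash_rate = 4675, 6100
rate_gain = nicehash_rate / spmod_rate - 1

# Remaining kernels: (Nicehash avg ms, sp-mod avg ms).
others = {
    "cubehash256": (7.2817, 6.4188),
    "blakeKeccak256": (2.5086, 2.2160),
    "skein256": (1.7803, 1.7956),
    "bmw256": (0.70218, 0.66296),
}
# Positive value means the sp-mod kernel is the faster one.
spmod_advantage = {name: nh / sp - 1 for name, (nh, sp) in others.items()}

print(f"lyra2v2: split {nicehash_lyra:.3f} ms vs single {spmod_lyra:.3f} ms")
print(f"single kernel takes {extra_time:.0%} longer; split saves {time_saved:.0%}")
print(f"overall hashrate gain: {rate_gain:.0%}")
for name, adv in spmod_advantage.items():
    print(f"{name}: sp-mod faster by {adv:+.1%}")
```

So the split lyra2v2 needs roughly 44% less time (my single kernel runs about 78% longer), which matches the roughly 30% higher overall hashrate once the other kernels are counted in.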