Here is a comparison of the two bins:
==4948== Profiling application: ccminer_cuda65_nosync.exe -a neoscrypt --benchmark
==4948== Profiling result:
Time(%) Time Calls Avg Min Max Name
26.59% 18.4119s 796 23.131ms 15.087ms 47.744ms neoscrypt_gpu_hash_chacha1_stream1(int, unsigned int)
26.22% 18.1600s 796 22.814ms 14.943ms 48.441ms neoscrypt_gpu_hash_salsa1_stream1(int, unsigned int)
19.68% 13.6279s 796 17.121ms 13.557ms 24.808ms neoscrypt_gpu_hash_salsa2_stream1(int, unsigned int)
18.19% 12.5939s 796 15.822ms 12.835ms 23.530ms neoscrypt_gpu_hash_chacha2_stream1(int, unsigned int)
4.66% 3.22608s 796 4.0529ms 3.3128ms 5.7348ms neoscrypt_gpu_hash_ending(int, int, unsigned int, unsigned int*)
4.66% 3.22443s 796 4.0508ms 3.3958ms 6.4411ms neoscrypt_gpu_hash_start(int, int, unsigned int)
==5848== Profiling application: ccminerneo.exe -a neoscrypt --benchmark
==5848== Profiling result:
Time(%) Time Calls Avg Min Max Name
25.69% 26.1063s 1164 22.428ms 15.408ms 58.151ms neoscrypt_gpu_hash_salsa1_stream1(int, unsigned int)
25.62% 26.0341s 1164 22.366ms 15.171ms 55.874ms neoscrypt_gpu_hash_chacha1_stream1(int, unsigned int)
19.71% 20.0311s 1164 17.209ms 13.479ms 29.562ms neoscrypt_gpu_hash_chacha2_stream1(int, unsigned int)
19.43% 19.7487s 1164 16.966ms 13.272ms 31.282ms neoscrypt_gpu_hash_salsa2_stream1(int, unsigned int)
4.80% 4.88065s 1164 4.1930ms 3.2593ms 6.5728ms neoscrypt_gpu_hash_ending(int, int, unsigned int, unsigned int*)
4.74% 4.82073s 1164 4.1415ms 3.3896ms 7.0174ms neoscrypt_gpu_hash_start(int, int, unsigned int)
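A rough way to read the two tables side by side is to sum the Avg column of the six kernels, which approximates the GPU time for one pass through the kernel sequence. The sketch below hard-codes the averages from the two results above. The variable names are mine, and it assumes each kernel runs once per pass (the identical call counts within each table suggest this, but it is an assumption). The averages also mix all five cards in the rig, so this is only a coarse comparison.

```python
# Avg column (ms) copied from the two nvprof kernel tables above.
# Keys are shortened kernel names, chosen here for illustration only.
nosync_avg_ms = {
    "chacha1": 23.131,
    "salsa1": 22.814,
    "salsa2": 17.121,
    "chacha2": 15.822,
    "ending": 4.0529,
    "start": 4.0508,
}
neo_avg_ms = {
    "salsa1": 22.428,
    "chacha1": 22.366,
    "chacha2": 17.209,
    "salsa2": 16.966,
    "ending": 4.1930,
    "start": 4.1415,
}


def total_ms(avg_ms):
    """Sum of per-call averages: approximate GPU time for one pass."""
    return sum(avg_ms.values())


nosync_total = total_ms(nosync_avg_ms)
neo_total = total_ms(neo_avg_ms)
diff_pct = (neo_total - nosync_total) / nosync_total * 100.0

print(f"nosync: {nosync_total:.3f} ms per pass")
print(f"neo:    {neo_total:.3f} ms per pass")
print(f"difference: {diff_pct:+.2f}%")

# Per-kernel difference (neo minus nosync), in ms.
for name in nosync_avg_ms:
    delta = neo_avg_ms[name] - nosync_avg_ms[name]
    print(f"{name:8s} {delta:+.3f} ms")
```

On these numbers the two sums land within about half a percent of each other. The call counts differ (796 vs 1164), presumably because the two runs were stopped after different lengths of time, so the Time column is not directly comparable between the builds.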
Here is the full run:
C:\Boss\>nvprof ccminer_cuda65_nosync.exe -a neoscrypt --benchmark
Compiled with Visual C++ 18 using Nvidia CUDA Toolkit 6.5
Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp, klaust, tpruvot, tsiv and pallas.
==4948== NVPROF is profiling process 4948, command: ccminer_cuda65_nosync.exe -a neoscrypt --benchmark
[2016-01-24 22:56:32] NVAPI GPU monitoring enabled.
[2016-01-24 22:56:32] Binding thread 0 to cpu 0 (mask 1)
[2016-01-24 22:56:32] Binding thread 2 to cpu 0 (mask 4)
[2016-01-24 22:56:32] Binding thread 3 to cpu 1 (mask 8)
[2016-01-24 22:56:32] Binding thread 1 to cpu 1 (mask 2)
[2016-01-24 22:56:32] 5 miner threads started, using 'neoscrypt' algorithm.
[2016-01-24 22:56:32] Binding thread 4 to cpu 0 (mask 10)
[2016-01-24 22:56:34] GPU #2 Found nounce 6667c303
[2016-01-24 22:56:34] GPU #0 Found nounce 000bbb86
[2016-01-24 22:56:35] CTRL_C_EVENT received, exiting once miner jobs complete. Ctrl+C again to abort miner jobs
[2016-01-24 22:56:36] CTRL_C_EVENT received, aborting miner jobs
[2016-01-24 22:56:38] GPU #4 Found nounce cce9ae79
[2016-01-24 22:56:43] GPU #2 Found nounce 6680eb21
[2016-01-24 22:56:43] GPU #2: GeForce GTX 750 Ti, 177
[2016-01-24 22:56:46] GPU #3 Found nounce 99bf3915
[2016-01-24 22:56:49] GPU #0 Found nounce 009955c5
[2016-01-24 22:56:49] GPU #0: GeForce GTX 970, 656
[2016-01-24 22:56:54] GPU #1 Found nounce 33baba0f
==4948== Profiling application: ccminer_cuda65_nosync.exe -a neoscrypt --benchmark
==4948== Profiling result:
Time(%) Time Calls Avg Min Max Name
26.59% 18.4119s 796 23.131ms 15.087ms 47.744ms neoscrypt_gpu_hash_chacha1_stream1(int, unsigned int)
26.22% 18.1600s 796 22.814ms 14.943ms 48.441ms neoscrypt_gpu_hash_salsa1_stream1(int, unsigned int)
19.68% 13.6279s 796 17.121ms 13.557ms 24.808ms neoscrypt_gpu_hash_salsa2_stream1(int, unsigned int)
18.19% 12.5939s 796 15.822ms 12.835ms 23.530ms neoscrypt_gpu_hash_chacha2_stream1(int, unsigned int)
4.66% 3.22608s 796 4.0529ms 3.3128ms 5.7348ms neoscrypt_gpu_hash_ending(int, int, unsigned int, unsigned int*)
4.66% 3.22443s 796 4.0508ms 3.3958ms 6.4411ms neoscrypt_gpu_hash_start(int, int, unsigned int)
0.00% 2.7009ms 796 3.3930us 2.8480us 7.6800us [CUDA memset]
0.00% 2.4198ms 796 3.0390us 1.9200us 5.4080us [CUDA memcpy DtoH]
0.00% 100.13us 58 1.7260us 800ns 4.1280us [CUDA memcpy HtoD]
==4948== API calls:
Time(%) Time Calls Avg Min Max Name
49.67% 11.0177s 796 13.841ms 1.3506ms 140.87ms cudaStreamSynchronize
23.54% 5.22234s 5 1.04447s 958.40ms 1.16171s cudaDeviceSetCacheConfig
17.14% 3.80248s 798 4.7650ms 147.49us 390.87ms cudaDeviceSynchronize
4.61% 1.02323s 35 29.235ms 474.01us 127.75ms cudaMalloc
1.36% 302.78ms 796 380.37us 123.16us 39.274ms cudaMemcpy
1.12% 247.50ms 17 14.559ms 6.7122ms 57.145ms cudaGetDeviceProperties
0.81% 179.27ms 4776 37.535us 17.486us 1.0408ms cudaLaunch
0.55% 121.59ms 1592 76.375us 38.392us 1.5406ms cudaStreamDestroy
0.39% 86.053ms 1 86.053ms 86.053ms 86.053ms cudaDeviceReset
0.22% 49.228ms 796 61.844us 32.310us 3.9833ms cudaMemset
0.21% 47.457ms 796 59.618us 35.351us 506.32us cudaStreamQuery
0.16% 34.620ms 415 83.422us 0ns 3.8434ms cuDeviceGetAttribute
0.12% 26.464ms 1592 16.622us 1.9010us 1.8466ms cudaStreamCreate
0.04% 8.1601ms 5 1.6320ms 1.6087ms 1.7045ms cuDeviceGetName
0.03% 7.6507ms 58 131.91us 34.591us 724.89us cudaMemcpyToSymbol
0.02% 4.2546ms 11940 356ns 0ns 740.10us cudaSetupArgument
0.01% 2.2253ms 4776 465ns 0ns 11.024us cudaConfigureCall
0.00% 42.193us 5 8.4380us 7.6020us 9.1230us cuDeviceTotalMem
0.00% 23.948us 5 4.7890us 4.5620us 4.9420us cudaSetDevice
0.00% 4.5620us 10 456ns 0ns 1.1410us cuDeviceGet
0.00% 1.9000us 2 950ns 760ns 1.1400us cudaDriverGetVersion
0.00% 1.5220us 2 761ns 381ns 1.1410us cuDeviceGetCount
0.00% 761ns 3 253ns 0ns 381ns cudaGetDeviceCount
C:\Boss\>nvprof ccminerneo.exe -a neoscrypt --benchmark
SP-Mod Private #5
Compiled with Visual C++ 18 using Nvidia CUDA Toolkit 7.5
Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp, klaust, tpruvot, tsiv and pallas.
==5848== NVPROF is profiling process 5848, command: ccminer.exe -a neoscrypt --benchmark
[2016-01-24 22:57:29] NVAPI GPU monitoring enabled.
[2016-01-24 22:57:29] Binding thread 0 to cpu 0 (mask 1)
[2016-01-24 22:57:29] Binding thread 1 to cpu 1 (mask 2)
[2016-01-24 22:57:29] Binding thread 3 to cpu 1 (mask 8)
[2016-01-24 22:57:29] Binding thread 2 to cpu 0 (mask 4)
[2016-01-24 22:57:29] 5 miner threads started, using 'neoscrypt' algorithm.
[2016-01-24 22:57:29] Binding thread 4 to cpu 0 (mask 10)
[2016-01-24 22:57:31] GPU #2 Found nounce 6667c303
[2016-01-24 22:57:31] GPU #0 Found nounce 000bbb86
[2016-01-24 22:57:35] GPU #4 Found nounce cce9ae79
[2016-01-24 22:57:35] GPU #4 Found nounce ccee4c85
[2016-01-24 22:57:35] GPU #4: GeForce GTX 960, 387
[2016-01-24 22:57:36] GPU #0 Found nounce 00354a70
[2016-01-24 22:57:36] GPU #0: GeForce GTX 970, 650
[2016-01-24 22:57:36] GPU #0 Found nounce 003764ae
[2016-01-24 22:57:36] GPU #0: GeForce GTX 970, 532
[2016-01-24 22:57:36] GPU #4 Found nounce ccf3909e
[2016-01-24 22:57:36] GPU #4: GeForce GTX 960, 374
[2016-01-24 22:57:36] Total: 590.85 kH/s
[2016-01-24 22:57:38] GPU #3 Found nounce 99b08920
[2016-01-24 22:57:39] GPU #1 Found nounce 336afc90
[2016-01-24 22:57:40] GPU #2 Found nounce 6680eb21
[2016-01-24 22:57:40] GPU #2: GeForce GTX 750 Ti, 172
[2016-01-24 22:57:41] CTRL_C_EVENT received, exiting once miner jobs complete. Ctrl+C again to abort miner jobs
[2016-01-24 22:57:41] CTRL_C_EVENT received, aborting miner jobs
[2016-01-24 22:57:41] CTRL_C_EVENT received, aborting miner jobs
[2016-01-24 22:57:44] GPU #3 Found nounce 99bf3915
[2016-01-24 22:57:44] GPU #3: GeForce GTX 750 Ti, 177
[2016-01-24 22:57:46] GPU #0 Found nounce 009955c5
[2016-01-24 22:57:46] GPU #0: GeForce GTX 970, 655
[2016-01-24 22:57:47] GPU #2 Found nounce 6692f03e
[2016-01-24 22:57:47] GPU #2: GeForce GTX 750 Ti, 170
[2016-01-24 22:57:51] GPU #1 Found nounce 33baba0f
[2016-01-24 22:57:51] GPU #1: GeForce GTX 960, 415
[2016-01-24 22:57:58] CTRL_C_EVENT received, aborting miner jobs
[2016-01-24 22:57:59] CTRL_C_EVENT received, aborting miner jobs
[2016-01-24 22:57:59] GPU #4 Found nounce cd80fddc
[2016-01-24 22:57:59] GPU #4: GeForce GTX 960, 407
[2016-01-24 22:57:59] Total: 1734.83 kH/s
==5848== Profiling application: ccminer.exe -a neoscrypt --benchmark
==5848== Profiling result:
Time(%) Time Calls Avg Min Max Name
25.69% 26.1063s 1164 22.428ms 15.408ms 58.151ms neoscrypt_gpu_hash_salsa1_stream1(int, unsigned int)
25.62% 26.0341s 1164 22.366ms 15.171ms 55.874ms neoscrypt_gpu_hash_chacha1_stream1(int, unsigned int)
19.71% 20.0311s 1164 17.209ms 13.479ms 29.562ms neoscrypt_gpu_hash_chacha2_stream1(int, unsigned int)
19.43% 19.7487s 1164 16.966ms 13.272ms 31.282ms neoscrypt_gpu_hash_salsa2_stream1(int, unsigned int)
4.80% 4.88065s 1164 4.1930ms 3.2593ms 6.5728ms neoscrypt_gpu_hash_ending(int, int, unsigned int, unsigned int*)
4.74% 4.82073s 1164 4.1415ms 3.3896ms 7.0174ms neoscrypt_gpu_hash_start(int, int, unsigned int)
0.00% 3.9227ms 1164 3.3690us 2.8470us 7.8720us [CUDA memset]
0.00% 3.5539ms 1164 3.0530us 1.9190us 5.5690us [CUDA memcpy DtoH]
0.00% 151.10us 90 1.6780us 800ns 4.0960us [CUDA memcpy HtoD]
==5848== API calls:
Time(%) Time Calls Avg Min Max Name
92.54% 91.6359s 2330 39.329ms 88.568us 296.98ms cudaDeviceSynchronize
4.05% 4.01266s 5 802.53ms 686.43ms 1.01414s cudaDeviceSetCacheConfig
1.13% 1.11615s 1164 958.89us 17.106us 6.5145ms cudaStreamSynchronize
1.13% 1.11531s 35 31.866ms 493.78us 129.53ms cudaMalloc
0.33% 323.72ms 25 12.949ms 6.7658ms 43.151ms cudaGetDeviceProperties
0.23% 229.81ms 6984 32.905us 19.006us 1.8063ms cudaLaunch
0.18% 179.37ms 1164 154.10us 98.831us 699.80us cudaMemcpy
0.13% 131.66ms 1164 113.11us 48.655us 27.393ms cudaStreamQuery
0.09% 87.075ms 1 87.075ms 87.075ms 87.075ms cudaDeviceReset
0.08% 76.789ms 2328 32.984us 26.228us 264.56us cudaStreamDestroy
0.04% 39.028ms 1164 33.529us 23.947us 272.93us cudaMemset
0.04% 35.107ms 415 84.595us 0ns 3.7119ms cuDeviceGetAttribute
0.02% 16.680ms 90 185.33us 33.450us 2.3845ms cudaMemcpyToSymbol
0.01% 8.4858ms 5 1.6972ms 1.6140ms 1.8592ms cuDeviceGetName
0.01% 5.9565ms 17460 341ns 0ns 990.59us cudaSetupArgument
0.01% 5.7767ms 2328 2.4810us 760ns 508.60us cudaStreamCreate
0.00% 3.7187ms 6984 532ns 0ns 808.14us cudaConfigureCall
0.00% 52.838us 5 10.567us 7.9820us 17.106us cuDeviceTotalMem
0.00% 38.012us 5 7.6020us 7.2220us 7.9820us cudaSetDevice
0.00% 2.6600us 10 266ns 0ns 380ns cuDeviceGet
0.00% 2.2800us 3 760ns 0ns 1.9000us cudaGetDeviceCount
0.00% 1.9010us 2 950ns 381ns 1.5200us cuDeviceGetCount
0.00% 1.9000us 2 950ns 760ns 1.1400us cudaDriverGetVersion
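The biggest difference between the two runs is in the API calls tables rather than the kernel tables, so it can help to put the synchronize time on a per-pass basis. The sketch below uses the Time and Calls values from the two API tables above and divides by the kernel call count of each run (796 and 1164). Treating that count as the number of passes is my assumption, and the names are only for illustration. The API totals exceed the wall-clock length of the runs, which suggests they are summed over the five miner threads.

```python
# (total seconds, calls) copied from the two "API calls" tables above.
sync_calls = {
    "nosync": {
        "cudaStreamSynchronize": (11.0177, 796),
        "cudaDeviceSynchronize": (3.80248, 798),
    },
    "neo": {
        "cudaDeviceSynchronize": (91.6359, 2330),
        "cudaStreamSynchronize": (1.11615, 1164),
    },
}

# Assumption: one pass per launch of each kernel, so the kernel call
# count from the profiling result is used as the number of passes.
passes = {"nosync": 796, "neo": 1164}


def sync_ms_per_pass(build):
    """Host time spent inside synchronize calls per pass, in ms."""
    total_s = sum(seconds for seconds, _calls in sync_calls[build].values())
    return total_s / passes[build] * 1000.0


nosync_sync_ms = sync_ms_per_pass("nosync")
neo_sync_ms = sync_ms_per_pass("neo")

print(f"nosync: {nosync_sync_ms:.1f} ms in sync calls per pass")
print(f"neo:    {neo_sync_ms:.1f} ms in sync calls per pass")
```

Time spent blocked in a synchronize call is host time waiting for the GPU, so a larger figure here does not by itself mean a lower hashrate. It only shows where each build does its waiting.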
I modded and bugfixed the first version you put out on GitHub. The second version you published (CUDA 6.5 nosync) is faster than your first check-in but doesn't work on the GTX 970.
I will do another mod and gain more.