Linux profile of your repo; indeed a big difference:
sp - before echo (linux x64)
==11174== Profiling result:
Time(%) Time Calls Avg Min Max Name
20.76% 2.87625s 53 54.269ms 54.098ms 55.278ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.83% 2.60877s 54 48.311ms 48.168ms 53.868ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
13.02% 1.80384s 54 33.404ms 32.752ms 37.241ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.04% 1.52931s 53 28.855ms 28.780ms 30.472ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
7.25% 1.00414s 54 18.595ms 18.548ms 20.737ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
5.32% 737.65ms 54 13.660ms 13.589ms 15.234ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
5.03% 697.42ms 54 12.915ms 12.778ms 14.462ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
3.05% 422.23ms 53 7.9665ms 7.8972ms 8.0252ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
2.90% 401.89ms 54 7.4425ms 6.9065ms 8.3138ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
2.90% 401.74ms 54 7.4396ms 7.4077ms 8.2859ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
2.77% 383.50ms 53 7.2358ms 7.1146ms 7.3789ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
2.69% 373.04ms 54 6.9082ms 6.8450ms 7.7322ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
2.55% 353.48ms 54 6.5459ms 6.5278ms 7.2944ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
1.60% 221.22ms 53 4.1741ms 4.1419ms 4.2535ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
sp - 12d436ae1ecdc5e647a6a1576b98c4803510b13f
==25578== Profiling result:
Time(%) Time Calls Avg Min Max Name
20.56% 6.72060s 127 *52.918ms 52.822ms 53.985ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.89% 6.17511s 128 48.243ms 48.147ms 53.860ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
12.65% 4.13517s 128 *32.306ms 32.181ms 36.017ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.29% 3.69123s 128 28.838ms 28.787ms 30.680ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
7.27% 2.37746s 128 18.574ms 18.547ms 20.732ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
5.35% 1.74723s 128 13.650ms 13.589ms 15.257ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
5.05% 1.65134s 128 12.901ms 12.699ms 14.372ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
3.12% 1.01978s 128 7.9670ms 7.9183ms 8.0247ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
2.91% 951.09ms 128 7.4304ms 7.4097ms 8.2771ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
2.88% 941.80ms 128 7.3578ms 6.9981ms 8.3027ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
2.83% 926.04ms 128 7.2347ms 7.0956ms 7.3374ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
2.72% 887.67ms 128 6.9349ms 6.8876ms 7.7173ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
2.56% 836.91ms 128 6.5384ms 6.5282ms 7.2936ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
1.63% 533.83ms 128 4.1705ms 4.1391ms 4.2481ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
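To put a number on the change, here is a rough sketch that compares the Avg column of the two runs for a few kernels. The millisecond values are copied from the profiles above; the helper names `relative_change` and `compare` are made up for illustration and are not part of the miner or of nvprof.

```python
# Avg kernel times in ms, copied from the Avg column of the two nvprof runs.
# Only a few kernels are listed; add more rows the same way.
before = {
    "x11_echo512_gpu_hash_64": 54.269,
    "x11_shavite512_gpu_hash_64": 33.404,
    "quark_groestl512_gpu_hash_64_quad": 48.311,
}
after = {
    "x11_echo512_gpu_hash_64": 52.918,
    "x11_shavite512_gpu_hash_64": 32.306,
    "quark_groestl512_gpu_hash_64_quad": 48.243,
}


def relative_change(before_ms, after_ms):
    """Percent reduction in average kernel time (positive means faster)."""
    return (before_ms - after_ms) / before_ms * 100.0


def compare(before_run, after_run):
    """Map each kernel present in both runs to its percent reduction."""
    return {
        name: relative_change(before_run[name], after_run[name])
        for name in before_run
        if name in after_run
    }


if __name__ == "__main__":
    # Print the kernels with the largest improvement first.
    results = compare(before, after)
    for name, pct in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{name:40s} {pct:6.2f}% lower avg time")
```

Comparing Avg rather than total Time matters here because the two runs have different call counts (53/54 vs 127/128), so the totals are not directly comparable.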
Are you testing on Linux too, or just on Windows?
I'm still trying to get the same gains on Windows... but that takes a lot of time.