Perf is reduced on the 750Ti (Linux). Otherwise, I found a way to improve your groestl change a bit (committed).
EDIT: hmm, in fact not exactly, but... it's hard to compare.
My current version for x11 on a 750Ti / Linux (2800 kH/s):
Time(%)      Time  Calls       Avg       Min       Max  Name
 20.75%  3.64387s     93  39.181ms  39.064ms  41.831ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.75%  3.29341s     94  35.036ms  34.963ms  39.033ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 12.87%  2.26079s     93  24.310ms  24.180ms  27.077ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.11%  1.95157s     93  20.985ms  20.926ms  23.382ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.24%  1.27073s     94  13.518ms  13.483ms  15.056ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.32%  933.86ms     94  9.9347ms  9.8739ms  11.096ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.04%  884.74ms     94  9.4122ms  9.2574ms  10.502ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.08%  540.45ms     94  5.7494ms  5.7279ms  6.3724ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.07%  539.01ms     93  5.7958ms  5.7538ms  5.8993ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.80%  491.53ms     93  5.2852ms  5.1886ms  5.4446ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.76%  484.47ms     94  5.1540ms  5.1358ms  5.7421ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.71%  475.78ms     94  5.0615ms  5.0070ms  5.6117ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.60%  456.75ms     94  4.8591ms  4.8225ms  5.4034ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.61%  283.04ms     93  3.0434ms  3.0159ms  3.3809ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  0.28%  49.941ms     93  537.00us  534.29us  543.03us  cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)
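
Since simd512 is split over several kernels, comparing per-kernel rows between two builds is awkward. A rough helper, not part of the miner, that sums the Time(%) column per algorithm from a pasted profile like the one above could look like this. The function name `share_by_algorithm` is made up here, and it assumes kernel names follow the `prefix_algo_gpu_...` pattern seen in this listing:

```python
import re
from collections import defaultdict

# One profile row: Time(%)  Time  Calls  Avg  Min  Max  kernel_name(args...)
# Assumption: rows look like the nvprof summary pasted above.
ROW = re.compile(r"^\s*([\d.]+)%\s+\S+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\w+)\(")


def share_by_algorithm(profile_text):
    """Sum the Time(%) column per algorithm.

    The algorithm is taken as the second '_'-separated token of the
    kernel name, e.g. 'x11_simd512_gpu_expand_64' -> 'simd512'.
    """
    totals = defaultdict(float)
    for line in profile_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or anything that is not a kernel row
        percent = float(match.group(1))
        kernel = match.group(2)
        parts = kernel.split("_")
        algo = parts[1] if len(parts) > 1 else kernel
        totals[algo] += percent
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage: paste the profile on stdin, largest share printed first.
    shares = share_by_algorithm(sys.stdin.read())
    for algo, percent in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{algo:12s} {percent:6.2f}%")
```

On the profile above, the four simd512 kernels add up to 18.59% (11.11 + 3.07 + 2.80 + 1.61), so simd512 as a whole costs about as much as groestl512 (18.75%).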