It seems the funnel shift is missing in some of the 11 X11 algorithms. (With a ROL, compute 5.0+ devices will get a boost.)
From GitHub (ccminer 1.2):
cuda_x11_luffa512.cu:
#define TWEAK(a0,a1,a2,a3,j)\
a0 = (a0<<(j))|(a0>>(32-j));\
a1 = (a1<<(j))|(a1>>(32-j));\
a2 = (a2<<(j))|(a2>>(32-j));\
a3 = (a3<<(j))|(a3>>(32-j));
#define MIXWORD(a0,a4)\
a4 ^= a0;\
a0 = (a0<<2) | (a0>>(30));\
a0 ^= a4;\
a4 = (a4<<14) | (a4>>(18));\
a4 ^= a0;\
a0 = (a0<<10) | (a0>>(22));\
a0 ^= a4;\
a4 = (a4<<1) | (a4>>(31));
cuda_x11_cubehash512.cu:
#define ROTATEUPWARDS7(a) (((a) << 7) | ((a) >> 25))
#define ROTATEUPWARDS11(a) (((a) << 11) | ((a) >> 21))
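etc.
By rewriting these macros, the shift + shift + or sequence of CUDA instructions can be replaced with a single ROL: one instruction instead of three, so the rotate itself should be about 3 times faster on compute 5.0+ (Maxwell) devices.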
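A rough sketch of what such a rewrite could look like, not taken from ccminer itself. The helper name rotl32 and the HOST_DEVICE macro are made up for illustration. It uses the __funnelshift_l intrinsic when compiled for a device and falls back to the plain shift/or form otherwise, so it also builds as ordinary host C++. The 320 in the guard is my assumption about where the intrinsic is available; raise it to 500 if you only want the Maxwell path mentioned above.

```cpp
#include <cstdint>

// HOST_DEVICE is a hypothetical convenience macro so the same source
// compiles under nvcc and under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__ __forceinline__
#else
#define HOST_DEVICE inline
#endif

// rotl32 is a hypothetical helper: rotate x left by n bits (n taken mod 32).
HOST_DEVICE uint32_t rotl32(uint32_t x, uint32_t n)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 320
    // Funnel shift of the 64-bit value x:x, keeping the upper 32 bits,
    // which is the same as a 32-bit rotate left.
    return __funnelshift_l(x, x, n);
#else
    // Portable fallback, same result as the original macros.
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
#endif
}

// cuda_x11_cubehash512.cu style macros, expressed through the helper.
#define ROTATEUPWARDS7(a)  rotl32((a), 7)
#define ROTATEUPWARDS11(a) rotl32((a), 11)

// cuda_x11_luffa512.cu style MIXWORD, same rotate amounts as the original.
#define MIXWORD(a0,a4)\
    a4 ^= a0;\
    a0 = rotl32(a0, 2);\
    a0 ^= a4;\
    a4 = rotl32(a4, 14);\
    a4 ^= a0;\
    a0 = rotl32(a0, 10);\
    a0 ^= a4;\
    a4 = rotl32(a4, 1);
```

The results should be bit-identical to the old macros, so the hashes do not change; only the generated instructions do. Whether the compiler already does this on its own is worth checking in the SASS output (cuobjdump -sass) before and after.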
http://cudamining.cc/url/releases/member/8
Just click the version and it pops up.