Thanks to those who have donated.
Instead of using "elif", just use simple independent "if" statements and remove the duplicated and triplicated instructions.
Or, better, make an unrolled loop.
That way it's much more compact and easier to debug.
Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.
Do you see the repeated instructions?
Just change the "if" structure and you can remove them ;-)
I.e. if worksize <= 128 you need to do some additional rotates compared to the default (256), and some more on top of those if worksize == 64.
That said, the alternative "for" loop is a much more elegant solution, and the difference in speed is negligible.
What you suggest results in fewer linear memory writes, which usually isn't good. I prefer to avoid loops if possible.
T0_L[lclid] = T0[lclid];
T1_L[lclid] = rotate(T0[lclid], 8UL);
T2_L[lclid] = rotate(T0[lclid], 16UL);
T3_L[lclid] = rotate(T0[lclid], 24UL);
T4_L[lclid] = rotate(T0[lclid], 32UL);
T5_L[lclid] = rotate(T0[lclid], 40UL);
T6_L[lclid] = rotate(T0[lclid], 48UL);
T7_L[lclid] = rotate(T0[lclid], 56UL);
#if (WORKSIZE < 256)
T0_L[lclid + 128] = T0[lclid + 128];
T1_L[lclid + 128] = rotate(T0[lclid + 128], 8UL);
T2_L[lclid + 128] = rotate(T0[lclid + 128], 16UL);
T3_L[lclid + 128] = rotate(T0[lclid + 128], 24UL);
T4_L[lclid + 128] = rotate(T0[lclid + 128], 32UL);
T5_L[lclid + 128] = rotate(T0[lclid + 128], 40UL);
T6_L[lclid + 128] = rotate(T0[lclid + 128], 48UL);
T7_L[lclid + 128] = rotate(T0[lclid + 128], 56UL);
#endif
#if (WORKSIZE < 128)
T0_L[lclid + 64] = T0[lclid + 64];
T0_L[lclid + 192] = T0[lclid + 192];
T1_L[lclid + 64] = rotate(T0[lclid + 64], 8UL);
T1_L[lclid + 192] = rotate(T0[lclid + 192], 8UL);
T2_L[lclid + 64] = rotate(T0[lclid + 64], 16UL);
T2_L[lclid + 192] = rotate(T0[lclid + 192], 16UL);
T3_L[lclid + 64] = rotate(T0[lclid + 64], 24UL);
T3_L[lclid + 192] = rotate(T0[lclid + 192], 24UL);
T4_L[lclid + 64] = rotate(T0[lclid + 64], 32UL);
T4_L[lclid + 192] = rotate(T0[lclid + 192], 32UL);
T5_L[lclid + 64] = rotate(T0[lclid + 64], 40UL);
T5_L[lclid + 192] = rotate(T0[lclid + 192], 40UL);
T6_L[lclid + 64] = rotate(T0[lclid + 64], 48UL);
T6_L[lclid + 192] = rotate(T0[lclid + 192], 48UL);
T7_L[lclid + 64] = rotate(T0[lclid + 64], 56UL);
T7_L[lclid + 192] = rotate(T0[lclid + 192], 56UL);
#endif
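For comparison, here is a rough sketch of the "for" loop variant discussed above, written as plain host-side C rather than as kernel code so it can be compiled and checked anywhere. Everything in it is an assumption made for illustration: rotl64 stands in for OpenCL's rotate(), the eight T*_L tables are folded into one 2D array, the names fill_loop, fill_unrolled and variants_match are made up, the work-group is emulated with an ordinary loop over lclid, and the test data is arbitrary rather than the real T0 table. It only shows that a strided loop touches the same entries as the #if blocks for WORKSIZE 256, 128 and 64; it says nothing about speed or write patterns on real hardware.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 256  /* entries in T0, as implied by the offsets above */

/* Stand-in for OpenCL's rotate() on 64-bit values (rotate left). */
uint64_t rotl64(uint64_t x, unsigned n) {
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

/* Fill entry i of all eight tables: table t holds T0[i] rotated by 8*t bits. */
void fill_entry(const uint64_t *T0, uint64_t TL[8][TABLE_SIZE], unsigned i) {
    for (unsigned t = 0; t < 8; ++t)
        TL[t][i] = rotl64(T0[i], 8u * t);
}

/* Loop variant: each emulated work-item strides through the table by worksize. */
void fill_loop(const uint64_t *T0, uint64_t TL[8][TABLE_SIZE], unsigned worksize) {
    for (unsigned lclid = 0; lclid < worksize; ++lclid)  /* emulates the work-group */
        for (unsigned i = lclid; i < TABLE_SIZE; i += worksize)
            fill_entry(T0, TL, i);
}

/* Same index pattern as the #if blocks above, with runtime ifs instead of #if. */
void fill_unrolled(const uint64_t *T0, uint64_t TL[8][TABLE_SIZE], unsigned worksize) {
    for (unsigned lclid = 0; lclid < worksize; ++lclid) {
        fill_entry(T0, TL, lclid);
        if (worksize < 256)
            fill_entry(T0, TL, lclid + 128);
        if (worksize < 128) {
            fill_entry(T0, TL, lclid + 64);
            fill_entry(T0, TL, lclid + 192);
        }
    }
}

/* Returns 1 if both variants produce identical tables for this worksize. */
int variants_match(unsigned worksize) {
    uint64_t T0[TABLE_SIZE];
    static uint64_t a[8][TABLE_SIZE], b[8][TABLE_SIZE];
    for (unsigned i = 0; i < TABLE_SIZE; ++i)
        T0[i] = 0x0123456789ABCDEFULL * (i + 1);  /* arbitrary data, not the real table */
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    fill_loop(T0, a, worksize);
    fill_unrolled(T0, b, worksize);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling variants_match(64), variants_match(128) and variants_match(256) should each return 1. In a real kernel WORKSIZE is a compile-time constant, so a compiler is free to unroll the strided loop, but whether it does, and in what order the writes end up, depends on the compiler and device.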