I tried your function on my Linux config, but it does not bring a significant performance increase.
This is mainly because adding temporary variables adds more spill moves, which are slower; sometimes it is better to recompute.
Your hardware has many more available registers, so the performance increase should be more significant there.
A tip: maybe you can try playing with the max register setting (-maxrregcount) in the Makefile. For compute capability 5.0, nvcc (CUDA 10) uses 120 registers.
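To see what a given register cap actually does, a small standalone check like the one below might help. It is only a sketch: the kernel `dummy`, the file name `regs.cu`, and the 256-thread launch bound are made up for illustration and are not from the project. It asks the CUDA runtime how many registers and how much local memory (where spilled values end up) the compiled kernel uses per thread.

```cuda
// regs.cu: hypothetical standalone check, not code from the project.
// Possible build line (sm_50 and the cap of 50 are just the values from this thread):
//   nvcc -arch=sm_50 -maxrregcount=50 -Xptxas -v regs.cu -o regs
// "-Xptxas -v" makes ptxas print register usage and spill stores/loads at compile time.
#include <cstdio>
#include <cuda_runtime.h>

// __launch_bounds__(256) promises the kernel is never launched with more than
// 256 threads per block, which lets the compiler budget registers per kernel.
__global__ void __launch_bounds__(256) dummy(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * in[i] + 1u;
    }
}

int main()
{
    // Query the compiled kernel's resource usage at run time.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, dummy);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread    : %d\n", attr.numRegs);
    // A non-zero local size usually means the compiler spilled registers to memory.
    printf("local memory per thread : %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block   : %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

Running the same check with a few different -maxrregcount values should show where the local memory figure starts to grow, which is the point where spilling begins to cost more than recomputing.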
The random problem you have may also be due to incorrect register sharing between threads, which could explain the strange and random behavior. Reducing the number of registers used, by inlining, also reduces the probability that this happens.
It might be an explanation...
With "-maxrregcount=50" I got a speed of 188 MKeys/s (but there are still errors).