I am currently doing similar optimization work for the scrypt hashing used in Litecoin. So far I have roughly doubled the performance I get from most of my cards compared to the OpenCL miners. This still sucks big time compared to ATI cards, but it sucks a bit less than before.
With scrypt hashing it appears to be much more difficult to lower the kernel's register count, as the required Salsa20/8 rounds are fairly complex beasts. Also, the memory-hard part of the algorithm really bangs on the memory controller.
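To give a feel for both problems, here is a minimal Python sketch of the Salsa20/8 core as used inside scrypt, plus the size of the scratchpad each hash needs. This is only an illustration, not my CUDA kernel. The function names (`salsa20_8`, `scratchpad_bytes`) are made up for this example, and N=1024, r=1 are the commonly documented Litecoin scrypt parameters, which I am assuming here.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic in 32-bit words


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def quarter_round(x, a, b, c, d):
    """Salsa20 quarter-round on four words of the 16-word state."""
    x[b] ^= rotl32((x[a] + x[d]) & MASK, 7)
    x[c] ^= rotl32((x[b] + x[a]) & MASK, 9)
    x[d] ^= rotl32((x[c] + x[b]) & MASK, 13)
    x[a] ^= rotl32((x[d] + x[c]) & MASK, 18)


def salsa20_8(block):
    """Salsa20/8 core: 4 double-rounds over 16 words, then feed-forward.

    All 16 words are live for the whole function, which is why a GPU
    kernel wants at least 16 registers for the state alone.
    """
    x = list(block)  # work on a copy, the input is needed for the final add
    for _ in range(4):  # 4 double-rounds = 8 rounds
        # column round
        quarter_round(x, 0, 4, 8, 12)
        quarter_round(x, 5, 9, 13, 1)
        quarter_round(x, 10, 14, 2, 6)
        quarter_round(x, 15, 3, 7, 11)
        # row round
        quarter_round(x, 0, 1, 2, 3)
        quarter_round(x, 5, 6, 7, 4)
        quarter_round(x, 10, 11, 8, 9)
        quarter_round(x, 15, 12, 13, 14)
    return [(x[i] + block[i]) & MASK for i in range(16)]


def scratchpad_bytes(n, r):
    """Size of the scrypt scratchpad V: N blocks of 128 * r bytes each."""
    return 128 * r * n


if __name__ == "__main__":
    state = salsa20_8(list(range(16)))
    print(" ".join(f"{w:08x}" for w in state))
    # Assumed Litecoin parameters: N=1024, r=1
    print(scratchpad_bytes(1024, 1), "bytes of scratchpad per hash")
```

With the assumed parameters, every hash in flight needs its own 128 KiB scratchpad, which is first written sequentially and then read back in a data-dependent order. Multiply that by thousands of concurrent threads and it is easy to see why the memory controller becomes the bottleneck.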
Watch out for potential Windows binary releases in the next few days. I will post in the alt cryptocurrency forum.
For LTC or BTC?