Hi all,
First of all, I'm very glad that there are now lots of people out there trying to improve the kernels. Thanks for doing this!
Second, if I may ask this of you... could you please be very, very careful when introducing these modifications? By careful, I mean test your changes for at least a day or so and see what happens. Also, looking at the disassembly in Kernel Analyzer never hurts.
Third... Most of the changes that I saw deal with removing some adds and stuff. Normally, any modern compiler should be able to remove all useless adds, such as adding two constants, etc. So I saw no reason for these changes to actually help with the speed. However, some of them were helping, which warranted at least some looking into why they help. So...
Fourth. Here's why I think they help. Most compilers will rip out expressions like (a&b)+c and replace them with a single constant, as long as a, b and c are constants. However, they won't do this if you're using an intrinsic in the expression. For instance, if you do a rotate as x<<(32-n) | x>>n, then for constant x and n the whole expression will get replaced with just a single constant. If you use amd_bitalign for this, it will not get replaced with a constant (especially if you do it for Ch/Maj, which is patched - how can it?). On the other hand, if you do a rotate using bitshifts but x or n aren't constants, you'll be slowing things down, because now the compiler can't fold it and you've replaced a single bitalign() with a bunch of ops. So, long story short, fewer intrinsics and more constant expressions is probably the reason your changes help.
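To make the point concrete, here is a minimal sketch in plain C (not OpenCL, and not taken from any of the kernels). ROTR, rotr_call, K_FOLDED and folded_constant are names made up for this illustration, and rotr_call is only a stand-in for an intrinsic like amd_bitalign; it is not how the intrinsic is implemented. The idea is that C only accepts constant expressions as file-scope initializers, so the shift-and-or rotate compiles there, while the function-call version would be rejected.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits (valid for 1 <= n <= 31), using only shifts and OR.
 * With constant arguments this is an integer constant expression. */
#define ROTR(x, n) ((((uint32_t)(x)) >> (n)) | (((uint32_t)(x)) << (32 - (n))))

/* Hypothetical stand-in for an intrinsic such as amd_bitalign: same result,
 * but expressed as a call the language does not treat as a constant. */
uint32_t rotr_call(uint32_t x, uint32_t n)
{
    return (x >> n) | (x << (32u - n));
}

/* A file-scope initializer must be a constant expression in C, so the fact
 * that this line compiles shows the rotate is evaluated at compile time. */
static const uint32_t K_FOLDED = ROTR(0x80000001u, 1);

/* The call-based version is not accepted in the same position:
 * static const uint32_t K_CALL = rotr_call(0x80000001u, 1);   (error in C)
 */

/* Accessor so the folded value can be inspected from other code. */
uint32_t folded_constant(void)
{
    return K_FOLDED;
}
```

An optimizing C compiler may well inline and fold rotr_call anyway, so this only shows the difference in what the language guarantees. Whether a given OpenCL compiler folds a particular intrinsic is something to check in the disassembly.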
PcChip and I tested out this theory: for inputs that are constant, we replaced the rotates that use intrinsics with rotates that use shifts and or-s (i.e. the stuff the compiler can easily see through). We got a 1-2% improvement. PcChip posted the kernel here: http://pastebin.com/NPDTfAVd. We've only done it for the rotates, though... if somebody could go through the kernel carefully and find other constants and constant expressions that could be removed, it would probably help even more.
Fifth. Feel free to donate coins, but also consider donating to PcChip and the original authors of these kernels; these people put a lot of work into the miners that everybody is using, and they deserve donations more than some dude who happened to notice a 3% improvement to Ma().
(also, I'm not being critical or overly judgmental, just sharing some stuff that I happen to know; it is all my personal opinion, I may be wrong, so feel free to criticize and/or disagree)