X11 is just a "watered" down version of Scrypt, it's less memory intensive and isn't as efficient as Scrypt.
I've seen a few threads full of people wondering and coming up with ideas as to why X11 runs cooler than Scrypt. The leading theory seems to be that because Scrypt is memory intensive, it causes more power use. This is incorrect; in fact, I believe the opposite may be true - that the memory accesses of Scrypt may make it run cooler than it otherwise would.
X11 actually uses far more memory than it needs to, which makes it a lot easier on the core than it should be - it's waiting on memory half the time. While it's nowhere near as memory-intensive as Scrypt, it's still slow partially for this reason. That's why it runs cooler - the compute capability isn't being used very much. Along those same lines, it runs cool because of a lack of occupancy, which means less concurrency. AMD talks about occupancy in terms of "concurrent waves" or "waves in flight." If you'll see here (nsfw):
https://ottrbutt.com/tmp/skein-analysis.png - you see Skein is using 73 registers, limiting it to three waves in flight. Many of the hashes in X11 have the same or worse occupancy because of their VGPR usage. You can see an analysis of a rewritten Skein that hasn't been optimized well yet here (nsfw):
https://ottrbutt.com/tmp/skein-analysis2.png - 51 registers, allowing another wave in flight, which is a big win. Additionally, drop three more registers from that, and you get yet another. This uses a lot more of the card's compute capabilities.
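For anyone who wants to check the register numbers themselves, here's a rough sketch of how VGPR count maps to waves in flight. It assumes the commonly documented GCN figures (a 256-VGPR budget per SIMD lane, registers handed out in blocks of 4, and a cap of 10 waves per SIMD) and ignores every other limit such as SGPRs and local memory. The function name is just something I made up for illustration.

```python
# Rough GCN occupancy estimate from VGPR usage alone.
# Assumed hardware figures (not taken from the screenshots above):
VGPR_BUDGET = 256   # VGPRs per SIMD lane, shared by all resident waves
VGPR_GRANULE = 4    # VGPRs are allocated in blocks of this size
MAX_WAVES = 10      # hard cap on waves per SIMD


def waves_in_flight(vgprs: int) -> int:
    """Return how many waves fit on one SIMD given a kernel's VGPR count."""
    if not 1 <= vgprs <= VGPR_BUDGET:
        raise ValueError("VGPR count must be between 1 and 256")
    # Round the request up to the next allocation block.
    allocated = -(-vgprs // VGPR_GRANULE) * VGPR_GRANULE
    return min(MAX_WAVES, VGPR_BUDGET // allocated)


for label, vgprs in [("original Skein", 73),
                     ("rewritten Skein", 51),
                     ("three registers fewer", 48)]:
    print(f"{label}: {vgprs} VGPRs -> {waves_in_flight(vgprs)} waves")
```

Under those assumptions the three cases come out to 3, 4 and 5 waves, which lines up with the numbers in the screenshots.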
The X11 kernel also takes a hit because AMD is quite sensitive to code size - the code cache on GCN devices (7xxx and up) is 32 KB, and the Shavite implementation used in SGMiner 5 weighs in at 76,680 bytes. This might not be so bad if it didn't use many registers - more waves in flight help you hide memory access times - but it's limited to four waves in flight by its local memory usage, and worse, to two waves in flight by its VGPR usage. Reducing the VGPR usage enough to gain more waves in flight increased branching (decision making, something GPUs don't do very well) to the point where it was slower than the original, so I instead worked to make it fit in the code cache. In the process, I increased the SGPR usage from 18 to 19 and decreased the VGPR usage from 107 to 93, but most importantly, I shrunk the code size to 29,820 bytes with a simpler, more elegant implementation.
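To put the Shavite numbers side by side, here's a minimal sketch of the two checks involved: whether the kernel fits in the 32 KB code cache, and which resource ends up setting the wave count (the lowest limit wins). The per-resource wave limits are simply the ones quoted above, and the helper names are hypothetical.

```python
CODE_CACHE_BYTES = 32 * 1024  # GCN code cache size quoted above


def fits_in_code_cache(code_bytes: int) -> bool:
    """True if the whole kernel can sit in the code cache at once."""
    return code_bytes <= CODE_CACHE_BYTES


def effective_waves(*limits: int) -> int:
    """Occupancy is set by the tightest per-resource limit."""
    return min(limits)


# Shavite in SGMiner 5: 4 waves allowed by local memory, 2 by VGPRs.
print("waves in flight:", effective_waves(4, 2))

for label, size in [("SGMiner 5 Shavite", 76_680),
                    ("rewritten Shavite", 29_820)]:
    ratio = size / CODE_CACHE_BYTES
    print(f"{label}: {size} bytes, {ratio:.2f}x the cache, "
          f"fits: {fits_in_code_cache(size)}")
```

The original is well over twice the cache size, while the rewrite leaves a bit of room to spare, even though both are still stuck at two waves.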
JH needed the opposite approach - the SGMiner 5 code comes to 52,796 bytes, well over 32 KB - but in this case, the implementation is so fucked that it's using 91 VGPRs, limiting it to two concurrent waves. A rewritten, decent version drops that to 71 VGPRs, allowing three waves in flight. While the code size is 52,676 bytes - basically the same - the extra wave in flight helps hide memory access times, which makes it faster.
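If it isn't obvious why one extra wave matters so much, here's a toy model of latency hiding. It assumes each wave alternates between a burst of ALU work and a wait on memory, and that the SIMD can switch to any other resident wave for free while one is waiting. The cycle counts are made up purely for illustration, not measured from JH.

```python
def simd_utilization(waves: int, alu_cycles: float, mem_cycles: float) -> float:
    """Fraction of time the SIMD has ALU work to do in this toy model."""
    if waves < 1 or alu_cycles <= 0 or mem_cycles < 0:
        raise ValueError("need at least one wave and positive ALU time")
    # Each wave keeps the ALU busy for alu_cycles out of every
    # (alu_cycles + mem_cycles); extra waves fill the gaps until
    # the ALU is saturated.
    return min(1.0, waves * alu_cycles / (alu_cycles + mem_cycles))


ALU_CYCLES = 100  # hypothetical compute burst per wave
MEM_CYCLES = 300  # hypothetical memory wait per wave

for waves in (2, 3):
    busy = simd_utilization(waves, ALU_CYCLES, MEM_CYCLES)
    print(f"{waves} waves in flight -> core busy {busy:.0%} of the time")
```

With those made-up numbers, going from two waves to three takes the core from 50% to 75% busy. Real kernels are messier, but that's the general shape of the win.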
There are more examples, but I think this serves to make the point, and put the question to rest.