I just pushed an optimized GCN assembly version of ethash-new.cl for the RX 470/480 to the repo.
Each card should get a 1 MH/s boost with it. If this actually works, I will extend support to GCN1/GCN3 devices.
(I sold all of my GCN2 cards a while back...)
That's not optimized - you flipped the SLC and GLC bits, which will likely make it a tad SLOWER; it did when I tried that.
I was expecting just SLC (bypass L2) to help, though I recall Wolf's comments about GLC (bypass L1) actually helping. I'd even expect GLC to hurt performance if you weren't very careful to ensure data was read in 64-byte chunks.
p.s. There are also some easy optimizations to make with instruction reordering (though they might not make much difference in performance). For example:
/*d11c6a3e 01a9013c*/ v_addc_u32 v62, vcc, v60, 0, vcc
/*2a7e62b2 */ v_xor_b32 v63, 50, v49
/*dc5c0000 4000003d*/ flat_load_dwordx4 v[64:67], v[61:62] slc glc
/*dc5c0000 3b00003b*/ flat_load_dwordx4 v[59:62], v[59:60] slc glc
/*bf8c0171 */ s_waitcnt vmcnt(1) & lgkmcnt(1)
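The v_xor_b32 can be moved to after the two flat_load_dwordx4 instructions. It only reads v49 and writes v63, neither of which the loads touch, so the loads can be issued one instruction earlier and the xor runs while they are in flight.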
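A sketch of what the reordered sequence could look like. This is just the listing above with the v_xor_b32 moved below the loads. The encodings are copied unchanged, since moving an instruction does not alter its encoding. It has not been assembled or benchmarked, so treat it as an illustration rather than a tested patch:

```asm
/*d11c6a3e 01a9013c*/ v_addc_u32 v62, vcc, v60, 0, vcc
/*dc5c0000 4000003d*/ flat_load_dwordx4 v[64:67], v[61:62] slc glc
/*dc5c0000 3b00003b*/ flat_load_dwordx4 v[59:62], v[59:60] slc glc
/* moved down: reads v49, writes v63, independent of both loads */
/*2a7e62b2 */ v_xor_b32 v63, 50, v49
/*bf8c0171 */ s_waitcnt vmcnt(1) & lgkmcnt(1)
```

The xor stays ahead of the s_waitcnt so that it executes while the loads are outstanding instead of after the wait.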