Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a ratio of approximately 5:6. In other words, when a Rx 470 has a memory clock of 1750Mhz, the core would need to be clocked at 1750 * 5/6 = 1458Mhz in order to achieve maximum performance.
If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth. There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU. However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group. There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.
This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler. Canis lupus?
Why is it MY problem?
I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.
I did this for about 2 months with the radeon 480 cards. Was a very time consuming process that did yield some interesting results.
I am now looking into miner development. I think there is a bigger picture here, but that is for private discussion.