While my initial analysis focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus. On AMD GCN, each memory controller can transfer 64 bytes (1 cache line) per clock. In SA5, the ht_store function, in addition to updating the row counters, does 4 separate memory writes for most rounds (3 writes for the last couple of rounds). All of these writes are either 4 or 8 bytes, so far fewer than 64 bytes per clock are being transferred to the L2 cache. A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction. This means a modified ht_store thread could update a row slot in 2 clocks. If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block. This would mean each row update operation could be done in 2 GPU core clock cycles: one for the counter update, and one for updating the row slot.
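The clock counts above can be checked with a rough model. This is only a sketch under stated assumptions: the 32-byte slot size is inferred from "2 clocks at 16 bytes per instruction" rather than taken from the SA5 source, the function names are made up for illustration, and the model assumes that writes from different threads landing in the same 64-byte line complete in the same clock.

```python
CACHE_LINE_BYTES = 64          # one cache line per memory controller clock
MAX_BYTES_PER_INSTRUCTION = 16 # dwordX4 write from a single thread
SLOT_BYTES = 32                # ASSUMPTION: inferred from 2 clocks x 16 bytes


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def slot_update_clocks(threads, slot_bytes=SLOT_BYTES):
    """Clocks needed to write one row slot when `threads` threads share the write."""
    if slot_bytes > CACHE_LINE_BYTES:
        raise ValueError("this model only covers slots that fit in one cache line")
    bytes_per_thread = ceil_div(slot_bytes, threads)
    # Each thread issues as many 16-byte writes as it needs; the threads
    # are assumed to write different parts of the same line in parallel.
    return ceil_div(bytes_per_thread, MAX_BYTES_PER_INSTRUCTION)


def row_update_clocks(threads, slot_bytes=SLOT_BYTES):
    """One clock for the row counter update plus the slot write itself."""
    return 1 + slot_update_clocks(threads, slot_bytes)


for n in (1, 2, 4):
    print(n, "thread(s):", slot_update_clocks(n), "clock(s) per slot,",
          row_update_clocks(n), "per row update")
```

With one thread the slot write takes 2 clocks; with 2 or 4 threads it takes 1, which gives the 2 clocks per row update mentioned above.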
Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, with a required core:memory clock ratio of approximately 5:6. In other words, when an RX 470 has a memory clock of 1750MHz, the core would need to be clocked at 1750 * 5/6 = 1458MHz in order to achieve maximum performance.
If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth. There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU. However, the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group. There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.
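To make the two core:memory ratios concrete, here is a small calculation for the 1750MHz memory clock used in the RX 470 example. The function name is hypothetical, and the ratios are simply the 5:6 and 1:2 figures from the estimates above, not measured values.

```python
def required_core_mhz(memory_mhz, core_parts, memory_parts):
    """Core clock needed to keep up with memory at a core:memory ratio."""
    return memory_mhz * core_parts / memory_parts


memory_mhz = 1750  # RX 470 memory clock from the example above

# Row counters in global memory: core:memory of roughly 5:6
print(int(required_core_mhz(memory_mhz, 5, 6)))  # 1458

# Row counters in LDS or GDS: core:memory of 1:2
print(int(required_core_mhz(memory_mhz, 1, 2)))  # 875
```

At 1:2 the required core clock is well below typical core clocks, which is why external memory bandwidth would become the limit again.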
This likely means that writing a top-performance ZEC miner for AMD is the domain of someone who codes in GCN assembler. Canis lupus?
Core speed has more of an effect on 480s but they are still limited by memory bandwidth.
I find it very funny that you are still defending the 290 and 390. Here is some information about the memory of the 4xx series:
According to AMD, the efficiency of a CU has increased by 15% compared to the Radeon R9 290. When processing tessellation in conjunction with heavy-duty AA, the efficiency gains can be double or even triple. Data compression is supported, which improves memory bandwidth. In particular, the Delta Color Compression algorithm is supported, which encodes color differences. We covered this technique in our description of the NVIDIA Pascal architecture. AMD has supported this kind of compression on earlier cards, including the Radeon Fury X, but the algorithms are more effective on Polaris 10. With this increase in efficiency, the chip makes do with a 256-bit memory bus. The Radeon RX 480 uses GDDR5 memory chips with an effective data rate of 8 GHz.
On top of that, AMD has introduced new instructions!!! such as FP16 and Int16.
I think Claymore's miner does not use them, and for this reason the top-end 480 currently does not operate at full capacity. Instead it uses the old memory instructions, which is how he got the old 7xxx cards to work at such speeds.
Compared to the 7xxx, the 290 and 390 (and the 480 as well) could be made even faster by using the new memory instructions that only the newer 290/390/480 models support. Only the 7xxx models would suffer, which would have a big impact on mining overall, since those models would immediately lose their significance in mining.
http://i11.pixs.ru/storage/1/6/0/03amdradeo_6383058_24215160.png