On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?
I still have some work to do before I write my own miner from scratch. I like to *really* understand the problem before I start writing code, and there are still some parts of the GCN architecture that I'm figuring out.
Eth miners max out at around 93% of the theoretical maximum. 24Mh/s is the theoretical max for a R9 380 with 6Gbps memory, and I've been able to get 22.3Mh out of a couple cards. You'll never reach 100% because DRAM refresh consumes some of the bandwidth, perhaps as much as 5%.
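As a rough sanity check of that figure, here's the usual back-of-the-envelope Ethash calculation: each hash does 64 random reads of a 128-byte DAG page, so memory bandwidth bounds the hashrate. The bus width and per-pin rate below are the R9 380's stock numbers; the result lands just under the 24Mh/s cited, with the gap down to rounding.

```python
# Bandwidth-bound Ethash hashrate estimate for a R9 380 (stock memory).
bus_bits = 256                 # memory bus width in bits
gbps = 6                       # per-pin data rate in Gbps
bandwidth = bus_bits / 8 * gbps * 1e9   # bytes/s -> 192 GB/s

# Each Ethash hash reads 64 DAG pages of 128 bytes = 8 KiB of traffic.
bytes_per_hash = 64 * 128
max_mhs = bandwidth / bytes_per_hash / 1e6   # ~23.4 MH/s, near the 24 MH/s above
```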
p.s. I also have another idea that should work on 4GB cards. The miner could use 12-slot bins of 32 bytes, just like silentarmy, but start a fresh table every round instead of using 2 tables in a double-buffered fashion. This would use 384MB * 9 =~ 3.5GB, but then the first write to any row could write 32 bytes of dummy data along with the 32-byte collision record, which avoids the read-before-write. The same trick works for the 2nd through 6th writes by filling the even slots before the odd ones.

This would reduce the average IO per round to 2^20 * 3 * 64 bytes, or 192MB per round and 1.728GB per iteration. That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second. Applying the 93% of theoretical that eth miners achieve, that would give real-world performance of about 225 sols/s.