I'm not convinced that the only way to parallelize these algorithms is the way it's being done currently. It is often possible to write GPU code where several threads, or even a whole warp/wavefront, work collectively on an algorithm step. I haven't looked at the details of scrypt-chacha specifically, but I wouldn't be surprised if there are alternative formulations of the algorithm beyond the one you refer to. The top-end GPUs today have 8GB to 12GB of RAM. In the next two years, there will be GPUs and other GPU-like hardware (e.g. Xeon Phi) with significantly more memory than they have now, likely in the range of 32GB. I've read analyst articles expecting Intel to put at least 16GB of eDRAM onto the next Xeon Phi (though likely on its own separate die), a much larger-scale variant of what Intel is already doing for integrated graphics. Next week is NVIDIA's GPU conference; perhaps there will be some public announcements about what they're doing for their next-gen GPUs.
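To make the "whole warp working collectively" idea concrete, here's a minimal CUDA sketch of a warp-cooperative scratchpad lookup. To be clear, this is not scrypt-chacha's real block mix and not any shipping miner's kernel; the kernel name, memory layout, and the way the lookup index is derived are all illustrative assumptions.

```cuda
#include <cstdint>

// Illustrative only: one warp (32 threads) cooperates on one scrypt-style
// hash.  With r = 1 a block is 128 bytes = 32 uint32_t words, so each lane
// owns exactly one word.  V points at a scratchpad region of N such blocks
// per warp.  The lookup index j must be identical across the warp, so lane
// 0's value is broadcast with __shfl_sync.  The real ChaCha block mix is
// omitted; assumes blockDim.x is a multiple of 32.
__global__ void warp_cooperative_lookup(const uint32_t* __restrict__ V,
                                        uint32_t* __restrict__ X,
                                        unsigned N, unsigned iterations)
{
    const unsigned lane = threadIdx.x & 31;                      // lane within the warp
    const unsigned warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;

    const uint32_t* myV = V + (size_t)warp * N * 32;             // this warp's scratchpad
    uint32_t x = X[warp * 32 + lane];                            // one word of the working block

    for (unsigned i = 0; i < iterations; ++i) {
        // Broadcast lane 0's word and use it as the (pseudo)random block index.
        unsigned j = __shfl_sync(0xffffffffu, x, 0) % N;

        // All 32 lanes read consecutive words of block j: one coalesced
        // 128-byte transaction instead of 32 scattered loads.
        x ^= myV[(size_t)j * 32 + lane];

        // ...a real kernel would run the ChaCha20/8 block mix here...
    }

    X[warp * 32 + lane] = x;                                     // write the result back
}
```

The point is simply that the per-hash scratchpad traffic can be split across 32 lanes so the loads coalesce, rather than each hash living in a single independent thread.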
That's all good information, and I'm sure you're quite correct that there are alternate ways to rework these hashes for new hardware - that's the one aspect of mining my knowledge is very shallow on. Regarding those badboy GPUs with 8 and 12GB of memory - they aren't all that common, and their cost would be prohibitive compared to running multiple "smaller" GPUs. That's where the schedule of increasing N comes in - it takes a stab at where computing power will be in the future, and it could be very wrong, but it still scales up the memory requirements over time.
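To put rough numbers on that scaling, here's a quick back-of-the-envelope calculation using the standard scrypt scratchpad size of 128 * r * N bytes with r = 1; the Nfactor values in the list are placeholders for illustration, not the coin's actual schedule.

```cuda
#include <cstdio>
#include <cstdint>

// Host-only, back-of-the-envelope calculation of per-hash scratchpad size as
// N is raised over time, plus how many hashes would fit in an 8 GB card.
// Uses the standard scrypt figure of 128 * r * N bytes (r = 1 here); the
// Nfactor schedule below is made up for illustration.
int main()
{
    const unsigned r = 1;
    const unsigned nfactors[] = {14, 15, 16, 17, 18, 19, 20};   // N = 2^Nfactor

    for (unsigned nf : nfactors) {
        uint64_t N     = 1ull << nf;
        uint64_t bytes = 128ull * r * N;                         // scratchpad per hash
        printf("N = 2^%-2u -> %7.1f MiB per hash, ~%6.0f concurrent hashes in 8 GB\n",
               nf, bytes / (1024.0 * 1024.0),
               (8.0 * 1024 * 1024 * 1024) / bytes);
    }
    return 0;
}
```

Multiply the per-hash figure by the number of hashes a card keeps in flight and you can see how quickly a fixed-memory GPU runs out of headroom as N climbs.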
The Xeon Phi looks to be an interesting beast, and I was unfamiliar with it until you brought it up. The specs call for 61 cores at 1.238 GHz for the high-end model, which doesn't sound massively parallel. Time will tell, but I'm not going to be the guinea pig who plunks down $4,000 USD to find out.
Christian Buchner (of cudaminer) has been in touch with nVidia, and they're on board with crypto-mining - I believe they've even assisted with optimizing some of his kernel code. Regarding their upcoming announcements, I do know that their Maxwell architecture is already providing improved performance per watt on the mid-range 750 Ti card.
The NVIDIA Titan cards have 6GB, and they tend to hover close to a kilobuck. The Xeon Phi (in its current form at least) is no match for state-of-the-art GPUs. I mentioned Xeon Phi only to provide context: multiple vendors are already building hardware with large, high-bandwidth memories now, and the memory capacities (again, from multiple vendors) are expected to go up dramatically in the near future.
I understand the sticker shock - $2K to $4K is costly - but then that's what the high-end ASIC boards cost for cryptocurrency mining. To keep the ASICs out, there just has to be a commodity option that's price-competitive, and there are definitely high-end GPUs that, for a memory-bandwidth-bound algorithm at least, should still be more cost-effective than low-volume ASIC boards at any non-trivial memory capacity.
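As a rough sanity check on the bandwidth-bound argument, here's a quick roofline-style estimate of the hashrate ceiling, assuming every hash writes the scratchpad once and reads it back once; the 300 GB/s bandwidth and the N value are placeholder assumptions, not measurements.

```cuda
#include <cstdio>

// Rough ceiling on hashrate for a memory-bandwidth-bound scrypt-like hash:
// traffic per hash is about 2 * 128 * r * N bytes (scratchpad written once,
// read once), so hashrate <= bandwidth / bytes_per_hash.  The bandwidth and
// N figures below are illustrative assumptions.
int main()
{
    const double bandwidth_gbs  = 300.0;                  // assumed GPU memory bandwidth, GB/s
    const double r              = 1.0;
    const double N              = 32768.0;                // N = 2^15, placeholder
    const double bytes_per_hash = 2.0 * 128.0 * r * N;

    double max_hashrate = bandwidth_gbs * 1e9 / bytes_per_hash;
    printf("~%.0f kH/s upper bound at %.0f GB/s, N = %.0f\n",
           max_hashrate / 1e3, bandwidth_gbs, N);
    return 0;
}
```

The exact numbers don't matter much; the point is that the ceiling is set by memory bandwidth and capacity, which is exactly the resource commodity GPUs deliver cheaply.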