So, the algorithm is fine as it is. If you increase the amount of memory required, you end up with a GPU-favoured implementation of scrypt.
I don't understand this line, but the rest of your post is welcome commentary that I do intend to provide counter-arguments for.
I would assume that the more memory required, the *less* feasible GPU mining becomes. For instance, you could (if artforz released the code) mine scrypt coins with a GPU, but it would be so inefficient that you might as well just mine them with the CPU. My understanding is that increasing the amount of memory required further would make GPUs even more pitiful. If you kept increasing the memory required, CPUs would lose hash power too. Some CPUs with smaller and/or slower caches (or inefficient cache usage) would fail to keep up. This would push innovation in CPU memory management, as people try to design ways to make CPUs address large caches faster or make more efficient use of L2 and L3 cache.
We would first see more efficient mining software, just as people keep improving the existing scrypt miners, but ultimately we would be pushing for CPUs that continuously improve at memory-hard math.
Although you argue it is difficult to make large amounts of cache easy to address, there is room for competition and innovation in this area as people push the boundaries of what is possible with the CPU.
Yes, it sounds like a lot of very difficult work, I agree, but that's the whole idea. It is a speculation market for emerging CPU technology.
Short version: compared to (1024,1,1), increasing N and r actually helps GPUs and hurts CPUs.
Longer version:
While things are small enough to fit in L2, each CPU core can act mostly independently and has pretty large read/write BW. Make it big enough to hit external memory and you've got ~15GB/s shared between all cores.
Meanwhile, GPU caches are too small to be of much use, so with random reads at 128B/item a 256-bit GDDR5 bus ends up at well under 20% of peak BW; at 1024B/item that percentage increases very significantly.
End result: a 5870 ends up about 6 times as fast as a PhenomII for scrypt(8192,8,1) (without really trying to optimize either side, so ymmv).
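To put numbers on the two parameter sets, here is a rough sketch of the arithmetic. It assumes the standard scrypt layout, where the V array holds N entries of 128*r bytes each, which is where the 128B/item and 1024B/item figures come from. The 512 KiB L2 size is only an illustrative assumption, not a measured figure for any particular CPU.

```python
# Assumption: standard scrypt layout, where the scratchpad V holds
# N entries of 128 * r bytes each (one V array per lane).

L2_BYTES = 512 * 1024  # hypothetical per-core L2 size, for illustration only


def item_bytes(r):
    """Bytes fetched by one random read into V."""
    return 128 * r


def v_array_bytes(n, r):
    """Total size of the V array for one scrypt lane."""
    return n * item_bytes(r)


def describe(n, r, p, l2_bytes=L2_BYTES):
    """Summarize item size, V size, and whether V fits in the assumed L2."""
    v = v_array_bytes(n, r)
    where = "fits in L2" if v <= l2_bytes else "spills to external memory"
    return f"scrypt({n},{r},{p}): {item_bytes(r)} B/item, V = {v // 1024} KiB, {where}"


if __name__ == "__main__":
    for params in [(1024, 1, 1), (8192, 8, 1)]:
        print(describe(*params))
```

So (1024,1,1) needs a 128 KiB V array with 128-byte reads, while (8192,8,1) needs 8 MiB with 1024-byte reads: too big for the cache on the CPU side, and a much friendlier access size for a wide GDDR5 bus.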
The only way to make scrypt win on CPU-vs-GPU again would be to go WAAAY bigger: think > 128MB V array, so GPUs don't have enough RAM to run enough parallel instances to mask latencies. But that also means it's REALLY slow (hash/sec? sec/hash!), and you need the same amount of memory to check results. Now who wants a *coin where a normal node needs several seconds and 100s of megs to gigs of RAM just to check a block PoW for validity?
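A back-of-the-envelope sketch of the "not enough RAM for parallel instances" point. The 1 GiB card size is an assumption for illustration, and the count ignores everything else a miner would need to keep in GPU memory, so treat the numbers as upper bounds.

```python
MIB = 1024 * 1024
GPU_RAM_BYTES = 1024 * MIB  # assumed 1 GiB card, for illustration only


def max_parallel_instances(gpu_ram_bytes, v_bytes):
    """Upper bound on concurrent scrypt instances: one full V array each."""
    return gpu_ram_bytes // v_bytes


# V array sizes: 128 KiB for (1024,1,1), 8 MiB for (8192,8,1),
# and the > 128MB case discussed above (taken here as exactly 128 MiB).
V_SIZES = {
    "scrypt(1024,1,1)": 128 * 1024,
    "scrypt(8192,8,1)": 8 * MIB,
    "128 MiB V array": 128 * MIB,
}

if __name__ == "__main__":
    for name, v_bytes in V_SIZES.items():
        count = max_parallel_instances(GPU_RAM_BYTES, v_bytes)
        print(f"{name}: at most {count} instances in flight")
```

Under these assumptions the card goes from thousands of instances in flight at (1024,1,1) to single digits at 128 MiB, which is far too few to hide memory latency. The same 128 MiB is what every verifying node would have to allocate per PoW check.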