Okay, so now I understand: you are sharing the same 1 GB among all threads, and each thread starts its walk from one of the 16,384 chunks in that 1 GB. Chunk size is 64 KB.
So the GPU and ASIC will only need 1 GB for up to 16,384 (1 << 14) threads. This was one of the criticisms about massive parallelization that I made against Momentum for ProtoShares.
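To make the layout concrete, here is a minimal sketch of my reading of the scheme. The constants match the numbers above; the `start_chunk` function and its use of SHA-256 over a nonce are my own assumptions for illustration, not anything the author specified.

```python
import hashlib

CHUNK_SIZE = 64 * 1024                   # 64 KB per chunk
BUFFER_SIZE = 1 << 30                    # one 1 GB buffer shared by all threads
NUM_CHUNKS = BUFFER_SIZE // CHUNK_SIZE   # 16,384 chunks, i.e. 1 << 14

def start_chunk(nonce: int) -> int:
    # Hypothetical: derive a thread's starting chunk index from its nonce,
    # so each of up to 16,384 threads begins its walk somewhere in the 1 GB.
    h = hashlib.sha256(nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(h[:4], "little") % NUM_CHUNKS
```

The point is just that the memory requirement stays at 1 GB total no matter how many of those threads run in parallel, since they all walk the same shared buffer.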
So the GPU will be compute bound on AES, and the ASIC will likely be memory bandwidth bound.
I don't see how the L2 cache is even being employed in your algorithm. You are reading 64 KB chunks of data from a 1 GB working set, so you don't even fit in L3. So it appears you are compute bound on AES.
The main-memory bandwidth on a desktop-grade Intel Core CPU is about 20 GB/s, so clearly your algorithm is AES compute bound at 3 GB/s.
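A back-of-the-envelope way to see which resource is the bottleneck, using the figures quoted above (the `min()` model is my simplification: effective throughput is capped by the slower of the two resources):

```python
MEM_BW_GBPS = 20.0   # desktop-grade Intel Core main-memory bandwidth (quoted above)
AES_GBPS = 3.0       # AES hashing throughput of the algorithm (quoted above)

# Effective throughput is limited by whichever resource saturates first.
effective = min(MEM_BW_GBPS, AES_GBPS)
bottleneck = "AES compute" if AES_GBPS < MEM_BW_GBPS else "memory bandwidth"
print(f"{effective} GB/s, {bottleneck} bound")   # prints "3.0 GB/s, AES compute bound"
```

So on the CPU the algorithm leaves roughly 17 GB/s of memory bandwidth on the table, which is exactly the headroom an ASIC with hardware AES could exploit until it hits its own memory-bandwidth wall.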
I don't know what it will cost to pair a fast memory interface with an ASIC, but it should be orders of magnitude faster and lower power than the CPU.
The only thing holding the GPU back is the lack of specialized AES circuitry, but note my prior post documenting that such circuitry doesn't require many transistors. GPU manufacturers could decide to add this, or perhaps someone figures out a way to piggyback a cheap $50 ASIC on a mid-range GPU memory bus to get 50X performance.
Also, perhaps someone can put an ASIC (or FPGA) or several Intel Core i5s on a PCIe card, since there is no sequential memory bound in this algorithm.
Note there is a way to make a scrypt-like hash sequential, CPU-only, and fast to validate. That was my major breakthrough recently.
It appears you bought some time against GPUs and ASICs, but as far as I can see you don't have a CPU-only coin forever into the future.
P.S. I am guessing 1968 is the year you were born.