I didn't see this until just now.
3. Use the last 32 bits % 2^14 as the solution. If the solution == 1968, the block is solved.
How is the difficulty adjusted if the solution is a fixed value rather than a variable target that the hash output must be below?
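For illustration, here is a minimal sketch of the distinction being asked about. The fixed-value check is taken from step 3 as quoted above; the threshold check is the conventional hash-below-target approach and is not something proposed in this thread:

```python
# Minimal sketch of the two kinds of success checks (illustrative only).
# With a fixed solution value, the per-block success probability is pinned
# at 1/2**14 and cannot be tuned; with a target threshold, lowering the
# target makes solutions rarer, which is how difficulty is usually adjusted.

def fixed_value_check(last_32_bits: int) -> bool:
    # As in the quoted step 3: always a 1-in-2**14 chance per block.
    return last_32_bits % 2**14 == 1968

def threshold_check(last_32_bits: int, target: int) -> bool:
    # Conventional variable difficulty: success probability is target/2**14.
    return last_32_bits % 2**14 < target
```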
Ok, here's the latest modification to the PoW.
1. Generate 1 GB of pseudorandom data using SHA512
2. For each 64 KB block, repeat 50 times:
2.1 Use the last 32 bits as a pointer to another 64 KB block
2.2 XOR the two 64 KB blocks together
2.3 AES-CBC encrypt the result using the last 256 bits as the key
3. Use the last 32 bits % 2^14 as the solution. If the solution == 1968, the block is solved.
Expect 1 solution per set.
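A rough, unoptimized sketch of those steps, under some assumptions the description leaves open (the data set is expanded from a seed with SHA512 in counter mode, the 32-bit pointer is taken modulo the number of 64 KB blocks, byte order is little-endian, AES-CBC uses a zero IV, and the third-party pycryptodome package provides AES; the data-set size is scaled down so the demo runs quickly):

```python
# Sketch of the PoW steps above; several details are assumptions, not part of
# the original description (counter-mode SHA512 expansion, pointer taken mod
# the block count, little-endian byte order, zero IV for AES-CBC).
import hashlib
from Crypto.Cipher import AES  # pip install pycryptodome

BLOCK = 64 * 1024               # 64 KB working block
DATA_SIZE = 16 * 1024 * 1024    # demo size; the proposal calls for 1 GB
ROUNDS = 50
SOLUTION = 1968

def generate_data(seed: bytes, size: int) -> bytes:
    """Step 1: expand a seed into `size` bytes of SHA512-based pseudorandom data."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha512(seed + counter.to_bytes(8, "little")).digest()
        counter += 1
    return bytes(out[:size])

def mix_block(data: bytes, i: int) -> bytes:
    """Step 2: 50 rounds of pointer lookup, XOR, and AES-CBC over one 64 KB block."""
    n_blocks = len(data) // BLOCK
    block = data[i * BLOCK:(i + 1) * BLOCK]
    for _ in range(ROUNDS):
        # 2.1 Use the last 32 bits as a pointer to another 64 KB block
        ptr = int.from_bytes(block[-4:], "little") % n_blocks
        other = data[ptr * BLOCK:(ptr + 1) * BLOCK]
        # 2.2 XOR the two 64 KB blocks together
        mixed = int.from_bytes(block, "little") ^ int.from_bytes(other, "little")
        block = mixed.to_bytes(BLOCK, "little")
        # 2.3 AES-CBC encrypt the result using the last 256 bits as the key
        key = block[-32:]
        block = AES.new(key, AES.MODE_CBC, iv=b"\x00" * 16).encrypt(block)
    return block

def solved(block: bytes) -> bool:
    """Step 3: last 32 bits mod 2^14 must equal the fixed solution value."""
    return int.from_bytes(block[-4:], "little") % 2**14 == SOLUTION

if __name__ == "__main__":
    data = generate_data(b"example block header", DATA_SIZE)
    hits = sum(solved(mix_block(data, i)) for i in range(DATA_SIZE // BLOCK))
    # With the full 1 GB (16384 blocks), roughly one solution per set is expected.
    print("solutions in this set:", hits)
```

Note that each round touches the current 64 KB block plus the pointed-to 64 KB block, i.e. about 128 KB of working set per round, which is the footprint the L2-cache point below refers to.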
This will offer a good level of GPU resistance, for the following reasons -
1. Complexity - requires both SHA512 hashing and AES-CBC encryption
2. CPU instruction sets - many CPUs have dedicated AES instructions (AES-NI); GPUs don't
3. SHA512 - more efficient with 64-bit operations, but most GPUs are 32-bit
4. Repeated AES encryption and XORing will keep the L2 cache busy; GPUs will be forced to use slower memory for massive parallelization.
I think more efficient GPU miners will be possible, but they should be delayed and not offer performance gains of more than 2X or 3X.
Even Haswell has roughly 10X fewer FLOPS than top-of-the-line GPUs. Memory latency will be masked down to nearly 0 if the GPU can run enough parallel copies. Your algorithm is also going to be hobbled relative to the GPU by the CPU's roughly 10X slower memory bandwidth (around 20 GB/s) when writing out the initial 1 GB.
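As a back-of-envelope check on the bandwidth point (the 20 GB/s and 200 GB/s figures are illustrative assumptions; only the roughly 10X ratio matters):

```python
# Rough arithmetic behind the bandwidth argument; the absolute figures are
# assumptions, only the ~10X CPU-vs-GPU ratio is the point being made.
GB = 1e9
cpu_bw = 20 * GB    # assumed desktop DRAM write bandwidth, bytes/s
gpu_bw = 200 * GB   # assumed high-end GPU memory bandwidth, bytes/s
data_set = 1 * GB   # the initial 1 GB pseudorandom data set

print(f"CPU writes the 1 GB set in ~{data_set / cpu_bw * 1e3:.0f} ms")  # ~50 ms
print(f"GPU writes the 1 GB set in ~{data_set / gpu_bw * 1e3:.0f} ms")  # ~5 ms
```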
The use of dedicated AES instructions on CPUs might help a little, but I doubt it's enough to stop the GPU from being 10X faster, and ASICs could implement AES to run faster still.
The latency on L2 is still several cycles, while the latency on the GPU will asymptotically go to 0, although you do have that 1 GB requirement to limit the number of copies the GPU can run, i.e. 6 parallel copies on a 6 GB GPU (which might still be sufficient to eliminate most latency).
You might get the expected result of only a 3X GPU speedup, but I am not confident of that.
How does this algorithm validate faster and with less memory? I think I know, but I don't want to say. I want to know what you came up with.
On a rough guesstimate (note I am quite sleepy at the moment), I can't see that you've accomplished anything Litecoin did not already, except the extra DRAM requirement, i.e. it is probably still 10X faster on GPU and vulnerable to ASICs (running with DRAM). Or did I miss something?
Note Litecoin ASICs are apparently in development now.