Yes it is very simple and elegant After the Fact... but I have posted bounties trying to find better proof of work and spent weeks trying to find a way to make a fast to validate memory hard problem. Then I had to find a way to make it 'scale' the difficulty smoothly. Lots of little things to make something simple, elegant, and algorithmically secure.
Interesting idea, but for crypto-coins a proof-of-work scheme that isn't a random lottery - that is if not every attempt at creating a valid PoW has an equal chance of succeeding - can be really problematic because it means faster miners have an advantage. You give an example of a system where miners wouldn't want to add a transaction to the block they were mining because they'd have to start over. Such a system would mean that whom ever had the fastest implementation of the scheme would find the majority of the blocks, which really rewards people with highly-tuned custom hardware implementations - exactly what you are trying to avoid.
I'm also extremely skeptical of the idea that you've actually created a ASIC resistant scheme. You mention parallelism in your paper as a possible problem, but brush it off assuming that a hash table would be the optimal implementation, and lock contention and "atomic operations" would prevent a highly parallel implementation; I'm very, very skeptical that you're correct.
Fundamentally your design has two basic primatives: the generator function G(m, k)=H(m + k) producing candidate digests, and the
content addressable memory that stores potential digests and allows for matches to be searched for. The problem is that a solution is defined as successful if
any a and b are found such that G(m, a) & (2^d)-1 == G(m, b) & (2^d)-1 for some difficulty d, a small positive integer. (a minor correction form your paper; you forgot to include the masking operation that makes finding a and b possible at all)
As you hint an ideal operation will run multiple generators in parallel - the problem is that an optimal implementation of the content addressable memory is very probably not a simple hash table. Here we have a situation with really weak demands on the CAM: it's ok if it doesn't always find a match, it's ok if there is no global synchronization, it's ok if sometimes it returns a false positive, and worst of all it doesn't even have to actually store all the data! Dedicated silicon implementations of CAMs are already really common for things like network routers, and they have vastly better performance than lookup tables built from commodity memory and CPU's. They also use a plethora of complex and application specific tricks to get the performance they need, even going as far as to make use of analog computation and probabilistic retrieval.
For instance off the top of my head a very fast design with very good utilization of silicon would be to use a custom ASIC consisting of a number of generator units feeding their results into a probabilistic CAM. A nice trick we can take advantage of is that for each candidate digest the generator function produces, we only actually need to store it's index to recreate it. That is if G(m, i)=d_i, we only need to store i, and we can even cheat further by only storing some of the bits of i, doing a quick brute force search after the fact to figure out which actual i was the match.
Hardware CAMs are usually implemented as a series of cells, with some number of search lines connected to each cell in parallel. Each cell matches the search line against it's own contents, asserting a series of read-out lines if the contents match. Writing a value to a cell is similar to regular memory. Matches are done very quickly, a single cycle, but at the cost of high power consumption as the memory grows larger. In our case we want a match to be done on the value of G(m, i), and we want the CAM cell to return the index i.
Lets suppose the difficulty is such that we're finding 64-bit birthday collisions or 2^32 items for a 50% chance of collision. This means our
index values will need to be about 32-bits, as we have to search from 0 to 2^32 for a 50% chance of finding a collision. Naively we've have to store 64 bits per value for the digest, and 32 bits for the index, or 96bits * 2^32 = 48GiB. But we can do a lot better... First of all suppose only store 24 bits of digest in each cell: by the time we've got 2^32 items we'll get on average 2^8 false positives - pretty manageable with some parallelism to test them all. Next we can split our gigantic CAM array into multiple independent arrays, say 256 of them. Now for that 2^8 false positives, we only need to store 16 bits - pretty good! As for the indexes, we can cut down on them too: lets drop the lowest 8 bits, and just bruteforce the solution for the previous candidate digest at a cost of 2^7 average operations. Sure, that wasn't free, but we're now down to just 48 bits per cell, or 24GiB total.
Now I'm not giving you exact numbers, but that's not the point: I'm showing you how what's optimal turns out to be a crazy-looking hyper-optimized set of custom ASICs. Heck, a quick read of the literature on CAM design says that you'd probably go even further in this case, doing really crazy stuff like using multi-level kinda analog DRAM technology in the CAM cells, making convoluted trade-offs between actually storing indexes, or just leaving them implied by what bank the results are being stored in, etc.
I really suspect that rather than creating a ASIC hard design where commodity hardware has the same profitability as anything else, you've actually done the exact opposite and created a PoW function where the optimal implementation costs a tonne of money to implement, involves custom ASICs, and has performance miles ahead of anything else. Even worse is that with the non-lottery-random "momentum" aspect of what you're doing whomever implements this crazy custom hardware first will not only have the highest hashing rate, but they'll also solve the PoW problems the fastest, and hence get nearly all blocks.
Finally note that if G() was made to be made non-parallizable the idea might work, for instance by defining it as G(i)=H(m+k) ^ G(i-1), but then you wouldn't be able to do verification cheaply and might as well just use scrypt.
tl;dr: Cool idea, but the end result is probably the exact opposite of what you want.