tromp, the bottom line is that when I wrote that document I was trying to find a way to make a PoW where CPUs would be as power efficient as GPUs and ASICs, so that a professional miner wouldn't be more profitable than a CPU miner. That was back when I hadn't yet contemplated the solution of making mining unprofitable. And in the portion of the paper I omitted, I concluded that I had failed (even after fiddling with the various SRAM caches on Intel CPUs). Even if one makes a Scrypt-like random walk through memory entirely latency bound, a GPU can run multiple instances until the latency is masked by computation and the workload becomes computation bound or memory bandwidth bound. And I believe in both cases the GPU will be more power efficient at the computation being performed.
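To make that latency-masking argument concrete, here is a toy throughput model. All the figures are illustrative assumptions on my part, not measurements of any real GPU:

```python
# Toy throughput model for latency masking on a GPU (illustrative numbers only).
# A single instance of a latency-bound random walk completes one dependent
# memory access per step. Running k independent instances overlaps their
# latencies, until either compute throughput or memory bandwidth becomes the
# binding limit instead.

def steps_per_second(instances, mem_latency_s, compute_s_per_step,
                     bandwidth_bytes_s, bytes_per_step):
    latency_bound   = instances / mem_latency_s          # k walks in flight
    compute_bound   = 1.0 / compute_s_per_step           # ALU limit (shared)
    bandwidth_bound = bandwidth_bytes_s / bytes_per_step # DRAM traffic limit
    return min(latency_bound, compute_bound, bandwidth_bound)

# Hypothetical GPU figures: 400 ns DRAM latency, 1 ns of compute per step,
# 150 GB/s bandwidth, one 64-byte cache line fetched per step.
for k in (1, 10, 100, 1000):
    rate = steps_per_second(k, 400e-9, 1e-9, 150e9, 64)
    print(k, f"{rate:.3g}")
```

With these made-up numbers the walk stays latency bound until roughly a thousand instances are in flight, at which point it crosses over to compute bound, which is exactly the masking effect I am describing.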
What Cryptonote's CryptoNight PoW hash apparently does is make it impossible to run enough instances on a typical GPU (with its limited memory of say 6GB, unless one were to customize it) to overcome the AES-NI instructions incorporated into the memory hard algorithm, since the GPU is apparently only at par with AES-NI in computational efficiency. Or, what I think is really going on (though I haven't confirmed it) is that CryptoNight is AES-NI bound, so the GPU remains only at parity. Which is exactly the direction I investigated next in 2014 after abandoning memory hard PoW (even NP-complexity-class asymmetric variants such as yours). Also CN attempts to fiddle with the sizes of the various SRAM caches, but that can be a pitfall against ASICs or Tilera or other hardware competitors.
So that is why I had abandoned memory hard PoW and investigated a specific instruction in the AES-NI instruction set which appears to have the highest level of optimization in terms of power efficiency, as far as I can estimate. This also meant the PoW hash could be very fast (noting CryptoNight is slow) and would help with asymmetric validation of PoW shares in the DDoS scenario (although in my latest 2015 coin design I can verify Merkle signatures orders-of-magnitude faster than any memory hard PoW hash, so the point becomes irrelevant).
I have recently entertained the thought that the only way to make them (nearly) equal with a memory hard approach would be to have no computation at all, or computation so small that masking memory latency with it would require an inordinate amount of total RAM, or where the memory bandwidth bound limits latency masking so that computation is an insignificant portion of total power consumed. But this may not be easy to accomplish because DRAM is so power efficient. I also noticed an error in my earlier thought process in the rough draft paper: I hadn't contemplated another way to force a serial random walk that much more strongly resists the memory vs. computation tradeoff, and for which the computation would be very tiny relative to memory latency. And now my goal is no longer for them to be equal (besides, even if they were equal, the mining farms have up to an order-of-magnitude cheaper electricity and more efficient amortization of power supplies), but just to be within say an order-of-magnitude,
because I am targeting unprofitable mining, and that efficiency ratio dictates the ratio of miners willing to mine unprofitably to profit-seeking miners that is required for all mining to be unprofitable. This approach might be superior to the specific AES-NI instruction I had designed a PoW hash around in 2014.
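As a rough sketch of the kind of serial random walk I mean above: pointer chasing where each address is derived from the value just read, so step i+1 cannot begin before step i's load returns, and where the per-step computation is a single multiply-and-mask, i.e. tiny relative to a DRAM access. The constants and sizes here are my own illustrative choices, not a concrete PoW proposal:

```python
# Minimal sketch of a serial (pointer-chasing) random walk. The next index
# depends on the value just loaded, so the loads cannot be overlapped within
# one instance; a faster ALU buys almost nothing because the compute per step
# is negligible next to memory latency.

MASK64 = 2**64 - 1

def serial_walk(memory, seed, steps):
    idx = seed % len(memory)
    acc = seed
    for _ in range(steps):
        v = memory[idx]                              # latency-bound dependent load
        acc = (acc * 6364136223846793005 + v) & MASK64  # tiny per-step compute
        idx = acc % len(memory)                      # next address depends on v
    return acc

# Fill memory pseudo-randomly so the walk has no exploitable structure.
mem = [(i * 2654435761) & MASK64 for i in range(1 << 16)]
print(serial_walk(mem, seed=12345, steps=1000))
```

The point of the construction is that the only way to speed it up is to lower memory latency itself; trading computation for memory doesn't help when there is almost no computation to trade against.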
But the main reason I revisited memory hard PoW is that I can't optimize an AES-NI instruction PoW hash from a browser (no native assembly code, and WebGL on mobile phones means a GPU or ASIC would be orders of magnitude more power efficient and hardware cost efficient), which impacted a marketing strategy I was investigating. However, I concluded last night that the marketing strategy I was contemplating is flawed, because there isn't enough value in the electricity (and memory cost) consumed by the PoW hash to give sufficient value to computing the hash for transferred income (even if unprofitable) on a mobile phone. It turns out that marketing is a much more important problem to solve for crypto currency than PoW. The income transfer would make music download bandwidth profitable, but that is peanuts compared to the value generated by social media advertising and user expenditures. I am starting to bump up against some fundamental marketing barriers, e.g. microtransactions are useless in most every scenario (even music!), mobile is the future, and there isn't enough electricity to monetize anything from PoW (besides, competing on electricity consumed by computation is a losing strategy w.r.t. GPUs and ASICs). The money to be made in social media isn't from monetizing the CPU/DRAM nor the users' Likes as Synereo is attempting, but from creating value for the users (then profiting on the advertising and/or the users' expenditures on what they value). This unfortunately has nothing to do with crypto currency, although it is very interesting to me and requires a lot of fun programming & design, so I am starting to get frustrated with crypto currency as being an enormous time waster for me. Three years of researching and still not finding a sure project to work on in crypto currency that doesn't just devolve to a P&D to speculators, because crypto currency has no significant adoption markets (subject to change as I do my final thinking on these fundamental questions before I quit crypto).
Bottom line is your Cuckoo Cycle PoW can be parallelized, and so the GPU can employ more computation to mask some of its slower main memory latency, up to the memory bandwidth bound, with a guesstimated power efficiency advantage of perhaps 2 - 3X, although you have only estimated that from TDP and not actually measured it. It behoves you to attempt to measure it, as it is a very important metric for deciding whether to deploy your PoW. The reason it can be parallelized is that the entropy of the data structures is not a serial random walk (which is the point I was trying to make in my poorly worded text from the rough draft of an unpublished paper). Also I am wondering whether, instead of hitting the memory bound with one instance, you tried running multiple instances on the GPU, since I assume you are not using 6GB for each instance? In this case the tradeoff between increased computation (thus masking memory latency) and memory bandwidth may be more favorable to the GPU?
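On the multiple-instances question, a quick back-of-envelope check of which limit binds first might look like this. The per-instance figures are pure guesses on my part, not Cuckoo Cycle measurements:

```python
# Back-of-envelope check (assumed figures, not measurements): how many
# independent solver instances fit on one GPU before memory capacity or
# aggregate random-access bandwidth becomes the limit.

def max_instances(gpu_mem_bytes, inst_mem_bytes,
                  gpu_bw_bytes_s, inst_bw_demand_bytes_s):
    by_capacity  = gpu_mem_bytes // inst_mem_bytes
    by_bandwidth = int(gpu_bw_bytes_s // inst_bw_demand_bytes_s)
    return min(by_capacity, by_bandwidth)

# Hypothetical: 6 GB card, 1 GB per instance, 150 GB/s total bandwidth, and
# each latency-bound instance demanding only ~0.5 GB/s of random access traffic.
print(max_instances(6 * 2**30, 1 * 2**30, 150e9, 0.5e9))  # capacity-limited: 6
```

If the numbers came out capacity-limited like this, multiple instances would leave the bandwidth bound far from saturated, which is exactly the scenario in which the tradeoff could favor the GPU.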
Note that the other paper's proposed PoW can also be parallelized up to the memory bandwidth bound, but afaics they didn't measure relative power efficiency:
"thus the total advantage of GPU over CPU is about the factor of 4, which is even smaller than bandwidth ratio (134 GB/s in GTX480 vs 17 GB/s for DDR3). This supports our assumption of very limited parallelism advantage due to restrictions of memory bandwidth"
tromp, I think you will remember the discussion we had in 2014 about ASICs, where I was claiming it could be parallelized and you were pointing out the bandwidth limitation at the interconnect between IC chips, due to the fact that the memory can't fit on the same silicon die as the computational logic (or that not all the memory can fit on one chip).
So what I am really saying above is that
afaics the fundamentally important invention of lasting value that I have found for crypto is unprofitable mining. I haven't decided yet whether to pursue that to implementation.