Regarding Dagger in retrospect, Vitalik's originally proposed proof-of-work dating from before Ethereum was formed:
...can be mined very quickly by using a large amount of memory. The intent of this is to force ASIC/FPGA designers to be bound by RAM access speeds, while still allowing for verification by thin clients.
is memory bandwidth something that an ASIC can easily overcome?
You misunderstood. Try reading it again more carefully and slowly. The memory bandwidth only limits the speedup on an ASIC, not the electrical efficiency gains. I wrote a very complex statement. Try to grasp what I am saying about Cuckoo Cycle.

AnonyMint/TPTB conjectured that with enough hardware threads, the ASIC loses the constraint of amortizing a single memory latency over only 1 hash computation; instead up to 1024 hashes (for a 4KB page/row) can be coalesced into 1 memory latency (row loading) power cycle. The ASIC thus becomes computation-bound and orders of magnitude more power-efficient. John Tromp had argued that the probability of any one thread being coalesced was 2^-14, due to the ratio of the memory page (row) size to the bank size, but that applies only to any one hardware thread, and only if the memory footprint of the Cuckoo table is greater than the memory bank size. His argument falls away when thousands of hardware threads are run and/or smaller memory footprints are chosen. That is why TPTB_need_war urged tromp to buy a Kill-A-Watt meter and measure, but even that won't isolate the metrics needed to estimate the advantage of an ASIC. A more sophisticated bench test, and probably some hardware simulation, would be needed.
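For readers, here is a back-of-envelope sketch in C of the amortization arithmetic being claimed. Every constant (row size, bytes touched per hash, energy figures) is an illustrative assumption, not a measurement:

```c
/* Back-of-envelope sketch of the coalescing argument above.
 * All constants are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    const double row_bytes      = 4096;  /* assumed DRAM page/row size */
    const double entry_bytes    = 4;     /* assumed bytes touched per hash */
    const double row_energy_nj  = 20.0;  /* assumed energy per row activation */
    const double hash_energy_nj = 1.0;   /* assumed energy per siphash on ASIC */

    double entries_per_row = row_bytes / entry_bytes;   /* 1024 */

    /* Regime A: one hash per row activation (few threads, random access). */
    double a = row_energy_nj + hash_energy_nj;

    /* Regime B: enough threads that accesses to one open row coalesce,
     * amortizing one activation over all entries in the row. */
    double b = row_energy_nj / entries_per_row + hash_energy_nj;

    printf("entries/row: %.0f\n", entries_per_row);
    printf("energy/hash, uncoalesced: %.2f nJ\n", a);
    printf("energy/hash, coalesced:   %.2f nJ\n", b);
    printf("ratio: %.1fx\n", a / b);
    return 0;
}
```

Under these made-up numbers the coalesced regime comes out roughly 20x more power-efficient per hash, which is the shape of the claimed ASIC advantage; only real measurement or simulation can pin down the actual constants.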
Cuckoo spends only 1/3 of its time doing hash computations, which take little power compared to DRAM switching rows. So the possible gain by ASIC hashing is less than 33%.
I didn't measure that. I estimated that doing 90 64-bit operations (one siphash call) is cheaper than operating the (typically 16k) sense amplifiers to read a new bitline. But I could be wrong about that... I'm assuming a Cuckoo size much larger than 16k of course.
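For readers, the siphash call being weighed against a row activation looks like this: a compact SipHash-2-4 on a single 64-bit input, matching the shape used in Cuckoo Cycle (key derived from the block header, input the nonce). Counting the adds, rotates and xors below gives roughly the 90 64-bit operations mentioned:

```c
#include <stdint.h>

#define ROTL(x,b) ((uint64_t)(((x) << (b)) | ((x) >> (64 - (b)))))
#define SIPROUND do {                                       \
    v0 += v1; v1 = ROTL(v1,13); v1 ^= v0; v0 = ROTL(v0,32); \
    v2 += v3; v3 = ROTL(v3,16); v3 ^= v2;                   \
    v0 += v3; v3 = ROTL(v3,21); v3 ^= v0;                   \
    v2 += v1; v1 = ROTL(v1,17); v1 ^= v2; v2 = ROTL(v2,32); \
  } while (0)

/* SipHash-2-4 specialized to one 64-bit input word. */
uint64_t siphash24(uint64_t k0, uint64_t k1, uint64_t nonce) {
    uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
    uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
    uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
    uint64_t v3 = k1 ^ 0x7465646279746573ULL ^ nonce;
    SIPROUND; SIPROUND;                       /* 2 compression rounds */
    v0 ^= nonce;
    v2 ^= 0xff;
    SIPROUND; SIPROUND; SIPROUND; SIPROUND;   /* 4 finalization rounds */
    return v0 ^ v1 ^ v2 ^ v3;
}
```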
Edit#2: and here Vlad and the Ethereum team couldn't understand what AnonyMint had considered and quickly dismissed in 2014: by definition a random circuit has no structure and thus can't be relied upon to have any proof-of-work qualities. Duh!
How extensively is this specific NP problem studied?
The closest seems to be the study of tradeoffs in undirected graph traversal, such as
http://www.cs.toronto.edu/~bor/Papers/time-space-tradeoffs-undirected-graph-traversal-graph-automata.pdf
which states:
Although it would be desirable to show a tradeoff for a general model of computation
such as a random access machine, obtaining such a tradeoff is beyond the
reach of current techniques. Thus it is natural to consider a "structured" model
(Borodin [16]), that is, one whose basic move is based on the adjacencies of the
graph, as opposed to one whose basic move is based on the bits in the graph's
encoding.
My model is rather odd in that you cannot easily figure out the neighbours of a node,
since edges are generated from nonces, and in the worst case you have to map over
all nonces to find the edges adjacent to a given node.
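For readers, a sketch of why neighbour lookup is expensive: edges exist only implicitly as hashes of nonces (the "micro nonces" discussed below), so finding the edges incident to a node means re-enumerating all nonces. This reuses siphash24 from the sketch above; NEDGES and the bipartite encoding follow the shape of the reference implementation but are illustrative here:

```c
#include <stdint.h>

#define NEDGES (1u << 20)            /* illustrative graph size */

uint64_t siphash24(uint64_t k0, uint64_t k1, uint64_t nonce); /* see above */

/* Endpoint of edge `nonce` in partition uorv (0 or 1) of the bipartite
 * graph; the partition bit is kept in the least significant bit. */
uint64_t sipnode(uint64_t k0, uint64_t k1, uint64_t nonce, int uorv) {
    return (siphash24(k0, k1, 2*nonce + uorv) % NEDGES) << 1 | uorv;
}

/* Count the edges incident to node v: there is no adjacency structure,
 * so the only general way is to map over every nonce. */
uint64_t degree(uint64_t k0, uint64_t k1, uint64_t v) {
    uint64_t deg = 0;
    for (uint64_t nonce = 0; nonce < NEDGES; nonce++)
        if (sipnode(k0, k1, nonce, v & 1) == v)
            deg++;
    return deg;
}
```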
The paper establishes a quadratic lower bound on the
product of time and space for graph traversal, which I interpret
as good evidence for the edge trimming in my Cuckoo Cycle being optimal,
as it uses just 1 bit per edge.
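For readers, a hedged sketch of that edge trimming (the technique from David Andersen discussed further below): 1 bit per edge for the whole edge store, plus a 2-bit saturating degree counter per node; each round kills edges whose endpoint has degree less than 2, since such edges cannot lie on a cycle. Sizes and helper names are illustrative, with sipnode from the sketch above:

```c
#include <stdint.h>
#include <string.h>

#define NEDGES (1u << 20)            /* illustrative; nodes per partition too */

uint64_t sipnode(uint64_t k0, uint64_t k1, uint64_t nonce, int uorv);

static uint8_t alive[NEDGES/8];      /* 1 bit per edge: the whole edge store */
static uint8_t degs[NEDGES/4];       /* 2-bit saturating counter per node   */

void init_edges(void) { memset(alive, 0xff, sizeof alive); }  /* all alive */

static int  is_alive(uint32_t e) { return (alive[e>>3] >> (e&7)) & 1; }
static void kill(uint32_t e)     { alive[e>>3] &= (uint8_t)~(1u << (e&7)); }

static void bump(uint32_t n) {       /* saturating increment of node degree */
    uint32_t i = n >> 2, s = 2*(n & 3);
    if (((degs[i] >> s) & 3) < 2) degs[i] += (uint8_t)(1u << s);
}
static int deg(uint32_t n) { return (degs[n>>2] >> (2*(n&3))) & 3; }

/* One round of trimming on partition uorv: count degrees of surviving
 * edges' endpoints, then kill edges whose endpoint has degree < 2. */
void trim_round(uint64_t k0, uint64_t k1, int uorv) {
    memset(degs, 0, sizeof degs);
    for (uint32_t e = 0; e < NEDGES; e++)
        if (is_alive(e)) bump((uint32_t)(sipnode(k0, k1, e, uorv) >> 1));
    for (uint32_t e = 0; e < NEDGES; e++)
        if (is_alive(e) && deg((uint32_t)(sipnode(k0, k1, e, uorv) >> 1)) < 2)
            kill(e);
}
```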
They had an earlier paper as well:
http://homes.cs.washington.edu/~beame/papers/wag.pdf

I found it helpful to also read the abstracts of these two:
http://www2.tcs.ifi.lmu.de/~schoepp/Docs/formalcr.pdf
http://www.sciencedirect.com/science/article/pii/S0304397500000190

Your Cuckoo Cycle is a
log n space representation which discards the local information on the bidirectionality of the edges (i.e. the undirected graph), so as to treat it as a DAG. The undirected quality of the actual graph is hidden from your representation, except by global re-enumeration of the edges (via the "micro" nonces). For readers: tromp never explicitly defines this terminology, but one can deduce he uses "macro nonce" to mean the nonce for the block, while the "micro nonces" are the enumeration of the edges in the graph.
You have not shown that Cuckoo Cycle can guarantee that a representation which stores the graph in linear space won't admit more efficient solutions, perhaps only on an ASIC and/or GPU, within the log(mn) time-space product lower bound tradeoff. For example, it may be more efficient to decrease the traversal workspace and apply a probabilistic or non-deterministic JAG, because that may make it easier for fewer threads to coalesce on page-row latency, while any increased hash computations are relatively insignificant on an ASIC. Or an algorithm exploiting the undirected graph may find more cycles, more efficiently.
The edge trimming from David Andersen appears to throw away information at the representation (storage of the edges) stage. The paper you cited is referring to compression of the traversal workspace, i.e. the storage of the state machine for visiting the graph: the "pebbles" and "Q state" in the JAG model.
Also, the traversal workspace in a cycle-finding proof-of-work such as Cuckoo Cycle is amortized over the numerous cycles found, which is a different algorithm than the single-shot s-t reachability traversal considered by the research papers cited above.
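For readers, a simplified sketch of how cycles are actually found on the trimmed graph, following the shape of the reference solver: a forest of graph edges stored as parent pointers, where an edge joining two nodes already in the same tree closes a cycle. The real miner also records the nonces along the cycle and looks for length exactly PROOFSIZE; names and sizes here are illustrative:

```c
#include <stdint.h>

#define MAXNODES  (1u << 21)
#define MAXPATH   8192
#define PROOFSIZE 42          /* Cuckoo Cycle looks for 42-cycles */

static uint32_t parent[MAXNODES];     /* parent[v] == v at a tree root */

void init_forest(void) {              /* must run before any add_edge */
    for (uint32_t v = 0; v < MAXNODES; v++) parent[v] = v;
}

static int path(uint32_t u, uint32_t *us) {   /* nodes from u to its root */
    int n = 0;
    us[0] = u;
    while (parent[u] != u && n + 1 < MAXPATH)
        us[++n] = u = parent[u];
    return n;                          /* number of edges on the path */
}

/* Process surviving edge (u,v); returns the length of the cycle it closes,
 * or 0 if it merely links two trees. A return of PROOFSIZE is a solution. */
int add_edge(uint32_t u, uint32_t v) {
    static uint32_t us[MAXPATH], vs[MAXPATH];
    int nu = path(u, us), nv = path(v, vs);
    if (us[nu] == vs[nv]) {            /* same root: (u,v) closes a cycle */
        /* strip the common suffix to find where the two paths meet */
        while (nu && nv && us[nu-1] == vs[nv-1]) { nu--; nv--; }
        return nu + nv + 1;            /* edges u..meet, v..meet, plus (u,v) */
    }
    if (nu < nv) {                     /* reverse the shorter path, link trees */
        for (int i = nu; i > 0; i--) parent[us[i]] = us[i-1];
        parent[u] = v;
    } else {
        for (int i = nv; i > 0; i--) parent[vs[i]] = vs[i-1];
        parent[v] = u;
    }
    return 0;
}
```

Note how this forest is reused across every edge processed, i.e. the workspace is amortized over all cycles found, rather than being rebuilt per single-shot traversal as in the s-t reachability setting.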