So, as expected (well, I had this same thought in the early days of the development of Equihash), memory bandwidth turns out to be the performance limiter in the end. And so you discovered quite quickly in your development cycle that you can run the solver significantly faster on a significantly faster memory architecture and memory bus, i.e., on a video card.
I know that 16-32 GB of memory is becoming commonplace, but it occurs to me that a new target for these kinds of proof-of-work algorithms could be disk storage, and this can be forced by requiring more than 32 GB of memory. Very few people have much more than 32 GB of memory in their system, whereas most have more than 32 GB of storage on disk. Presumably that would naturally make NVMe SSDs the vessel for this.
So, I have started reading into your algorithm, and it occurs to me that you can push this out of the memory-bandwidth box and onto disk simply by requiring significantly more than 8 GB of memory. Very few video cards have more than 8 GB of memory, so to be safe, targeting 16 GB would put it outside the range of GPU processing and bring it back down to the CPU. Pushing the memory requirement beyond 32 GB would shift the performance bottleneck onto disk.
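To make the target concrete, here is a rough back-of-envelope I sketched in Python on how the footprint scales with the edge-count parameter (EDGEBITS, if I'm reading your source correctly). The bytes-per-edge figures are my own guesses, not numbers taken from any actual miner:

```python
# Rough sketch, not a benchmark: solver memory footprint as a function of the
# edge-count parameter, assuming a fixed cost per edge. Both bytes-per-edge
# values below are my own guesses.
def solver_memory_gib(edgebits, bytes_per_edge):
    edges = 1 << edgebits              # number of edges doubles with each extra bit
    return edges * bytes_per_edge / 2**30

for edgebits in range(27, 36):
    lean = solver_memory_gib(edgebits, 0.125)   # ~1 bit per edge (guess at a lean bound)
    fat  = solver_memory_gib(edgebits, 8)       # ~8 bytes per edge (guess at a fast miner)
    print(f"edgebits={edgebits}: {lean:7.2f} GiB to {fat:7.2f} GiB")
```

The point being that each extra edge bit doubles the footprint, so only a handful of extra bits are needed to blow past the 8 GB of a typical video card and then the 32 GB of a typical desktop.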
I haven't read as deeply into the algorithm as I perhaps could have, but as I gather from an initial reading, you can force the memory requirement upwards by requiring longer cycles in the graph. 42 is a lot, but as you have found, around 3 GB is enough to store random graphs that yield this nice (funny) cycle length. However, what would happen if you raised the minimum to, say, 56 nodes in a cycle, or 64? I would think the odds of finding a solution would shrink by a power of two for each node added to the minimum cycle, more likely ending up powers of ten rarer than the 42-node solutions.
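To test that intuition I hacked together a toy simulation. It is emphatically not your miner: edge endpoints come from Python's random module rather than siphash, the graphs are tiny, and the cycle check just follows stored pointers to a root, the way your simple solver appears to. But it is enough to tally how often cycles of various lengths get closed:

```python
# Toy Monte Carlo, NOT the reference miner: random edge endpoints instead of
# siphash, tiny graphs, and a pointer-following cycle check. It tallies the
# lengths of the cycles that get closed while inserting edges.
import random
from collections import Counter, defaultdict

def cycle_lengths(num_edges, nodes_per_side, seed):
    """Insert random bipartite edges; return the length of every cycle closed."""
    rng = random.Random(seed)
    cuckoo = defaultdict(int)      # node -> node it points to; 0 means "no pointer"
    lengths = []
    for _ in range(num_edges):
        # u-side nodes are even, v-side odd; the +2/+3 keeps 0 free as the nil value
        u0 = 2 * rng.randrange(nodes_per_side) + 2
        v0 = 2 * rng.randrange(nodes_per_side) + 3
        us, vs = [u0], [v0]
        u, v = cuckoo[u0], cuckoo[v0]
        while u:                   # walk u's pointer chain to its root
            us.append(u); u = cuckoo[u]
        while v:                   # walk v's pointer chain to its root
            vs.append(v); v = cuckoo[v]
        if us[-1] == vs[-1]:       # same root: this edge closes a cycle
            nu, nv = len(us) - 1, len(vs) - 1
            skip = min(nu, nv)
            nu, nv = nu - skip, nv - skip
            while us[nu] != vs[nv]:    # advance in lockstep to where the paths meet
                nu += 1; nv += 1
            lengths.append(nu + nv + 1)
            continue               # like the simple solver: don't store this edge
        # different roots: reverse the shorter path and hang it off the other tree
        if len(us) < len(vs):
            for i in range(len(us) - 1, 0, -1):
                cuckoo[us[i]] = us[i - 1]
            cuckoo[u0] = v0
        else:
            for i in range(len(vs) - 1, 0, -1):
                cuckoo[vs[i]] = vs[i - 1]
            cuckoo[v0] = u0
    return lengths

tally = Counter()
for trial in range(20):            # 20 toy graphs of 2^16 edges each
    tally.update(cycle_lengths(1 << 16, 1 << 16, seed=trial))
print(sorted(tally.items()))
```

Running it with larger graphs or more trials should show whether 56- or 64-cycles really are orders of magnitude rarer than 42-cycles or only somewhat so; I'd trust your numbers over this toy either way.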
As an interesting aside, these kinds of large graph tables are routinely used in rasterisation and ray-tracing algorithms; the former has pretty much been maxed out, to the point that photorealism requires the equivalent of a pair of 1080 Tis at anything above 3840x2160 resolution and anything more than 60 frames per second.
I am looking into this because I am seeking a PoW algorithm that will not fall to GPUs within even 12 months. So I am going to explore Cuckoo Cycle, but with a massively increased requirement on the number of nodes forming a cycle in the graph generated from the seed. I want to see if I can force my machine into swap and turn the NVMe drive into the primary bottleneck, which should drastically reduce the solution rate. At the same time, an NVMe drive is not that expensive, yet its bandwidth is well below that of the PCIe bus it sits on, and far below that of the memory bus.
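Before touching the solver itself, I want to measure the gap I'm hoping to lean on. Something like the throwaway probe below (my own script, nothing to do with your code): random 4 KiB reads from an in-memory buffer versus from a file on the NVMe drive accessed through mmap, which is roughly how swap or an explicitly disk-backed table would behave. The path and sizes are placeholders, and if the file fits in free RAM the page cache will hide most of the disk:

```python
# Throwaway probe: compare random 4 KiB reads from RAM vs from an mmap'd file
# on disk. Path and sizes are placeholders; page-cache effects will flatter the
# disk numbers unless the file is much larger than free RAM.
import mmap, os, random, time

CHUNK = 4096
N_READS = 50_000

def bench(read_at, size, label):
    rng = random.Random(1)
    t0 = time.perf_counter()
    total = 0
    for _ in range(N_READS):
        off = rng.randrange(size // CHUNK) * CHUNK
        total += len(read_at(off))
    dt = time.perf_counter() - t0
    print(f"{label}: {total / dt / 2**20:8.1f} MiB/s over {N_READS} random {CHUNK}-byte reads")

# In-memory case: a plain bytearray (shrink ram_size if 1 GiB is too much).
ram_size = 1 << 30
ram = bytearray(ram_size)
bench(lambda off: ram[off:off + CHUNK], ram_size, "RAM      ")

# On-disk case: a scratch file on the NVMe volume, accessed through mmap.
path = "scratch.bin"                   # placeholder: put this on the NVMe drive
file_size = 4 << 30                    # ideally several times larger than free RAM
block = os.urandom(1 << 20)            # write real data so reads aren't satisfied by holes
with open(path, "wb") as f:
    for _ in range(file_size >> 20):
        f.write(block)
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    bench(lambda off: mm[off:off + CHUNK], file_size, "NVMe mmap")
    mm.close()
os.remove(path)
```

Whatever ratio falls out of that is the margin an IO-hard parameterisation would be resting on.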
Onward and upward... to move from memory-hard to IO-hard
PS: I see that Cuckoo Cycle is an implementation of a graph-search puzzle: finding a cycle through a given number of nodes in a graph. It's quite interesting because path-finding puzzles of this kind were at the centre of what enabled internet routing. Such a search naturally requires a certain minimum of processing and memory, even over a static graph such as the (relatively static) set of internet routes. It occurs to me that a next generation beyond loading up storage bandwidth would be to bind the solution to a network graph, which would naturally increase latency and greatly reduce the throughput of finding solutions, though it also opens the door to Byzantine attacks that cut the paths and prevent solutions from being found, since solvers would depend on cooperative nodes responding to them. Just a thought.