OP is on the right track.
What I can say is this:
1. X11 is trivial to obliterate with FPGAs and ASICs, and will be obliterated in the near future if X11 coins continue to appreciate. X11 requires only compute resources, which are the cheapest in terms of die area on FPGAs and ASICs. A good rule of thumb is that 32 bits of logic costs roughly the same as 32 bits of RAM. X11 plays to the strengths of FPGAs and ASICs because it requires no RAM (at most, flip-flops for pipeline state management).
2. Designs that focus on branching also play to the strengths of ASICs and FPGAs, because they evaluate branches in parallel. Choosing the next hash function based on the current hash state doesn't hurt performance: you can saturate both hardware branches by temporarily "saving" work in the pipeline and then "releasing" it so that each subsequent hash unit is always fully occupied. This is easier than it sounds because a cryptographically secure hash function is equally likely to take each branch (otherwise it would be biased, which would lower entropy and make it cryptographically unsound).
3. When engineering ASICs, going from most expensive to least: RAM > I/O bandwidth* > Compute
* I/O bandwidth is really a proxy for RAM, think DDR.
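To make point 2 concrete, here is a toy Python sketch of how hardware "takes both branches": compute both candidate hash functions, then mux the result on a state bit. The two stand-in hash functions (SHA-256 and truncated SHA-512) are my assumptions for illustration, not part of any real chained-hash coin:

```python
import hashlib

def h_a(data: bytes) -> bytes:
    # Stand-in for one branch's hash function (assumption: SHA-256).
    return hashlib.sha256(data).digest()

def h_b(data: bytes) -> bytes:
    # Stand-in for the other branch (assumption: truncated SHA-512).
    return hashlib.sha512(data).digest()[:32]

def hardware_style_step(state: bytes) -> bytes:
    # Software runs one branch at a time, but hardware keeps both hash
    # pipelines busy and muxes the result, so we model that here:
    # compute both candidates, then select on a bit of the state.
    candidate_a = h_a(state)
    candidate_b = h_b(state)
    take_a = (state[0] & 1) == 0  # branch decision from current state
    return candidate_a if take_a else candidate_b

state = b"\x00" * 32
for _ in range(4):
    state = hardware_style_step(state)
print(state.hex())
```

Because the branch bit is unbiased, each pipeline is busy about half the time, and buffering work between stages keeps both fully occupied.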
ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (e.g. X11).

Recommendations:
Scrypt was a good start. Scrypt does the following things well:
1. Focuses on the economics of ASICs and FPGAs (i.e. RAM > Compute)
2. Amplifies RAM cost with cryptographic operations between RAM accesses. Requiring 64 clock cycles of Salsa20/8 between memory accesses amplifies memory requirements since the memory is tied up until the hash calculation completes.
3. Uses unpredictable state transitions to determine where to read next.
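The three points above can be sketched in a few lines of Python. This is a simplified ROMix-shaped loop, with SHA-256 standing in for Salsa20/8 purely for brevity (an assumption, not what Scrypt actually uses):

```python
import hashlib

def H(data: bytes) -> bytes:
    # Stand-in mixing function; real Scrypt uses Salsa20/8 here.
    return hashlib.sha256(data).digest()

N = 1024

# Phase 1: fill memory sequentially (this is the predictable part).
state = H(b"seed")
RAM = []
for i in range(N):
    RAM.append(state)
    state = H(state)

# Phase 2: N unpredictable reads. The index depends on the evolving
# state, so nothing can know which words will be needed ahead of time,
# forcing all N words to stay resident.
for i in range(N):
    j = int.from_bytes(state[:4], "little") % N  # data-dependent index
    state = H(bytes(a ^ b for a, b in zip(state, RAM[j])))
print(state.hex())
```

The hash work between accesses (point 2) is what ties each RAM word up, and the data-dependent read index (point 3) is what forces the whole table to exist.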
What Scrypt didn't do well, and what led to its downfall:
1. Used predictable state transitions to determine where to write next. Because memory was filled in order from 0-1023, it enabled a compute/RAM tradeoff: you could skip every other memory access and compute it instead. I.e. you could store values 0, 2, 4, ..., 1022 (ignoring 1, 3, 5, ..., 1023), and if something tried to read position 5, look up position 4 and recover 5 by running 4 through Salsa20/8 once. This significantly weakened the memory cost (this is what the --lookup-gap GPU miner setting does).
2. Used too few cycles between RAM accesses. Salsa20/8 ties up memory for only ~64 clocks per memory access. Increasing the number of rounds would increase ASIC/FPGA memory requirements without impacting cache requirements on CPU miners.
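The compute/RAM tradeoff from point 1 can be sketched directly. SHA-256 again stands in for Salsa20/8 (an assumption for brevity); the point is that the sequential fill makes every odd entry recomputable from the even entry before it:

```python
import hashlib

def H(data: bytes) -> bytes:
    # Stand-in for Salsa20/8 (assumption: SHA-256 for brevity).
    return hashlib.sha256(data).digest()

N = 1024
GAP = 2  # the --lookup-gap setting: store only every GAP-th entry

# Full table, as an honest Scrypt implementation would keep it.
state = H(b"seed")
full = []
for i in range(N):
    full.append(state)
    state = H(state)

# Sparse table: keep indices 0, 2, 4, ... and recompute the rest,
# halving memory at the cost of a little extra hashing.
sparse = {i: v for i, v in enumerate(full) if i % GAP == 0}

def read(i: int) -> bytes:
    # Fall back to the nearest stored entry and hash forward.
    base = i - (i % GAP)
    v = sparse[base]
    for _ in range(i % GAP):
        v = H(v)
    return v

assert read(5) == full[5]  # position 5 recovered from stored position 4
```

This works only because `full[i + 1] = H(full[i])`, i.e. because the write order is predictable. Break that property and the tradeoff disappears.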
How to implement a serious CPU-only miner:
1. Don't do something silly like X11, which only adds implementation complexity and no real ASIC cost.
2. Fix Scrypt's predictable memory writes. Right now Scrypt looks like:

for (i = 0; i < 1024; ++i)
{
    RAM[i] = state          // Write position is sequential and predictable
    state = Salsa20/8(state)
}

// BETTER: fill memory unpredictably
for (i = 0; i < 1024; ++i)
{
    RAM.Insert(state.e % (i + 1), state)  // Insert position depends on state
    state = Salsa20/8(state)
}
The one-line change above creates a data-structure problem we have no good solution for: it requires O(1) memory access and O(1) memory insert. The best-known structure for those requirements is the hashtable, but hashtables work best when there are no collisions and keys are immutable.
RAM.Insert() is a worst case for hashtables, because every insert collides and shifts the keys of all items after the insert position. Alternative data structures have even worse amortized performance (trees, for example, are O(log N) for both access and insert).
Also - the modulus has a variable (well, incrementing) denominator, which can get expensive in hardware.
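A toy Python version of the unpredictable fill, with SHA-256 standing in for Salsa20/8 and a plain list playing the role of RAM (both assumptions for illustration; the modulus is taken by i + 1 to cover the i = 0 case). The list pays O(n) per insert precisely because every insert shifts the indices of everything after it - which is the data-structure problem described above:

```python
import hashlib

def H(data: bytes) -> bytes:
    # Stand-in for Salsa20/8 (assumption: SHA-256 for brevity).
    return hashlib.sha256(data).digest()

N = 1024
state = H(b"seed")
RAM = []
for i in range(N):
    # Insert position depends on the evolving state. A Python list
    # pays O(n) per insert because all later entries must shift -
    # the same cost that defeats a naive hashtable here.
    pos = int.from_bytes(state[:4], "little") % (i + 1)
    RAM.insert(pos, state)
    state = H(state)
print(len(RAM))  # 1024
```

Note that after the loop, an entry's final position tells you nothing about when it was written - so the store-half/recompute-half trick no longer has a predictable neighbor to hash forward from.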
3. Switch from Salsa20/8 to something like SHA-256 between Scrypt memory accesses. As mentioned in Scrypt issue #2 above, increasing the clocks between memory accesses increases FPGA/ASIC memory requirements without impacting CPU cache requirements. SHA-256 is also convenient because it uses 32-bit internal registers, which keeps it fast on both x86 and ARM devices.
4. Consider floating-point crypto functions; FP operations are very hard to implement on FPGAs/ASICs because they require a significant amount of R&D or licensed IP. Coupled with the above, you'd be well on your way.
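Point 3's round-count knob can be sketched as follows - iterating SHA-256 between memory accesses, where each added round lengthens the time a RAM word stays tied up on an ASIC without growing the CPU's working set (the function name and round counts are illustrative assumptions):

```python
import hashlib

def mix(state: bytes, rounds: int) -> bytes:
    # Iterate SHA-256 'rounds' times between memory accesses. Each
    # extra round is pure sequential compute: an ASIC pipeline must
    # hold its RAM word for longer, but a CPU's cache footprint is
    # unchanged because no new memory is touched.
    for _ in range(rounds):
        state = hashlib.sha256(state).digest()
    return state

state = b"\x00" * 32
fast = mix(state, 1)
slow = mix(state, 64)
assert mix(mix(state, 32), 32) == slow  # rounds compose sequentially
```

The composition assert is the important property: the rounds are inherently sequential, so they can't be parallelized away - only waited out while the memory sits occupied.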
Also - I'm not saying everything here is perfect, but I'd love to have a serious discussion if there's interest. Finally, I wrote this long post quickly, so please excuse any grammar errors or typos.
EDIT: clarity