In particular, all currently popular hashing algorithms completely ignore the very fast floating-point units in GPUs and CPUs. Eventually, future generations of FPGAs will start including FPU blocks, very much like they started including DSP blocks years ago.
But currently the FPU performance gap is quite wide.
Picking winners in the ASIC game is relatively easy if one isn't afraid of occasionally changing the source code. When properly designed, it doesn't even need to be a hard fork; just the version number of the PoW function needs to be explicitly recorded.
At the moment I don't have time to write a longer discussion, so for now I'll repost what I wrote in another thread. We'll see which of those new threads gets the most intelligent discussion.
Bytom folks are a good example. Their goal was not to be ASIC-proof in general, but to make sure that their own ASIC is the one that is fastest at computing their hash. So they wrote a hash function that uses lots of floating-point calculations in exactly the way that their AI-oriented ASIC does. The hard part of understanding Bytom's "Tensority" algorithm is finding exact information about the actual ASIC chips that are efficient at doing those calculations.
But the general idea is very simple: if you don't want your XYZ devices to become uncompetitive, play to their strengths when designing the hash function.
For XYZ==GPU, start with the GPUs' strengths. I haven't studied the recent GPU universal-shader architectures, but the original main idea was to optimize the particular floating-point computation used in 3D graphics with homogeneous coordinates: AX=Y, where A is a 4*4 matrix and X is a 4*1 vector.
For XYZ==CPU made by Intel/AMD using the x86 architecture, again start with their strengths. They have a unique FPU with a unique 10-byte floating-point format and a unique 8-byte BCD decimal integer format. Additionally, they have dedicated hardware to compute various transcendental functions. So use a lot of those, doing chaotic irreducible calculations like https://en.wikipedia.org/wiki/Logistic_map or https://en.wikipedia.org/wiki/Lorenz_system . Of course one could write an emulation of those formats using quad-precision floating point (pairs of double-precision floats), but that would take many months.
Those months buy you additional time to research more strengths of your GPUs or CPUs. Use them in a hard fork to ensure that the preferred vendor of your mining hardware remains Intel/AMD/Nvidia.