I have created a new proof-of-work for use in BitShares that I hope is entirely GPU proof and ASIC resistant while being as fast as possible to verify. I am working to secure an additional 60 BTC to back this challenge.
The algorithm that I am using is posted on GitHub here:
https://github.com/bytemaster/bitshares/blob/master/src/proof_of_work.cppThe bounty can be won if you can get any GPU to perform 25% faster than the most optimized CPU code I can produce running on the fastest CPU I can find. The test will be based upon the total number of unique hashes that can be performed in 10 minutes.
This proof-of-work is computationally difficult even for a single hash,
but must be so to prevent optimizations to the required memory foot print.
The maximum level of parallelism achievable per GB of RAM is 8, and the highest
end GPUs now have 4 GB of ram which means they could in theory support 32
parallel executions of this proof-of-work.
On GPU's you only tend to get 1 instruction per 4 clock cycles in a single
thread context. Modern super-scalar CPU's can get more than 1 instruction
per clock cycle and CityHash is specifically optomized to take advantage of this.
In addition to getting more done per-cycle, CPU's have close to 4x the clock
frequency.
Based upon these characteristics alone, I estimate that a CPU can execute the
serial portions of this algorithm at least 16x faster than a GPU which means
that an 8-core CPU should easily compete with a 128 core GPU. Fortunately,
a 128 core GPU would require 16 GB of RAM. Note also, that most GPUs have
less than 128 'real' cores that are able to handle conditionals.
Further more, GPU's are not well suited for branch misprediction and code
must be optimized to avoid branches as much as possible.
Lastly this algorithm takes advantage of a hardware instruction that is
unlikely to be included in GPUs (Intel CRC32). The lack of this hardware
instruction alone is likely to give the CPU an order of magnitude advantage
over the GPUs. That said, the algorithm does not depend solely on the performance of this instruction.
I am speculating that the number of CPU cores will grow at about the same rate as available RAM in GPUs and therefore GPUs will never gain the upper hand over CPUs for this hash.
I am also interested in any feedback on how to improve the algorithm further, make it run even faster on CPU's without compromising GPU difficulty. It currently takes about 0.75 seconds on a 2.3 Ghz Core i7 Macbook Pro for a single hash.
*edit* If you intend on tackling this challenge and actually writing the code to prove it, please contact me so that we can have a meeting of the minds regarding what is required and we can lock in the algorithm.