I’ll see if I can dig up recent ones. A lot of people pull up the old CUDA vs FPGA academic papers that are focused on very old architectures.
Thanks in advance.
I'll put the blame squarely in the vendor's lap. Intel, which has now acquired Altera, still lists "An Independent Analysis of Altera's FPGA Floating-point DSP Design Flow" from 2011 as the only source mentioning "accuracy". I've found several other, newer papers, but they all repeat the same old bullshit methodology: only using single precision and only estimating the errors. At most they'll show fused multiply-add, as if double precision or
https://en.wikipedia.org/wiki/Kahan_summation_algorithm never existed, or didn't apply.
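For anyone who hasn't seen it, here is a minimal sketch of compensated (Kahan) summation in plain C; the function and variable names are just for illustration, and note that -ffast-math style optimizations would defeat it:

```c
#include <stddef.h>

/* Compensated (Kahan) summation in single precision: the rounding error of
 * each addition is carried in `c` instead of being silently dropped.       */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c   = 0.0f;            /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;
        float t = sum + y;       /* low-order bits of y are lost in this add... */
        c = (t - sum) - y;       /* ...and recovered here for the next iteration */
        sum = t;
    }
    return sum;
}
```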
As to GPU floating point performance, you don’t need a benchmark. The figures are right in the ISA documents. Single precision TFLOPs are usually given in terms of FMA unit operations though, which is a bit misleading.
FPGAs are a bit harder to get TFLOPS numbers for given their flexibility, but since most of the performance actually comes from the DSP blocks you can calculate those. If you've never read them, Xilinx gives extremely detailed performance metrics for every chip for most IP blocks, as well as frequency numbers for the hard blocks in the DC and AC switching characteristics docs. Agner Fog publishes very detailed instruction tables covering the performance of those units on just about every CPU/APU available as well.
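As a back-of-the-envelope sketch of that arithmetic (the device numbers below are placeholders I made up, not quotes from any datasheet):

```c
#include <stdio.h>

int main(void)
{
    /* GPU: peak SP TFLOPS = shader cores * clock * 2, because each FMA
     * (a*b + c) is counted as two floating-point operations.           */
    double gpu_cores    = 4096;      /* hypothetical shader/CUDA core count */
    double gpu_clk_ghz  = 1.5;       /* hypothetical boost clock            */
    double gpu_tflops   = gpu_cores * gpu_clk_ghz * 2.0 / 1000.0;

    /* FPGA: peak comes almost entirely from the DSP blocks, so the same
     * kind of estimate is DSP blocks * fmax * ops per block per cycle.  */
    double dsp_blocks   = 5000;      /* hypothetical DSP block count        */
    double dsp_fmax_ghz = 0.7;       /* hypothetical achievable fmax        */
    double ops_per_dsp  = 2.0;       /* e.g. one multiply-add per cycle     */
    double fpga_tflops  = dsp_blocks * dsp_fmax_ghz * ops_per_dsp / 1000.0;

    printf("GPU  peak: %.1f TFLOPS (FMA counted as 2 ops)\n", gpu_tflops);
    printf("FPGA peak: %.1f TFLOPS (DSP blocks only)\n", fpga_tflops);
    return 0;
}
```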
The funny thing is that the closest-to-honest comparison of Xilinx's FP I've found is on Altera's own site:
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf
The main resource CPUs and GPUs have is instruction flexibility. Until a PoW hash truly requires most of the full instruction set to be supported to implement it, it will be hard to keep out ASICs/FPGAs.
I think this claim is true, but somewhat pessimistic. I think it would be fairly easy once a wider range of cryptocurrency programmers start to appreciate floating point and
https://en.wikipedia.org/wiki/Chaos_theory as useful building blocks for proof-of-work algorithms.
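To make that concrete, here is a toy sketch of my own (not any existing PoW) that iterates the logistic map in double precision as a mixing step; a real design would need far more care, this just shows the idea:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy example only: seed the logistic map x -> r*x*(1-x) from a nonce and
 * iterate it in double precision. Chaotic sensitivity means every low-order
 * rounding decision of the FP hardware feeds into the final bits, so any
 * implementation has to be bit-exact IEEE-754 double precision.            */
uint64_t chaotic_mix(uint64_t nonce, int rounds)
{
    double x = (double)(nonce % 1000003u) / 1000003.0 + 1e-9;
    const double r = 3.99;                 /* well inside the chaotic regime */
    for (int i = 0; i < rounds; i++)
        x = r * x * (1.0 - x);
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);        /* reinterpret the double's bits  */
    return bits;
}

int main(void)
{
    printf("%016llx\n", (unsigned long long)chaotic_mix(12345, 1000));
    return 0;
}
```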
I've only skimmed the currently available literature on the subject, but it is next to trivial to demolish all the current claims of FPGA superiority that I was able to find today:
1) use double precision
2) use division or reciprocal (either accurate or approximate)
3) use square-root or reciprocal square-root (either accurate or approximate)
and I haven't even gotten into transcendental functions (on CPUs) or the newer, pixel-oriented hardware in the shaders (on GPUs). A quick sketch of points 1-3 is below.
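A minimal sketch of what I mean by points 1-3, combining double-precision division and square root in one inner loop (the function name and constants are just placeholders):

```c
#include <math.h>
#include <stdio.h>

/* Inner loop mixing double-precision division and square root. On CPUs and
 * GPUs these map onto existing hard-wired units; an FPGA design has to spend
 * considerable DSP/fabric resources to match them at full precision.        */
double dp_mix(double seed, int rounds)
{
    double x = seed;
    for (int i = 0; i < rounds; i++) {
        x = 1.0 / (x + 1.0);       /* accurate double-precision reciprocal  */
        x = sqrt(x + 0.5);         /* accurate double-precision square root */
    }
    return x;
}

int main(void)
{
    printf("%.17g\n", dp_mix(0.12345, 1000));
    return 0;
}
```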
You did, however, motivate me to reconsider Altera/Quartus for certain future projects. They are now shipping limited but fully hardware-implemented single-precision floating point in their DSP blocks, and their toolchain has improved in terms of supported OSes/device drivers.
I deal with a lot of complex, large FFTs on CPUs, GPUs, and FPGAs. The "only using single precision" is unfortunately true of every vendor - GPU and FPGA. Marketing wants to use the big number - and frankly so do most real-world users now. Modern GPUs are horrible at double precision. It is a sad fate. Your comparison also pits a modern Stratix 10 (10 TFLOPS) against the previous-generation UltraScale (not UltraScale+), with slower fabric and significantly fewer DSP blocks than the VCU1525 (XCVU9P-L2FSGD2104E) everyone here has been talking about.
Compared to even modern weak-DP GPUs, any normally priced CPU is horrible at double precision. A modern GPU runs circles around a CPU on complex FFTs in double precision. Both quickly become memory bound. FPGA performance is usually on par or slightly better for the double-precision part, but the benefits in the rest of the calculation are much larger. I think you'll be hard pressed to build a hashing algorithm that is entirely floating point, like a synthetic benchmark.
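A rough way to see why both become memory bound, using the textbook FFT operation count (the device numbers are placeholders, not any specific part):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Complex double-precision FFT of length N: ~5*N*log2(N) FLOPs, and at
     * minimum the data (16 bytes per complex double) is read and written
     * once per pass through memory.                                        */
    double n         = 1 << 24;                /* 16M-point FFT             */
    double flops     = 5.0 * n * log2(n);
    double bytes     = 2.0 * n * 16.0;         /* one read + one write      */
    double intensity = flops / bytes;          /* FLOPs per byte             */

    /* Hypothetical device: 7 TFLOPS DP peak, 900 GB/s memory bandwidth.    */
    double balance = 7000.0 / 900.0;           /* FLOPs/byte needed to stay
                                                  compute bound              */
    printf("FFT arithmetic intensity: %.1f FLOP/byte\n", intensity);
    printf("Device balance point:     %.1f FLOP/byte\n", balance);
    printf("%s-bound\n", intensity < balance ? "Memory" : "Compute");
    return 0;
}
```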
The only place FPGAs really fall down is upfront cost.
I'm still a bit confused by why you think sqrt/reciprocal and the transcendentals are so difficult for FPGAs, or that they are magically free on GPUs/CPUs. On at least AMD GPUs these are macro-ops that take hundreds of clock cycles. (EDIT: searching for my reference on this, I see these ops are quarter rate; I may have been thinking of division.) On the FPGA you can devote a lot of logic to lowering the latency of these functions, or you can pipeline them nice and long with very high throughput to match what you need for the algorithm in question. You have none of that flexibility on the GPU. What you do have is a tremendous amount of power and overhead in instruction fetching, scheduling, branching, caching, etc. feeding a limited set of ports that implement the opcodes for each GCN/CUDA core.
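For reference, the usual trick on any of these platforms is to start from a cheap approximation and refine it with Newton-Raphson; a minimal double-precision sketch (not any vendor's actual implementation):

```c
#include <stdio.h>

/* Refine an initial reciprocal estimate y0 ~ 1/a with Newton-Raphson:
 * y_{n+1} = y_n * (2 - a * y_n). Each step roughly doubles the number of
 * correct bits, which is why a GPU can expose a fast approximate op and
 * leave the last iterations to the shader, while an FPGA design can
 * pipeline exactly as many refinement stages as the accuracy requires.   */
double refine_recip(double a, double y0, int steps)
{
    double y = y0;
    for (int i = 0; i < steps; i++)
        y = y * (2.0 - a * y);
    return y;
}

int main(void)
{
    double a  = 3.0;
    double y0 = 0.3;                       /* crude initial guess for 1/3 */
    printf("refined: %.17g (exact: %.17g)\n",
           refine_recip(a, y0, 4), 1.0 / a);
    return 0;
}
```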