Mine along on your CPU if you wanna make up the difference and then some.
The issue isn't really about losing one in billions of hashes. It is about gaining the timing margin (a.k.a. overclocking headroom) in the design.
Of course Avalon's logic is secret, so I'm going to discuss the problem based on one of the open-source FPGA hashers. It had a critical timing path in the logic that latched the "golden nonce". Since the design was pipelined 125 stages deep, it had hardware that subtracted the constant 125 from the nonce counter before sending the nonce out of the chip.
Now we have two ways to speed up the above design:
1) remove the 32-bit wide constant subtractor. This will gain a fraction of a nanosecond on every hash tried. It is very easy to subtract 125 in software from the nonce downloaded from the chip.
2) acknowledge that the timing violation may occur and that the nonce latched may not be the exact one that solved the block, but the next or the previous one, depending on the details of the latching logic. It is somewhat more involved, but still easily doable in software: recompute the hashes for nonce values n-126, n-125, n-124 and use the one that solved the block. Again, this will make the design more tolerant to overclocking for every hash tried inside the chip.
Obviously 1) cannot be applied to an ASIC chip or a closed-source FPGA bitstream. But method 2) remains applicable; just use a different set of test values.
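To make the software side concrete, here is a rough Python sketch of both methods. It assumes the chip returns the raw nonce counter (no subtractor on chip), that the host still has the first 76 bytes of the block header, and that the latch can be off by at most one position. The names block_hash, recover_nonce and slack are made up for illustration; only the pipeline depth of 125 comes from the design discussed above.

```python
import hashlib
import struct

PIPELINE_DEPTH = 125  # depth quoted in the post; other designs will differ


def block_hash(header76: bytes, nonce: int) -> int:
    """Double SHA-256 of the 80-byte header, read as a little-endian integer."""
    header = header76 + struct.pack("<I", nonce & 0xFFFFFFFF)
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")


def recover_nonce(header76: bytes, raw_nonce: int, target: int, slack: int = 1):
    """Find the real solving nonce near raw_nonce - PIPELINE_DEPTH.

    Method 1 is the subtraction itself, done here instead of on chip.
    Method 2 is the loop: with slack=1 it tries n-126, n-125, n-124
    and keeps the first candidate whose hash meets the target.
    Returns None if no candidate qualifies.
    """
    for delta in range(PIPELINE_DEPTH + slack, PIPELINE_DEPTH - slack - 1, -1):
        candidate = (raw_nonce - delta) & 0xFFFFFFFF  # wrap like a 32-bit counter
        if block_hash(header76, candidate) <= target:
            return candidate
    return None
```

The cost is three double hashes on the host per reported nonce, which is nothing next to the billions of hashes the chip tries between reports. For a closed chip you would replace the three offsets with whatever set of test values matches its latching behaviour.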
Since it's a pipelined design, wouldn't removing the subtractor just reduce the latency of the pipeline instead of increasing the throughput?
Even if this subtractor prevented the reloading of the pipeline, you could still pipeline the subtractor along with the rest of the pipeline.
Since the pipeline will not (I presume) produce a nonce to be latched on every clock, you have more than enough time to store the previous nonce on chip and subtract the number before sending it out to the controller.
At least I would make my 'store' circuit parallel to the actual pipeline so it can operate asynchronously.
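A back-of-the-envelope check of the point about a nonce not being latched on every clock, as a small Python sketch. It assumes the chip flags hashes whose top 32 bits are zero (a difficulty-1 check, which is my assumption about the design, not something stated above), one hash per clock, and a made-up 200 MHz clock.

```python
CLOCK_HZ = 200e6        # hypothetical clock rate, not from any real chip
HASHES_PER_CLOCK = 1    # fully unrolled pipeline: one hash result per clock
P_GOLDEN = 2.0 ** -32   # chance that the top 32 bits of a hash are all zero


def clocks_between_hits(p: float = P_GOLDEN, hashes_per_clock: int = HASHES_PER_CLOCK) -> float:
    """Average number of clocks between two latched nonces."""
    return 1.0 / (p * hashes_per_clock)


def seconds_between_hits(clock_hz: float = CLOCK_HZ) -> float:
    """The same interval expressed in seconds at the given clock rate."""
    return clocks_between_hits() / clock_hz
```

Under those assumptions a hit shows up on average once every 2^32 clocks, roughly every 21 seconds at 200 MHz, so a subtractor sitting after the latch could take many clocks without ever holding up the hashing pipeline.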