Of course Avalon's logic is secret, but I'm going to discuss the problem based on one of the open-source FPGA hashers. It had a critical timing path in the logic that latched the "golden nonce". Since the design was pipelined 125 stages deep, it had hardware that subtracted the constant 125 from the nonce counter before sending it out of the chip.
Now we have two ways to speed up the above design:
1) Remove the 32-bit wide constant subtractor. This gains a fraction of a nanosecond on every hash tried. It is very easy to subtract 125 in software from the nonce downloaded from the chip.
2) Acknowledge that a timing violation may occur and that the latched nonce may not be the exact one that solved the block, but the next or the previous one, depending on the details of the latching logic. This is somewhat more involved, but still easily doable in software: recompute the hashes for nonce values n-126, n-125 and n-124, and use the one that solved the block. Again, this makes the design more tolerant of overclocking for every hash tried inside the chip.
Obviously 1) cannot be applied to an ASIC chip or a closed-source FPGA bitstream. But method 2) remains applicable; just use a different set of test values.
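A minimal sketch of what the software side of 1) and 2) could look like, assuming the host already has the first 76 bytes of the block header and the target as an integer. The names `recover_golden_nonces`, `meets_target` and the `slack` parameter are made up for illustration and are not taken from any particular miner:

```python
import hashlib
import struct

# From the post: the pipeline is 125 stages deep, so the raw counter value
# read from the chip runs 125 ahead of the nonce that was actually hashed.
PIPELINE_DEPTH = 125


def double_sha256(data: bytes) -> bytes:
    """Bitcoin block hash: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def meets_target(header76: bytes, nonce: int, target: int) -> bool:
    """Hash the 80-byte header (76 bytes + little-endian nonce) and compare to target."""
    header = header76 + struct.pack("<I", nonce & 0xFFFFFFFF)
    return int.from_bytes(double_sha256(header), "little") <= target


def recover_golden_nonces(header76: bytes, raw_nonce: int, target: int, slack: int = 1):
    """Undo the pipeline offset in software and tolerate a latch that is off by +/- slack.

    With slack=1 this checks raw-126, raw-125 and raw-124, as described above.
    Returns every candidate that really meets the target (normally exactly one).
    """
    candidates = [
        (raw_nonce - PIPELINE_DEPTH + d) & 0xFFFFFFFF  # wrap like a 32-bit counter
        for d in range(-slack, slack + 1)
    ]
    return [n for n in candidates if meets_target(header76, n, target)]
```

For an ASIC or closed bitstream, where the chip already applies its own offset, the same idea would presumably apply with `PIPELINE_DEPTH = 0` and a suitable `slack`; the cost is three double hashes on the host per reported nonce.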
@hardcore-fs
Please read the context in which things were said:
The 32-bit wide constant subtractor in that design limits the speed of the whole design. You can speed up the whole design by removing that subtractor and simply deducting the constant later from the "nonce" received from the chip to get the "golden nonce"...
It's simply about gaining some headroom; the weakest link breaks the chain.
If you can gain even more headroom by assuming that the nonce latched in the chip is not the "golden" one but one very close by, as he stated, a few validity checks on nearby nonces will reveal the "golden nonce". This way you can push the chip's clock and internal timings to their limits and even a bit beyond.
If this can significantly improve the maximum speed the device can reach, in exchange for an insignificant bit of CPU time for every nonce found (we're talking about cross-checking only a handful of nonces versus the workload, and that roughly every 10.7 seconds for a 200 MHz core), then yes, it's an acceptable way.
You mean like adding lubricant to your tires so you can go downhill faster.
You need grip to make use of the car's engine, so no point.. But IF the task is to get the car down the hill the fastest way possible with the brakes engaged, without needing to be able to stop it, and all you have is an unlimited supply of lubricant, then yes, adding lubricant to both the street and the tires is the way to accomplish the task.
You sir are a fucking idiot.
FPGAs process in true parallel.
I can process thousands....(nay tens of thousands) of 32-bit subtractions in an FPGA, before you have even fucking read the numbers into your CPU registers.
Posting insults & very basic / unrelated facts (of actually any logic or programmable logic) doesn't help here.
"In the land of the blind, the one-eyed man is king"
For interest, take a look at one of the ASICs floating about: they have given a proposed pinout showing 8 data lines and some strobes.
WTF.... even the nonce will require 4 CLK cycles just to get it out of the chip, and they are claiming this design is good into the GH/s range?
Here we go
a "truly parallel" 8-bit data bus
Of course it's good into the GH/s range. The traffic is low since only the results ("golden nonces") need to be collected; everything else already gets discarded in the chip... The only way to make it faster than it is right now would be a 32-bit data bus that gets the whole nonce out of the chip in one CLK cycle. Would it matter? No... it's a waste of resources and space, 24 more pins/tracks to deal with for no real benefit... The same is true for getting the "work" to the chip, of course (which is more than 4 bytes...).
Edit: It should be said, for the sake of completeness, that the 4 clock cycles needed to collect the nonce come from an external controller and are not directly related to the internal clock used by the hashing chip. Further, the clock for collecting the data will be slower than the internal clock used by the hashing chip. That changes nothing about the situation, though.
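To put rough numbers on the bus argument, here is a back-of-the-envelope sketch. It assumes the chip only reports nonces that meet a difficulty-1 target (about one hit per 2^32 hashes on average) and that each bus transfer moves `bus_width_bits` of the 32-bit nonce; the function name and parameters are made up for illustration:

```python
NONCE_BITS = 32


def nonce_readout(hashrate_hs: float, bus_width_bits: int = 8, share_difficulty: float = 1.0):
    """Return (nonces per second, bus transfers per second) for result readout only.

    Assumption: a hash meets a difficulty-1 target with probability 2**-32,
    so the expected number of hashes per reported nonce is difficulty * 2**32.
    """
    expected_hashes_per_nonce = share_difficulty * 2 ** 32
    nonces_per_second = hashrate_hs / expected_hashes_per_nonce
    transfers_per_nonce = -(-NONCE_BITS // bus_width_bits)  # ceiling division
    return nonces_per_second, nonces_per_second * transfers_per_nonce


if __name__ == "__main__":
    for width in (8, 32):
        nonces, transfers = nonce_readout(1e9, bus_width_bits=width)
        print(f"{width:2d}-bit bus: {nonces:.3f} nonces/s, {transfers:.3f} transfers/s")
```

Under these assumptions a 1 GH/s chip produces well under one nonce per second, so the 8-bit bus needs on the order of one transfer per second for results, and widening it to 32 bits saves about three transfers per nonce.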