Yes, current FPGA designs are fully pipelined, as long as they *fit* into the FPGA, and thus you get a hash rate of 200 MH/s at a clock frequency of 200 MHz. And it's not 1300 cycles, but literally 122 (or 128 or something like that).
Let's not confuse unrolling with pipelining. Current open-source designs are fully unrolled, but I have yet to see a proper general pipelined design that is open-sourced. Maybe a different scheme of pipelining and unrolling is what BFL did? By flexibly pipelining the unrolled design they could significantly crank up the clock, since FPGAs are limited more by the propagation delay in the signal routing than by the propagation delay in the actual logic.
As I asked before, do they execute in a manner like dominoes where the clock process advances data through the FPGA in steps?
No, this is a horrible analogy.
To understand the unrolling as applied to the logic design you need to understand the difference between combinatorial and sequential logic.
In combinatorial logic the outputs are simply a function of inputs.
In sequential logic the outputs are a function of the inputs and the internal state. All the current FPGA hashing appliances use a variant of sequential logic called synchronous sequential logic: there is a dedicated clock input, and the internal state is updated only when the clock changes.
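The distinction can be sketched in software. This is only a rough model, not hardware description code: the function and class names are made up for illustration, and a method call stands in for a clock edge.

```python
def leading_bits_are_zero(value: int, width: int, zero_bits: int) -> bool:
    """Combinatorial logic: the output is purely a function of the inputs."""
    return value >> (width - zero_bits) == 0


class NonceCounter:
    """Synchronous sequential logic: state changes only on clock_edge()."""

    def __init__(self, width: int = 32):
        self.width = width
        self.state = 0  # internal state register

    def clock_edge(self) -> int:
        # Next state depends on the current state; wraps like a hardware counter.
        self.state = (self.state + 1) % (1 << self.width)
        return self.state


# A 4-bit counter clocked 17 times wraps around once.
counter = NonceCounter(width=4)
values = [counter.clock_edge() for _ in range(17)]
```

Calling `leading_bits_are_zero` twice with the same arguments always gives the same answer, while `clock_edge` gives a different answer each time because it carries state between calls.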
Fully unrolled (128-way or 125-way) Bitcoin hash means that the logic that computes it is fully combinatorial: no internal state is used inside the cascade of the two SHA-256 hashers. The clock is still used in the fully unrolled design: to increment the nonce counter and to sample the zero-comparator at the output of the hasher.
A 64-way unrolled Bitcoin hash means that there is one internal state register that stores the intermediate state. During the odd clock cycles it computes a single SHA-256 of the input (midstate and nonce) and stores it in the internal state. During the even clock cycles it computes a single SHA-256 of the internal state and presents it on the output. This circuit is only about half the size of the fully unrolled circuit.
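Here is a rough behavioural model of the two variants, counting clocks only. It is a sketch under assumptions: `HEADER_PREFIX` is a made-up all-zero stand-in for the 76 header bytes before the nonce, `hashlib` replaces the hardware hasher, and the midstate shortcut is not modelled (the whole 80 bytes are rehashed each time).

```python
import hashlib

HEADER_PREFIX = bytes(76)  # hypothetical placeholder, not a real block header


def sha256_pass(data: bytes) -> bytes:
    """Stand-in for one hardware SHA-256 hasher."""
    return hashlib.sha256(data).digest()


def fully_unrolled(nonces):
    """Both passes form one combinatorial cascade: one result per clock."""
    clocks, results = 0, []
    for nonce in nonces:
        clocks += 1
        message = HEADER_PREFIX + nonce.to_bytes(4, "little")
        results.append(sha256_pass(sha256_pass(message)))
    return results, clocks


def half_unrolled(nonces):
    """One hasher reused on alternating clocks, with one state register."""
    clocks, results = 0, []
    state = b""  # internal state register
    for nonce in nonces:
        clocks += 1  # odd cycle: hash the input, store it in the register
        state = sha256_pass(HEADER_PREFIX + nonce.to_bytes(4, "little"))
        clocks += 1  # even cycle: hash the register, present the output
        results.append(sha256_pass(state))
    return results, clocks


full, full_clocks = fully_unrolled(range(8))
half, half_clocks = half_unrolled(range(8))
```

Both functions produce the same digests; the half-size variant simply needs twice as many clocks for the same set of nonces.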
Now, a fully unrolled and pipelined design would be about the same size as the fully unrolled design above, but it would have some internal state registers. With one level of pipelining, in each clock cycle the first half of the circuit would compute a single SHA-256 for nonce "N" while the second half of the circuit would compute a single SHA-256 for nonce "N-1".
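One level of pipelining can be modelled the same way. Again this is only a sketch with invented names: `pipe_reg` plays the internal register between the two halves, `HEADER_PREFIX` is a made-up all-zero placeholder, and `hashlib` stands in for each half of the circuit.

```python
import hashlib

HEADER_PREFIX = bytes(76)  # hypothetical placeholder, not a real block header


def sha256_pass(data: bytes) -> bytes:
    """Stand-in for one half of the unrolled circuit."""
    return hashlib.sha256(data).digest()


def one_stage_pipeline(nonces):
    """Each clock: second half finishes nonce N-1, first half starts nonce N."""
    pipe_reg = None  # (nonce, first-pass digest) latched at the last clock edge
    clocks, results = 0, []
    # One extra clock with no new input drains the last nonce out of the pipe.
    for nonce in list(nonces) + [None]:
        clocks += 1
        if pipe_reg is not None:
            prev_nonce, first_pass = pipe_reg
            results.append((prev_nonce, sha256_pass(first_pass)))
        if nonce is not None:
            message = HEADER_PREFIX + nonce.to_bytes(4, "little")
            pipe_reg = (nonce, sha256_pass(message))
        else:
            pipe_reg = None
    return results, clocks


results, clocks = one_stage_pipeline(range(4))
```

The pipeline still delivers one result per clock once it is full; the only cost is one clock of latency at the start.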
In an unpipelined design the input signals have to race through the full 128 (or 125) rounds of SHA-256 to the zero-comparator. In a design with one level of pipelining the inputs have to race through 64 rounds of the first SHA-256 to the internal register, simultaneously with another set of signals racing through 64 (or 61) rounds from the internal register to the zero-comparator. Simplistically, one could say that the clock rate of this design could be almost double the clock rate of the non-pipelined design.
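The "almost double" can be put into a back-of-the-envelope calculation. The delay figures below are invented purely for illustration and are not measurements of any real FPGA; the only point is that the fixed per-register overhead keeps the ratio slightly under 2.

```python
def max_clock_mhz(rounds_between_registers: int,
                  ns_per_round: float,
                  register_overhead_ns: float) -> float:
    """Clock limit set by the longest register-to-register path."""
    period_ns = rounds_between_registers * ns_per_round + register_overhead_ns
    return 1000.0 / period_ns


# Assumed, illustrative numbers only.
NS_PER_ROUND = 1.0          # logic plus routing delay of one SHA-256 round
REGISTER_OVERHEAD_NS = 0.5  # setup time and clock-to-output of a register

unpipelined = max_clock_mhz(128, NS_PER_ROUND, REGISTER_OVERHEAD_NS)
# With one pipeline register the slower of the two halves sets the clock.
one_stage = max_clock_mhz(max(64, 64), NS_PER_ROUND, REGISTER_OVERHEAD_NS)
speedup = one_stage / unpipelined
print(f"speedup from one pipeline stage: {speedup:.3f}x")
```

With more pipeline registers the same formula applies, but the register overhead takes a growing share of each clock period.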
Again, I haven't seen anyone publish open-source designs that are both independently unrolled and pipelined. I think this is due to the limitations of the FPGA synthesis tools: they require either tens or hundreds of GB of RAM or months of CPU time.
Anyway, kano, I suggest that you throw away your domino set. Get yourself a free version of Xilinx ISE or Altera Quartus and use it to play the FPGA design game. It is like playing Tetris, chess & contract bridge all on the same board.