Current Icarus code (at least the released stuff) is based on the ZTex code.
The ZTex code has a central core module which has a variable number of stages. The SHA-2 (SHA256) spec calls for 64 stages per hash. But the way bitcoin uses it, only one of the two hashes needs the full 64 stages; the other can be a partial hash.
So the ZTex code (and therefore the Icarus code) does 64 stages on one core and 61 stages on the other core, for a total of 125 stages. All of those stages are fully unrolled, so it takes 125 clocks (probably slightly more; I haven't looked at the UART code and control logic in depth yet) to fully load the pipeline, after which it runs 1 hash per clock.
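As a rough sanity check on those numbers, here is a minimal Python sketch of the clock count for a fully unrolled pipeline. It assumes the simple model described above (125 clocks before the first result, then one result per clock) and ignores any UART or control overhead. The function name and the 100 MHz clock are made up for illustration and are not taken from the ZTex or Icarus code.

```python
PIPELINE_DEPTH = 125  # 64 + 61 stages, as described above


def clocks_for(n_hashes, depth=PIPELINE_DEPTH):
    """Clocks until n_hashes results have left a fully unrolled pipeline."""
    if n_hashes <= 0:
        return 0
    # The first result appears after `depth` clocks, then one more per clock.
    return depth + (n_hashes - 1)


if __name__ == "__main__":
    full_range = 2 ** 32        # every 32-bit nonce
    clock_hz = 100_000_000      # assumed clock, illustration only
    clocks = clocks_for(full_range)
    print(clocks, "clocks,", round(clocks / clock_hz, 1), "s at the assumed clock")
```

Under this model the fill time is negligible next to a full nonce range, so throughput is set almost entirely by the clock rate.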
...
Ignoring the pipeline question I asked, I hope it doesn't do 64 + 61.
(Well, actually I should say I hope it does do this, because then there is still a speed-up available.)
The 2nd sha256 is actually just 60.5 stages - but that is probably what you meant by 61.
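To show where the 60.5 comes from: the last 32-bit word of the digest is the initial H7 plus the 'h' register after stage 64, and that register is just the 'e' value produced in stage 61, passed along unchanged for three more stages. So a check that only looks at that last word needs 60 full stages plus the 'e' half of stage 61. Below is a minimal Python sketch of that idea. It assumes the miner only tests that last word first (a real design may check more bits afterwards), all function names are made up for illustration, and the constants are derived from the standard SHA-256 definition (fractional parts of cube and square roots of primes) rather than typed in.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found


def icbrt(x):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


K = [icbrt(p << 96) & MASK for p in primes(64)]       # stage constants
IV = [math.isqrt(p << 64) & MASK for p in primes(8)]  # initial hash value


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def t1_of(state, k, w):
    """The half of a stage that feeds the new 'e' register."""
    a, b, c, d, e, f, g, h = state
    return (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (~e & g)) + k + w) & MASK


def sha_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    t1 = t1_of(state, k, w)
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def schedule(words):
    w = list(words)
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w


def last_word_early(first_digest):
    """Last 32-bit word of sha256(first_digest) from 60.5 stages."""
    # A 32-byte message padded to one 64-byte block (length = 256 bits).
    block = first_digest + b"\x80" + b"\x00" * 23 + struct.pack(">Q", 256)
    w = schedule(struct.unpack(">16I", block))  # only w[0..60] are used below
    state = tuple(IV)
    for i in range(60):                         # 60 full stages
        state = sha_round(state, K[i], w[i])
    # Half of stage 61: the new 'e' is old 'd' plus T1; the new 'a' is not needed.
    e61 = (state[3] + t1_of(state, K[60], w[60])) & MASK
    return (IV[7] + e61) & MASK


if __name__ == "__main__":
    d1 = hashlib.sha256(b"made-up input, not a real block header").digest()
    assert struct.pack(">I", last_word_early(d1)) == hashlib.sha256(d1).digest()[28:]
```

The sketch still expands the whole message schedule for simplicity; only W0-W60 are actually needed for this check.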
The 1st sha256 is 61 stages also - the first 3 'stages' are exactly the same for every nonce in a range, so repeating them 4 billion times is a waste.
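The saving on the 1st sha256 can be sketched like this. The nonce is word W3 of the second 64-byte chunk of the 80-byte header, and stages 0-2 only read W0-W2, so their result can be computed once per range and every nonce can start from stage 3. This is a minimal Python sketch under those assumptions; the helper names and the header bytes are made up for illustration, and 'stage' here means one SHA-256 round. The result is checked against hashlib.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found


def icbrt(x):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


K = [icbrt(p << 96) & MASK for p in primes(64)]       # stage constants
IV = [math.isqrt(p << 64) & MASK for p in primes(8)]  # initial hash value


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (~e & g)) + k + w) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def schedule(words):
    w = list(words)
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w


def run_stages(state, w, first, last):
    for i in range(first, last):
        state = sha_round(state, K[i], w[i])
    return state


def chunk2_words(prefix76, nonce):
    """W0..W15 for the second 64-byte chunk: header tail, nonce, padding."""
    tail = prefix76[64:] + struct.pack("<I", nonce)
    return struct.unpack(">16I", tail + b"\x80" + b"\x00" * 39 + struct.pack(">Q", 640))


def first_hashes(prefix76, nonces):
    """Yield (nonce, first SHA-256 digest) using 61 stages per nonce."""
    first = run_stages(IV, schedule(struct.unpack(">16I", prefix76[:64])), 0, 64)
    mid = [(x + y) & MASK for x, y in zip(IV, first)]       # midstate, once per range
    # Stages 0..2 read only W0..W2, so any nonce works as a placeholder here.
    pre = run_stages(mid, chunk2_words(prefix76, 0), 0, 3)  # once per range
    for nonce in nonces:
        w = schedule(chunk2_words(prefix76, nonce))
        out = run_stages(pre, w, 3, 64)                     # 61 stages per nonce
        yield nonce, struct.pack(">8I", *[(x + y) & MASK for x, y in zip(mid, out)])


if __name__ == "__main__":
    prefix = bytes(range(76))  # made-up header bytes, not a real block
    for nonce, digest in first_hashes(prefix, [0, 1, 12345]):
        assert digest == hashlib.sha256(prefix + struct.pack("<I", nonce)).digest()
```

The sketch recomputes the whole message schedule for every nonce to keep it short; that part can be trimmed as well, since several W values do not depend on the nonce.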
There are also the nonce-constant values of W0-W2 and W4-W15 (W4-W15 are constant over all time).
The calculation of W16 and W17 is then also constant across the nonce range.
(There are also other partial calculations you can do that are constant across a nonce range.)
Edit: the partial ones are W18 (S0), W19 (S0 and S1; S1 is a constant over all time), W20 (S1, again a constant over all time), W21 (S1 = 0) and W22-W30 (S1). None of these partial calculations should be done 4 billion times if at all possible (and some of the +W values for these are also constants per range, or even constants over all time).
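The first few of these can be sketched in Python as follows. The sketch assumes the usual layout of the second chunk of an 80-byte header (W0-W2 from the header, W3 the nonce, W4 = 0x80000000, W5-W14 = 0, W15 = 640) and uses the small-sigma names from the SHA-256 spec, which may not line up with the S0/S1 labels used above. The function names are made up for illustration, and only W16-W19 are covered; the later partial terms follow the same pattern.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def chunk2_words(w0, w1, w2, nonce):
    """W0..W15 of the second 64-byte chunk of an 80-byte block header."""
    # W4 holds the 0x80 padding byte, W5..W14 are zero, W15 is the length in bits.
    return [w0, w1, w2, nonce, 0x80000000] + [0] * 10 + [640]


def schedule(words):
    """Plain SHA-256 message schedule, W0..W63."""
    w = list(words)
    for i in range(16, 64):
        w.append((w[i - 16] + sigma0(w[i - 15]) + w[i - 7] + sigma1(w[i - 2])) & MASK)
    return w


def per_range_constants(w0, w1, w2):
    """Values that depend on W0..W2 only, so they are computed once per range."""
    w = chunk2_words(w0, w1, w2, 0)  # nonce 0 is a placeholder; w[3] is never read
    w16 = (w[0] + sigma0(w[1]) + w[9] + sigma1(w[14])) & MASK
    w17 = (w[1] + sigma0(w[2]) + w[10] + sigma1(w[15])) & MASK
    w18_part = (w[2] + w[11] + sigma1(w16)) & MASK          # still needs + sigma0(nonce)
    w19_part = (sigma0(w[4]) + w[12] + sigma1(w17)) & MASK  # still needs + nonce
    return w16, w17, w18_part, w19_part


def varying_indices(w0, w1, w2, nonce_a, nonce_b):
    """Schedule indices whose value differs between two nonces."""
    wa = schedule(chunk2_words(w0, w1, w2, nonce_a))
    wb = schedule(chunk2_words(w0, w1, w2, nonce_b))
    return [i for i in range(64) if wa[i] != wb[i]]


if __name__ == "__main__":
    w0, w1, w2 = 0x11223344, 0x55667788, 0x99AABBCC  # made-up header words
    w16, w17, p18, p19 = per_range_constants(w0, w1, w2)
    for nonce in (0, 1, 0xDEADBEEF):
        w = schedule(chunk2_words(w0, w1, w2, nonce))
        assert (w[16], w[17]) == (w16, w17)
        assert w[18] == (p18 + sigma0(nonce)) & MASK
        assert w[19] == (p19 + nonce) & MASK
```

Per nonce, W18 then costs one small-sigma and one add, and W19 a single add, instead of the full four-term sum each.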
Edit2: I wrote a C program many months ago to analyse the double sha256 and optimise it (and spit out an optimised C program to calculate it, which works), and that's where I get this info from. I know it is correct because, as I said, the output code works.
I did this for my own understanding of what optimisations there are ... and of course found them all for the normal double sha256.
If you could actually fit doing 2 nonces at a time in one chip, there are also some more partial calculations shared across each pair of nonces (I started working on that with my code but didn't finish, since there was no actual use for the results at the time).