Thank you for providing an important puzzle piece on how Dr. Tyrell does it.
The multiplier in the DSP48-block is not needed in SHA-256, hence what he obviously uses is the 18-bit adder
BCOUT = B + D.
He uses 30 DSP blocks, 10 for each of the red, green and blue SHA-256 instances.
A 32-bit adder needs two of the 18-bit adders BCOUT = B + D.
Thus, he can implement five 32-bit adders per SHA instance.
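To make the adder count concrete, here is a rough Python model of one way a 32-bit add could be split across two 18-bit adders. This is only a sketch under my own assumptions: I don't know how the carry is actually passed between the two blocks, so the model uses a 16/16 split and injects the carry through bit 0 of the high slice. The names add18 and add32 are made up for illustration.

```python
MASK18 = (1 << 18) - 1  # result width of one DSP adder (BCOUT = B + D)

def add18(b, d):
    """Model of a single 18-bit adder; the result wraps at 18 bits."""
    return (b + d) & MASK18

def add32(x, y):
    """32-bit modular add built from two add18 calls (hypothetical split)."""
    x &= 0xFFFFFFFF
    y &= 0xFFFFFFFF
    # Low slice: add the low 16-bit halves; bit 16 of the result is the carry.
    lo = add18(x & 0xFFFF, y & 0xFFFF)
    carry = (lo >> 16) & 1
    # High slice: shift both halves left by one and use bit 0 to inject the
    # carry. (2a + 1) + (2b + carry) has a + b + carry in bits 1 and up.
    hi = add18(((x >> 16) << 1) | 1, ((y >> 16) << 1) | carry)
    return (((hi >> 1) & 0xFFFF) << 16) | (lo & 0xFFFF)

# Adder budget from the post: 30 DSP blocks, 3 SHA-256 instances,
# 2 blocks per 32-bit adder.
DSP_BLOCKS = 30
SHA_INSTANCES = 3
ADDERS_PER_INSTANCE = (DSP_BLOCKS // SHA_INSTANCES) // 2
```

The operands of the high slice are 17 bits wide, so the sum still fits in 18 bits and nothing is lost before the final shift.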
So, why use [slow] 32-bit ripple adders almost everywhere, and [very fast] DSP adders in only a few places?
The answer is, IMHO, that he uses the fast DSP adders only where they feed into longlines.
Were he to use normal ripple adders where he feeds into longlines, the aggregate delay would limit
the design to a 5 ns clock cycle (200 MHz).
Using the fast DSP adders will allow this design, when properly fine-tuned, to march into 4 ns clock cycle
(250 MHz) territory, for approximately 125 MH/s per SHA-256 instance, or approximately 375 MH/s per Spartan6-150.
BFL Single, watch out below.
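For anyone who wants to check the arithmetic, a small Python sketch follows. The figure of two clocks per hash is my own inference from the numbers above (a 4 ns clock giving about 125 MH/s per instance); it is not something stated about the actual design, and hash_rate_mhs is just an illustrative helper.

```python
def hash_rate_mhs(clock_ns, clocks_per_hash, instances=1):
    """Hash rate in MH/s for a given clock period in nanoseconds."""
    clock_mhz = 1000.0 / clock_ns  # 4 ns -> 250 MHz, 5 ns -> 200 MHz
    return clock_mhz / clocks_per_hash * instances

# Assumption: each instance finishes one hash every 2 clocks.
per_instance = hash_rate_mhs(4, 2)               # one SHA-256 instance at 4 ns
per_chip = hash_rate_mhs(4, 2, instances=3)      # three instances at 4 ns
per_chip_5ns = hash_rate_mhs(5, 2, instances=3)  # same design held to 5 ns
```

Under the same assumption, staying at a 5 ns clock would leave the three instances at about 300 MH/s, so the DSP adders would be worth roughly 75 MH/s per chip.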
I remember nghzang mentioning that running the chips at 200MHz was not recommended (the chips got very hot), and he gave
out a bitstream marked "Use at your own risk". Three loops on the same chip suggest that a far greater number of
registers is being used. Each stage's toggle rate approaches 50% (the idea behind digest functions is that the toggle rate
must approach 50% in every stage for them to be effective, and that is the case in SHA256), so I wonder how hot the chips will get at high
frequencies, approaching 180MHz or 190MHz...
Good Luck,