Wow, Terasic apparently noticed this project and posted it on their Facebook page. Neato! And they also gave a shout-out to Bitcoin!
BTW, if you want to, you could host my VHDL version in your git repository as well, for the Xilinx users.
Sure, that'd be great!
Never ever. I'm if you mean XC6SLX150-3, this isn't going to cross 120MH/s, at least not with this design.
I've tried to synthesize a fully-unrolled (131 pipeline stages) design for that FPGA, and even though it fits (it doesn't on the LX100 variant), there seem to be difficulties routing it, and attempting to synthesize it for 120MHz didn't work out at all. The router is still running (since almost 48 hours now), so I don't even know which frequency it did reach yet.
You mention 131 pipelined stages. I assume you mean something like, 128 round modules, 2 final adder stages, and a third ... misc stage?
My estimate for the 6SLX150-3N comes from assuming that Xilinx's estimation on equivalent LCs is accurate and that the design would use a similar number of 4-LUTs and FFs on Xilinx, as compared to Altera.
The optimized design can fit into 80K LEs on a Cyclone 3, possibly 75K with some more cramming; at 80MHz and one cycle per full-hash. Assuming LE == LC that would allow two of those to fit into the 6SLX150-3N because it has the equivalent of 150K LCs. That would achieve a combined hash-rate of 160MH/s.
Obviously, you can see that my assumptions all come from "LE == LC." Indeed, when I've synthesized my design for the 6SLX150, utilization was roughly what I expected. But as you pointed out, routing is where I got stuck. ISE refused to route the design. That's on my list of things to do; figure out why ISE won't route the design, even though it will route a cut-down design (only implementing a few rounds instead of all 128) just fine, and utilization is <70%.
Could you provide some details on your optimizations? Is the optimized verilog code available somewhere?
The code isn't on the public repo yet, no, because it's a work in progress and I haven't taken the time to put it up yet. It's on the list of things to do
There are four classes of optimizations:
1) The last 3 rounds the second SHA-256 pass are not needed. You only need to check that Round64.H is equal to 0, and the last three rounds do not affect H. Here's the math:
// Round numbers are 1 based, so we go from Round1 to Round64
Round61.E = Round60.D + Round60.H + s1 + ch + k[61] + w[61]
Round62.F = Round61.E
Round63.G = Round62.F
Round64.H = Round63.G
Let's simplify:
32'h00000000 == Round64.H + InitialState.H == Round63.G == Round62.F == Round61.E == Round60.D + Round60.H + s1 + ch + k[61] + w[61]
32'h00000000 == Round60.D + Round60.H + s1 + ch + k[61] + w[61] + InitialState.H
32'h00000000 == Round60.D + Round60.H + s1 + ch + 32'h90befffa + w[61] + 32'h5be0cd19 //K and InitialState are known
32'h136032ED == Round60.D + Round60.H + s1 + ch + w[61]
That allows you to remove 3 full rounds and 9 adders.
2) Pre calculating the first few stages of the first SHA-256 pass is also possible. The first pass is calculated from the 512-bit DATA string that getwork requests give us, and the MIDSTATE. Between pieces of work, all of that data remains constant, except for the nonce, which we are increasing. The nonce is at W[3] (0 based), and W[3] isn't used until Round4. So, you can run the first 3 rounds of SHA-256 on a controller (PC, microcontroller, whatever) before handing the "work" to the FPGA. The FPGA then picks up where you left off and never has to calculate those first 3 rounds.
That amounts to giving the FPGA Midstate, Data, and Midstate', where Midstate' is the 256-bit state as of your pre-calced Round3.
3) Quite a large amount of W can be pre-calced off the FPGA as well. Again, Data is constant except for the 4th word which is the nonce. I won't go into the math here, because it's rather long. The first 16 values of W are sourced directly from Data, so that's a no-brainer. However, also note that all of Data after the first 4 words (after the nonce) is constant for all getwork requests! So you can actually hard code those values as they are in the code on the public repo. What that code doesn't do, however, is just add K with the known W values, thus saving you an adder for the first 16 rounds.
After the first 16 rounds, everything up to round 35 (I think) has some amount of W that can be pre-calculated off the FPGA and given to it with the rest of the work data.
4) The final optimization is this:
t1 := h + s1 + ch + k[i] + w[i]
Having to add K and W makes the adder tree larger. K is a constant and always known. W can be calculated at any point before a given round. H is also known at least one round ahead of time (as you've learned in the first optimization). So it's possible to do this calculation in the previous round:
pre-t1 = g + k[i+1] + w[i+1]
and then in the next round:
t1 := pre-t1 + s1 + ch
That shrinks the critical path down, allowing for higher clock rates. Note that I haven't implemented that particular optimization, so double check that I did my math right.