No problem, it wasn't the most brilliant question.
Anyway, here are the stats we have cooking. No screenshots yet; this project is requiring more math than I would have liked. One of the ways I'm optimizing the mining is checking late-round values against (to be determined, but known) constants to determine whether or not they will (or are likely to) yield a win. If not, the SHA algorithm aborts early, saving resources. It's gonna be a lot of work, and it's a big part of how we will be shrinking approximately 60 MH/s (number based on more recent data) onto a Cyclone IV Nano. The work begins today.
SHA-2 hashes are unpredictable at 128 rounds, or 64 rounds, but if we have access to the data all the way through, and know what our starting and end data should look like, we can side-channel it. I'm speaking to our school's cryptanalysis expert.
The above may sound like heresy, but block ciphers are weakened by attacking their implementation, and we have full access to this one. I'm going to keep working on all fronts of optimization.
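For anyone wondering what a late-round check could look like, here is a minimal Python sketch of one well-known form of early abort. It is not necessarily the check I will end up with. It relies on the fact that the e value after round 61 is simply copied down (e -> f -> g -> h) and becomes the last word of the digest after round 64, so the last digest word can be tested three rounds early. The helper names (first_primes, icbrt, schedule, rounds, last_word_early) are made up for this illustration, and the constants are derived from the primes rather than typed in.

```python
import math
import struct

MASK = 0xFFFFFFFF

def first_primes(n):
    """Return the first n primes by trial division (fine for n = 64)."""
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes

def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
# Initial hash values: fractional parts of the square roots of the first 8 primes.
IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
# Round constants: fractional parts of the cube roots of the first 64 primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    """Expand one 64-byte block into the 64-word message schedule."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def rounds(state, w, n):
    """Run the first n rounds of the compression function on state."""
    a, b, c, d, e, f, g, h = state
    for i in range(n):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

def last_word_early(state, w):
    """Last digest word, computed after 61 rounds instead of 64."""
    e61 = rounds(state, w, 61)[4]
    return (state[7] + e61) & MASK
```

In a miner, the idea would be to compare last_word_early(...) for the second hash against zero and drop the nonce if it does not match, since any digest that meets a difficulty of 1 or more has a zero last word. Note that this only skips three of the 64 rounds.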
Donation-wise, about 45 BTC and $50 USD/CAD have been donated, which let me buy a couple of DE0 Nanos from our friends at Terasic and paid for a bit of the countless hours I've been pouring into learning all of this. Hopefully that's something you're all happy with.
wondermine, I wish you the best. I really do.
However, please take a look at the SHA-256 algorithm.
http://en.wikipedia.org/wiki/Sha-256
The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively.
The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it).
The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c - i.e. 7/8th of the previous round's 256 bits are used to derive it.
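To make the data flow concrete, here is a minimal Python sketch of a single round as described above. The function name sha256_round and the way the round constant k and message word w are passed in are choices made for this illustration only.

```python
MASK = 0xFFFFFFFF  # keep every value within 32 bits

def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, k, w):
    """One SHA-256 round: state is (a, b, c, d, e, f, g, h)."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    t1 = (h + s1 + ch + k + w) & MASK      # uses h, e, f, g
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK                 # uses a, b, c
    new_a = (t1 + t2) & MASK               # depends on h, e, f, g, a, b, c
    new_e = (d + t1) & MASK                # depends on d, h, e, f, g
    # b, c, d, f, g, h are plain copies of the old a, b, c, e, f, g
    return (new_a, a, b, c, new_e, e, f, g)
```

Only new_a and new_e involve any arithmetic; the other six words are wiring.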
Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round.
The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round.
And so on.
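After just 4 rounds, a single bit change in the h word of the great-great-grandfather round has reached all eight 32 bit words of the current round; a change in the a word needs 7 rounds, because it must first be copied down through b, c, and d before it can touch e.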
Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream.
In other words, speeding up an implementation of SHA-256 cannot be done by mathematical tricks.
Rather, the operations of each round should be optimized.
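The diffusion argument above can be checked with a small experiment. This Python sketch (the names touched_words and sha256_round are made up for the illustration) runs two copies of a random state that differ in a single bit through the same rounds and records which of the eight words differ after each round. The per-round constant plus message word is replaced by a random 32 bit value, which is an assumption that does not affect how differences spread.

```python
import random

MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, kw):
    """One SHA-256 round; kw stands in for K[i] + W[i]."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    t1 = (h + s1 + ch + kw) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    return [(t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g]

def touched_words(flip_word, flip_bit, n_rounds, seed=1):
    """For each round, list the indices (0 = a ... 7 = h) of words that differ."""
    rng = random.Random(seed)
    s1 = [rng.getrandbits(32) for _ in range(8)]
    s2 = list(s1)
    s2[flip_word] ^= 1 << flip_bit   # the single-bit change
    history = []
    for _ in range(n_rounds):
        kw = rng.getrandbits(32)
        s1 = sha256_round(s1, kw)
        s2 = sha256_round(s2, kw)
        history.append([i for i in range(8) if s1[i] != s2[i]])
    return history

if __name__ == "__main__":
    for word in (7, 0):
        print("flip in word", word, touched_words(word, 0, 8))
```

A flip in h shows up in a and e after one round and has been copied into d and h by round 4. A flip in a cannot reach h before round 7, because e is untouched until the changed value has been copied down into d.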
There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations, such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technology-like delay. Certainly we can do better than that?!?
Analyzing the operations for their contribution to the delay yields:
rightrotate ... instant, no delay at all
xor ... minor delay, bit by bit, probably a few dozen picoseconds
and ... minor delay, bit by bit, probably a few dozen picoseconds
not ... minor delay, bit by bit, probably a few dozen picoseconds
+ ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued.
Adding insult to injury, SHA-256 features not just binary or ternary adds, but 4-fold adds (five operands, in the t1 function), 5-fold adds (e := d + t1) and 6-fold adds (a := t1 + t2, seven operands once t1 and t2 are written out).
So, there you go. The biggest detriment to performance is probably the 6-fold 32 bit wide add in a := t1 + t2.
If you can speed this operation up, maybe by pre-computing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA-256 will be sped up (assuming optimal placing and routing).
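As a rough illustration of what pre-computing in the previous round could mean, here is a Python sketch (a software model, not HDL, and only one possible arrangement). It uses the fact that the next round's h is the current g and the next round's d is the current c, while the constant and message word are known ahead of time, so h + k + w and d + h + k + w for the next round can be summed one round early. The names round_plain, round_precomputed, pre_h and pre_dh are invented for this sketch, and kw stands for K[i] + W[i].

```python
MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def mix(a, b, c, e, f, g):
    """Return (s1 + ch, s0 + maj), the parts that cannot be precomputed."""
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    return (s1 + ch) & MASK, (s0 + maj) & MASK

def round_plain(state, kw):
    """Reference round: a needs 6 additions, e needs 5."""
    a, b, c, d, e, f, g, h = state
    t1_mix, t2 = mix(a, b, c, e, f, g)
    t1 = (h + t1_mix + kw) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def round_precomputed(state, pre_h, pre_dh, kw_next):
    """Same round, but h + kw and d + h + kw arrive already summed."""
    a, b, c, d, e, f, g, h = state
    t1_mix, t2 = mix(a, b, c, e, f, g)
    new_a = (pre_h + t1_mix + t2) & MASK   # 4 additions on the critical path
    new_e = (pre_dh + t1_mix) & MASK       # 2 additions on the critical path
    # Off the critical path: prepare the partial sums for the NEXT round,
    # whose h will be the current g and whose d will be the current c.
    next_pre_h = (g + kw_next) & MASK
    next_pre_dh = (c + next_pre_h) & MASK
    return (new_a, a, b, c, new_e, e, f, g), next_pre_h, next_pre_dh

def run_plain(state, kws):
    for kw in kws:
        state = round_plain(state, kw)
    return state

def run_precomputed(state, kws):
    pre_h = (state[7] + kws[0]) & MASK
    pre_dh = (state[3] + pre_h) & MASK
    for i in range(len(kws)):
        kw_next = kws[i + 1] if i + 1 < len(kws) else 0
        state, pre_h, pre_dh = round_precomputed(state, pre_h, pre_dh, kw_next)
    return state
```

Both versions produce identical states; the difference is only in how many additions sit between the registers of one round. Whether that translates into a higher clock on a given FPGA still depends on placing and routing.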