If you've never done hardware design before, this might be a bit confusing, but everything that's written in that HDL file will happen in parallel, not sequentially as it would be in most programming languages.
In the loop, the previous state and data are copied to state_buf/data_buf. In parallel, the (old) contents of state_buf/data_buf are used to calculate the next state/data value.
Because of this, it will take two clock cycles for the values from S[i-1].state to propagate (through state_buf) to state. The generate loop basically just duplicates that code 64 times, but has no effect on "execution order", if there even is such a thing in HDL.
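If it helps to see the two-cycle delay outside of HDL, here's a rough Python model of it. It's only a sketch: `round_fn` is a made-up stand-in for the real round logic, the dict-based stages and the function names are my own, and only state_buf/state come from the code being discussed. The point is that every register reads the OLD values on a clock edge and all of them update together, which is what non-blocking assignments give you.

```python
def round_fn(value):
    """Hypothetical placeholder for one round of logic (not real SHA-256)."""
    return None if value is None else value + 1


def clock_edge(stages, new_input):
    """Apply one rising clock edge to all stages at the same time."""
    next_stages = []
    for i, stage in enumerate(stages):
        # Every right-hand side is read from the OLD register values.
        prev_state = new_input if i == 0 else stages[i - 1]["state"]
        next_stages.append({
            "state_buf": prev_state,                  # state_buf <= S[i-1].state
            "state": round_fn(stage["state_buf"]),    # state <= f(old state_buf)
        })
    # Returning a fresh list models all registers updating together.
    return next_stages


def latency(n_stages, first_input=100):
    """Count clock edges until the last stage's state holds valid data."""
    stages = [{"state_buf": None, "state": None} for _ in range(n_stages)]
    edges = 0
    while stages[-1]["state"] is None:
        stages = clock_edge(stages, first_input)
        edges += 1
    return edges


if __name__ == "__main__":
    print(latency(1), latency(3), latency(64))
```

Each stage costs two edges in this model (one to load state_buf, one to load state), so the order in which the loop visits the stages makes no difference to the result.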
Thanks! I've done hardware design before, but only very simple circuits in HDL. I've always been an oldschool schematic/block diagram guy, did some HDL way back, but only simple stuff, then haven't touched it since. So getting back into it now. As I said I've been writing my own bitcoin mining core, which can be hopefully synthesized for multiple boards and wrapped in whatever PC comms layer we want. But it's slow going lol...
Thanks for pointing that out, I had missed that double-stage assignment. That's what I was looking for and just wasn't seeing. (I do know how blocking versus non-blocking assignments work though lol)
Thanks for taking the time to answer that.
Interesting that so far my design doesn't have this extra stage in it. My SHA core is done, and I'm just building the testbenches for it now to validate it. My SHA core (I believe) runs in 64 clock cycles (probably 65-66 due to initial loading logic, I'll have to double-check). This is purely un-optimized right now; for now I'm just getting a working SHA core and then building a bitcoin core, and finally I'll go back and tune/optimize. Right now I'm at about 50% utilization on an LX75, but the delay on my critical path is high, 11ns, so I'm limited to just under 100MHz. I'm getting one SHA hash per clock. So if I can get that to 100MHz initially, and can cram 4 of these cores into an LX150, I can get 2 bitcoin hashes per clock at 100MHz. We'll see how it validates on the testbenches though, and if I'm able to optimize it (and how well). I'm hoping to open-source this, but want to get it to a working state first (a little embarrassed to release it in its current state lol). Then hopefully the community can optimize further. I'll probably just release it and put up a donation address or something.
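For anyone who wants to check the numbers, here's the back-of-the-envelope math as a small Python sketch. The function names are made up, and it assumes each core really does deliver one SHA-256 result per clock and that one bitcoin hash costs two SHA-256 passes (which is where "4 cores = 2 bitcoin hashes per clock" comes from).

```python
def max_clock_mhz(critical_path_ns):
    """Upper bound on clock frequency for a given critical-path delay."""
    return 1000.0 / critical_path_ns


def bitcoin_hashrate_mhs(sha_cores, clock_mhz, sha_per_bitcoin_hash=2):
    """Bitcoin hashes per second (in MH/s), assuming one SHA result per core per clock."""
    return sha_cores / sha_per_bitcoin_hash * clock_mhz


if __name__ == "__main__":
    print(round(max_clock_mhz(11.0), 1))      # clock limit for an 11ns critical path
    print(bitcoin_hashrate_mhs(4, 100.0))     # 4 cores on an LX150 at 100MHz
```

So an 11ns critical path caps the clock at about 91MHz, and four cores at the 100MHz target would work out to 200 MH/s under those assumptions.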
I know the LX75 is over-constraining and tends to screw up routing so I'm targeting it first as a "stress test", then once I get the design working on that I'll move it to the LX150 and see how it goes.
Also, my design is a "clean room" implementation of SHA256. I have gotten "tips" from a few people on the forums here for optimization methods, though. And I have looked over the ZTex code of course, but frankly I found it hard to read in places. So I figured re-implementing it would be a good learning experience to get my Verilog skills polished up anyway. I wrote it directly from the SHA2 spec without having the ZTex code open. (It's as cleanroom as you can get these days lol.)
Right now I'm running into issues with the Xilinx simulator. It's being bitchy about simulating my code (even though it synthesizes fine), which is why I haven't completed a testbench sim of it yet. Also getting a lot of warnings about unconnected nets in synth, but that's because the top level module (bitcoin hashing core) isn't done yet. Just the lower level SHA core.