Yeah, I would definitely pipeline the data. A long combinational path forces a slow clock. This is where the fun stuff happens: you have to figure out how many stages is best, and adding memory makes it more of a challenge. I wonder if you could run the RAM at 2x the clock and use 0.5 Mbit instead of 1 Mbit. That would save some space, plus you would learn how to deal with different clock gearing. But I don’t know whether that is possible; you may need the full 1 Mbit.
No worries on the code. I am glad to see it is written in Verilog (more personal preference than anything). We all have to start somewhere.
My live code for my DE0-Nano board actually uses a 0.5 Mbit scratchpad (the EP4CE22 chip only has 600 kbit of RAM), so I have to interpolate the missing half of the scratchpad (basically one extra pass through the salsa-mix for half the addresses). I can't really see how to parallelise this, as each RAM read address depends on the result of the prior salsa-mix (scrypt was explicitly designed to be awkward to parallelise). From some reading I've seen that a larger scratchpad, e.g. 8 Mbit, can speed up scrypt, which I don't quite understand yet, so once I've read up some more on it perhaps some tricks will become apparent (yeah, I just wanted to get something running quickly, so I coded a direct analog of the cgminer CPU scrypt.c code rather than doing my research first).
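For anyone following along, the half-size scratchpad idea can be sketched in a few lines of Python. This is only a toy model, not the Verilog running on the board: `mix` is a SHA-256 stand-in for the real salsa-mix, the entries are 32 bytes rather than scrypt's 1024 bits, and the names `romix_full`, `romix_half` and `address` are made up for illustration. It only shows why storing every second entry still gives the same result, at the cost of one extra mix for odd addresses.

```python
import hashlib

N = 1024  # number of scratchpad entries in this toy model


def mix(x: bytes) -> bytes:
    """Toy stand-in for the salsa-mix (NOT the real Salsa20/8 block mix)."""
    return hashlib.sha256(x).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length states."""
    return bytes(p ^ q for p, q in zip(a, b))


def address(x: bytes, n: int) -> int:
    """Derive the next read address from the current state (toy version)."""
    return int.from_bytes(x[:4], "little") % n


def romix_full(x: bytes, n: int = N) -> bytes:
    """Reference version: store all n entries."""
    v = []
    for _ in range(n):          # fill phase: V[i+1] = mix(V[i])
        v.append(x)
        x = mix(x)
    for _ in range(n):          # read phase: address depends on current state
        j = address(x, n)
        x = mix(xor(x, v[j]))
    return x


def romix_half(x: bytes, n: int = N) -> bytes:
    """Half-size version: store only even entries, recompute odd ones."""
    v = []
    for i in range(n):
        if i % 2 == 0:          # keep V[0], V[2], V[4], ...
            v.append(x)
        x = mix(x)
    for _ in range(n):
        j = address(x, n)
        vj = v[j // 2]          # nearest stored entry at or below j
        if j % 2 == 1:          # odd address: V[j] = mix(V[j-1])
            vj = mix(vj)
        x = mix(xor(x, vj))
    return x


if __name__ == "__main__":
    seed = b"example input block"
    print(romix_full(seed) == romix_half(seed))
```

Note that the extra `mix` for an odd address sits in front of the normal one on the same dependent chain, so it adds latency rather than something that can be done in parallel.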
Pipelining the salsa-mix is definitely an issue, but it's tricky because each scratchpad read address is generated by the prior salsa-mix. Adding extra register stages allows a faster clock, but each mix then needs more clock cycles to complete, which cancels out the gain (and the pipelining itself gives no gain, because the address dependency means the next mix cannot start until the current one has finished). Once I understand the algorithm better, perhaps I can come up with a solution (I'll need to take a look at the CUDA code to see what's done on the GPUs).
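To put rough numbers on that cancellation, here is a small back-of-envelope model in Python. Everything in it is an assumption for illustration: the function name `time_per_hash_ns`, the 2048 dependent steps, the 40 ns combinational delay and the 1 ns per-stage register overhead are not measurements from the board. It assumes a single chain of dependent mixes, so only one pipeline stage is ever doing useful work.

```python
def time_per_hash_ns(dependent_steps: int,
                     stages: int,
                     comb_delay_ns: float,
                     reg_overhead_ns: float = 0.0) -> float:
    """Time for one chain of dependent mixes through a pipelined mix unit.

    dependent_steps: mixes that must run strictly one after another
    stages:          number of pipeline stages the mix logic is split into
    comb_delay_ns:   total combinational delay of the unsplit mix logic
    reg_overhead_ns: assumed extra delay per stage (setup, clock-to-out)
    """
    clock_period = comb_delay_ns / stages + reg_overhead_ns
    # Each mix needs `stages` cycles, and the next one cannot start
    # until its address is known, so the cycles simply add up.
    cycles = dependent_steps * stages
    return cycles * clock_period


if __name__ == "__main__":
    for stages in (1, 2, 4, 8):
        ideal = time_per_hash_ns(2048, stages, 40.0)
        real = time_per_hash_ns(2048, stages, 40.0, reg_overhead_ns=1.0)
        print(stages, ideal, real)
```

With zero register overhead the time per hash comes out the same for any number of stages, and with any overhead the deeper pipeline is slightly slower, which matches what I described above.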
Thanks for the kind words, and good luck.