Finished adding the other half of the block header, which wasn't shown earlier, to the test output. I also put in the final XOR with input block stage of Blake-256 that I forgot about when testing, so that it is now a full Decred hashing process that will work, and ported the tests to match. I also changed the protocol to something more Icarus-like in preparation for CGMiner support, and changed some things in the core transform, which let me drop slices and make the design smaller.
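For readers unfamiliar with the final XOR stage mentioned above, here is a minimal Python sketch of the BLAKE-256 feed-forward step as described in the published BLAKE specification, where the two halves of the 16-word working state are XORed with the incoming chaining value and the salt. The function name `blake256_finalize` and its argument layout are assumptions made for illustration; this is not the HDL from the design discussed here.

```python
MASK32 = 0xFFFFFFFF  # BLAKE-256 operates on 32-bit words


def blake256_finalize(h, s, v):
    """Feed-forward at the end of one BLAKE-256 compression.

    h: 8-word incoming chaining value
    s: 4-word salt (all zero when no salt is used)
    v: 16-word working state after the rounds
    Returns the new 8-word chaining value:
        h'[i] = h[i] ^ s[i % 4] ^ v[i] ^ v[i + 8]
    """
    if len(h) != 8 or len(s) != 4 or len(v) != 16:
        raise ValueError("expected 8 chaining words, 4 salt words, 16 state words")
    return [(h[i] ^ s[i % 4] ^ v[i] ^ v[i + 8]) & MASK32 for i in range(8)]


# Toy values chosen only to make the XOR easy to follow by hand.
h = list(range(8))
s = [0, 0, 0, 0]
v = list(range(16))
new_h = blake256_finalize(h, s, v)
```

Leaving this step out still produces a deterministic value, which is why it is easy to miss in tests that only compare against one's own earlier output rather than a reference hash.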
Previously, slice usage was ~1,300 out of the LX9's 1,430. The latest synthesis dropped it to 1,205, but I don't think that is enough free logic yet to let me speed it up.
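To put those slice counts in terms of headroom, a quick Python calculation using only the figures quoted above (the ~1,300 figure is approximate, so the first row is approximate too):

```python
TOTAL_SLICES = 1430  # LX9 slice count quoted in the post


def utilization(used, total=TOTAL_SLICES):
    """Fraction of the device's slices in use."""
    return used / total


for label, used in (("before", 1300), ("after", 1205)):
    free = TOTAL_SLICES - used
    print(f"{label}: {used} used, {free} free, {utilization(used):.1%} utilized")
```

That is roughly 130 free slices before and 225 after, so the change frees up about 95 slices.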
Sounds promising. Do you expect speeds similar to other Blake-algo coins on FPGA devices? I am also curious whether you are using a more modern, readily available FPGA device.