I still believe I can get a big performance jump over the existing code. I will try to work as close to the gate level as possible and use all the logic available there. I have even been looking through the specs and schematics to see how the slices work on the Spartan-6.
It's a lot of fun down there! It's a shame the Spartan-6 architecture is so limited. I suggest you take a look at the 7-series FPGAs, like the Kintex or Artix. The architecture is nicer, and performance is much higher. For example, I was able to implement a miner using the DSP48E1s on a Kintex.
Also, have you looked at bitfury's code? He has the most performant code for Spartan-6 LX150 chips, and I would be shocked if anyone beat his record (in MH/s) on that chip. It's optimized down at the slice level and manually placed. https://bitcointalk.org/index.php?topic=228677.msg2417706#msg2417706
Unfortunately, or fortunately (depending on how you look at it), FPGAs will never beat ASICs in terms of performance per dollar or performance per watt. So FPGA mining is a curiosity and a plan-B sort of thing now.
It is certainly possible to implement on an FPGA.
I coded up a quick SL3 cracker about a year ago. It ran on either my Spartan-6 devkit or the X6500; I can't recall which. I could probably dump the code to GitHub if people are interested. I didn't optimize it particularly well, just got it working.