I did not look at the code, but maybe you can clarify for me how this particular approach scales?
As mentioned, it scales linearly, but only in integer multiples. So an 80K LE device can get 80MH/s. A 160K LE device can get 160MH/s. A 240K LE device can get 240MH/s. There is nothing in between, at least not without a different design.
Note that the design in the repo is not optimized and so uses something like 90K LEs. An optimized design that fits into devices with as few as 80K LEs will be released once I've finished it.
Correct. The EP4SGX230 flavor would get at least 160MH/s. The EP4SGX530 would get at least 480MH/s.
I say "at least" because, as far as I understand, the Stratix series of devices has better timing than the Cyclone series, and so will support a much faster clock. If they are, for example, twice as fast, then you can expect 960MH/s out of the EP4SGX530. However, I don't know for sure what clock rate they can achieve with the mining core.
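To make the arithmetic behind those numbers explicit, here is a rough sketch in Python. It assumes one core needs about 80K LEs and gives about 80MH/s (the figures for the optimized design), and it takes the nominal LE count from the part name, which is an approximation. The `clock_factor` parameter is hypothetical and only models the "what if Stratix clocks twice as fast" guess; it is not a measured value.

```python
# Assumption: one mining core needs ~80K LEs and yields ~80 MH/s.
LES_PER_CORE = 80_000
MHS_PER_CORE = 80


def estimated_mhs(logic_elements: int, clock_factor: float = 1.0) -> float:
    """Estimate hash rate: whole cores only, scaled by a guessed clock factor."""
    cores = logic_elements // LES_PER_CORE  # no partial cores, hence integer steps
    return cores * MHS_PER_CORE * clock_factor


if __name__ == "__main__":
    # Nominal LE counts read off the part names (approximate).
    for name, les in [("80K device", 80_000),
                      ("EP4SGX230", 230_000),
                      ("EP4SGX530", 530_000)]:
        print(name, estimated_mhs(les), "MH/s")
```

The integer division is the reason there is no in-between: a 150K LE device still only fits one core.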
In the FPGA though, you have to implement something to communicate with bitcoind and, if using multiple devices, communicate with each one of them, right?
The FPGA requires a controller, and so is really just a dumb processor like the GPU. It performs the hashing algorithm, and that's about it. Like a GPU, there is a small memory space inside the FPGA that a controller must write the work to (through some external interface like SPI), and a memory space where results (valid hashes) must be read from.
The controller gives the FPGA a 256-bit Midstate, and 512-bit Data (which are acquired through a getwork request from bitcoind or a mining pool). The FPGA then proceeds to process all 2^32 variations and return any nonces that result in a valid hash. In that sense, it's exactly like a GPU where you give it data, and tell it to run 2^32 instances of the kernel.
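As a software reference for what the FPGA does with one unit of work, here is a minimal Python sketch. It is not the FPGA design itself. The midstate is the SHA-256 state after the first 64 bytes of the block header, so the hardware only has to process the remainder; `hashlib` does not expose that internal state, so this sketch simply hashes the full 80-byte header each time. The function name `scan_nonces`, the 76-byte `header_prefix` layout, and the `start`/`count` parameters are assumptions made for illustration, and the byte-order quirks of the getwork format are ignored.

```python
import hashlib
import struct


def scan_nonces(header_prefix: bytes, target: int,
                start: int = 0, count: int = 2**32) -> list:
    """Return every nonce in [start, start + count) that gives a valid hash.

    header_prefix: the first 76 bytes of the block header (everything but the nonce).
    target: 256-bit integer; a hash is valid when it is <= target.
    """
    if len(header_prefix) != 76:
        raise ValueError("header_prefix must be 76 bytes")
    hits = []
    for nonce in range(start, min(start + count, 2**32)):
        header = header_prefix + struct.pack("<I", nonce)  # 32-bit little-endian nonce
        digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
        # Bitcoin compares the double SHA-256 as a little-endian 256-bit integer.
        if int.from_bytes(digest, "little") <= target:
            hits.append(nonce)
    return hits
```

In pure Python the full 2^32 range would take far too long to be useful, so pass a small `count` when experimenting. The FPGA does the same loop, just with one nonce per clock cycle per core.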
The controller can be anything. A microcontroller like an Arduino, a microprocessor like an ARM, or even an entire PC like the one you're reading this post on.
How would one scale this to multiple FPGAs? Some communication between devices would be needed, or will there be a full TCP/IP stack communicating with bitcoin on each one?
As I said, there are many approaches. One approach is to have the controller make a getwork request for each FPGA, so each FPGA gets its own data to work on and cycles through all 2^32 nonces. This has the benefit of scaling easily, and it does not require traces on the board between the FPGAs (which would need to support high-frequency data transfers). The FPGAs can just be put on a single bus, like I2C or SPI, and controlled by a single microcontroller or microprocessor (possibly embedded in one of the FPGAs).
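A rough sketch of that one-getwork-per-FPGA approach, written in Python for readability. Everything here is hypothetical: `FakeBus` stands in for whatever SPI or I2C driver the controller really uses, the `write_work`/`read_results` methods are invented for this example, and `get_work` is a placeholder for the actual getwork request to bitcoind or a pool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Work:
    midstate: bytes  # 256-bit midstate from getwork
    data: bytes      # 512-bit data from getwork


class FakeBus:
    """Hypothetical stand-in for a shared SPI/I2C bus with addressed FPGAs."""

    def __init__(self, n_devices: int) -> None:
        self.work: Dict[int, Work] = {}
        self.results: Dict[int, List[int]] = {a: [] for a in range(n_devices)}

    def write_work(self, addr: int, work: Work) -> None:
        self.work[addr] = work  # real hardware: write to the FPGA's work memory

    def read_results(self, addr: int) -> List[int]:
        found, self.results[addr] = self.results[addr], []  # read and clear
        return found


def dispatch(bus: FakeBus, addrs: List[int],
             get_work: Callable[[], Work]) -> Dict[int, Work]:
    """Give every FPGA its own getwork result; return what was assigned."""
    assigned = {}
    for addr in addrs:
        assigned[addr] = get_work()  # one request per device, no sharing
        bus.write_work(addr, assigned[addr])
    return assigned


def collect(bus: FakeBus, assigned: Dict[int, Work]) -> List[Tuple[int, Work, int]]:
    """Poll each FPGA and pair any returned nonce with the work it belongs to."""
    return [(addr, work, nonce)
            for addr, work in assigned.items()
            for nonce in bus.read_results(addr)]
```

Because each device works on independent data, the FPGAs never need to talk to each other; the controller just calls `dispatch` when work runs out and `collect` on a timer, then submits whatever comes back.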