The optimal solution will be, depending on the layout of the chip, partly unrolled partly not, depending on which place inside the chip the operation takes place. It isn't really proficient to even talk about unrolling since depending on the amount of elements which can be used for memory, logic and routing a particular approach will be best.
Of course computing a solution which fits this paradigm would probably be too much, however if the tools permit programming the FPGA on a low level it should be at least possible to use close to 100% of all resources for something useful.
As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.
Not if the looping is free. If there are abundant local resources at a specific place because of chip layout constraints the penalty used for looping and memory would decrease up to the point where it would be more efficient. But to be honest I have no idea how this should be implemented and don't even know if it is at all possible with available tools.
Nevertheless I think it is worth a thought and if FPGAs prevail over a long period as the status quo something will come up. I am certain of that.
Another speculative note: the I/O ports of the FPGA might eventually be used to obtain additional routing, this would of course restrict the layout but it should be worth it for cases where a slow rate data stream would consume unnecessary resources if routed over wide distances.