IIDX,
The addressing would be constant, so no decoding would be needed. The address inputs would simply be tied off to constants.
The 2.0 ns is the clk-to-out time for a data output. Since all outputs are in parallel (each BRAM configured as x72 and grouped together to give very wide access), the individual BRAM bit delay would not change. No demuxing of outputs would be necessary.
The number of BRAMs needed is only half what you show, since you can use both sides (Port A & Port B) independently (assign each side a fixed, but different address).
You are right, though, about getting the data from the BRAMs to the LUTs needed for the computation: there is a routing delay, which is probably too large.
Obviously this is not the optimum solution; I am only bringing it up as a last resort if the available flip-flops have run out.
Regards,
ihtfp
I think the problem is that linking 11 BRAMs together requires a lot of LUTs for address decode/routing, since the BRAMs are arranged in columns throughout the chip. Plus, linking 11 together would probably result in a minimum period much higher than 2.0 ns (I think 2.0 ns is for a single BRAM).
So, you would need 128 (hashers) * 11 (BRAMs) for one pipeline stage = 1408 total BRAMs. Of course, you're not suggesting you use BRAM for all the delay. However, I think the slices you would sacrifice to connect the BRAMs and create their address logic would be more expensive than just using the built in FFs or DMEMs (plus the speed hit).
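A quick sanity check of that count, as a rough sketch only. The 128 hashers and 11 BRAMs per stage come from this post, the 264 available BRAMs is the LX130T figure quoted elsewhere in the thread, and the dual-port halving is the one suggested in the reply above. The function name is made up for illustration.

```python
HASHERS = 128          # hasher count from this post
BRAMS_PER_STAGE = 11   # BRAMs ganged together for one pipeline stage
BRAMS_AVAILABLE = 264  # LX130T BRAM count quoted in the thread


def brams_needed(hashers, per_stage, use_both_ports=False):
    """Total BRAMs if every hasher gets one BRAM-backed delay stage.

    With use_both_ports=True, Port A and Port B each serve a different
    stage, so the total is halved.
    """
    total = hashers * per_stage
    return total // 2 if use_both_ports else total


single_port = brams_needed(HASHERS, BRAMS_PER_STAGE)
dual_port = brams_needed(HASHERS, BRAMS_PER_STAGE, use_both_ports=True)

print(single_port, dual_port, BRAMS_AVAILABLE)
```

Either way the requirement is well above what the device has, so BRAM could only cover part of the delay.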
I'm hoping that by floorplanning each hashing module I can reach higher clock speeds. Currently the logic delay I am facing is only around 2.0 ns, with the routes taking the rest. So with some nice routing I would hopefully meet my target.
The V6LX130 isn't even as big as the S6 150, but at least it has DSP48s.
I may also need to cut down the PCIe link from 4x to 1x and reduce its performance settings to regain some of the space that is being used up.
IIDX
Looks good! I tried to do the same thing on a V6 LX130T (use almost all DSPs and pipeline the rest of the LUT adders), but there aren't enough registers in that device for the tx_w and tx_state delays... so many 512- and 256-bit registers.
If you are short on flip-flops, have you considered using the BRAMs? You would need 11 primitives (there are 264 in the LX130T) to make a 792-bit-wide memory. You can set the BRAM to 'write first' mode, which will echo the data to the output. The clk-to-out for unpipelined BRAM is ~2.0 ns, which is slower than an FF.
Since the BRAMs are dual port, you can use both sides of the memory (with different locked addresses). That gives you enough storage for 48 stages of a fully unrolled algorithm.
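The numbers above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each stage has to delay tx_w (512 bits) plus tx_state (256 bits), which is my reading of the register sizes mentioned earlier in the thread, and the helper names are made up for illustration.

```python
import math

BRAM_PORT_WIDTH = 72    # bits per BRAM port when configured as x72
BRAMS_IN_LX130T = 264   # primitive count quoted above
PORTS_PER_BRAM = 2      # Port A and Port B, each locked to its own address

# Assumption: one stage stores tx_w (512 bits) plus tx_state (256 bits).
STAGE_BITS = 512 + 256


def brams_per_stage(stage_bits, port_width=BRAM_PORT_WIDTH):
    """BRAMs that must be placed side by side to hold one stage."""
    return math.ceil(stage_bits / port_width)


def max_stages(total_brams, per_stage, ports=PORTS_PER_BRAM):
    """Stages that fit when each port of a BRAM group serves one stage."""
    return (total_brams // per_stage) * ports


per_stage = brams_per_stage(STAGE_BITS)
memory_width = per_stage * BRAM_PORT_WIDTH
stages = max_stages(BRAMS_IN_LX130T, per_stage)

print(per_stage, memory_width, stages)  # prints: 11 792 48
```

Under that assumption, 768 bits rounds up to 11 BRAMs (792 bits wide), and 264 / 11 = 24 groups with two ports each gives the 48 stages.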
I've never tried this, but was just thinking of how to make use of all the unused BRAM lying around. I usually run out of LUTs, but I need to rethink whether this is worthwhile with the DSP48 implementation.