Regarding Performance Optimizing Spartan-6 LX150:
DANGER: Long detailed post coming... sorry, I hope the information is useful though.

I'm working with my LX150_makomk_speed_Test project, where I'm trying to nail down the performance bottlenecks and remove them. I'm learning FPGA Editor so I can better visualize what the router is doing, and I've read through some of the Spartan-6 UGs to get a better understanding of the architecture.
First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3-component, 32-bit addition in 8 chained slices, with 4 bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:
W: 16 slices + 0 slices = 16
tx_t1_part: 8 slices + 0 slices = 8
t1: 8 slices + 0 slices = 8
tx_state[0]: 8 slices + 8 slices = 16
tx_state[4]: 8 slices + 8 slices = 16
The worst critical paths are only 16 slices long, with a single break in the carry chain (AFAIK). W is a 4-way add, performed as a 3-way add of the first 3 components, then a 2-way add of that result with the remaining component. tx_state[4] is a 2-way add of t1 and rx_state[3].
I haven't fully analyzed the router's behavior on the 2-way adds yet, but it appears to include work from other operations ... somehow. Not sure yet.
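For anyone following along, here's a sketch of the structure I'm describing for W. This is the standard SHA-256 message schedule, written the way I'd expect it to map onto the carry chains; the module and signal names are mine for illustration, not the actual identifiers in the project:

```verilog
// Illustrative sketch only -- not the actual code from LX150_makomk_speed_Test.
// SHA-256 message schedule: W[t] = s1(W[t-2]) + W[t-7] + s0(W[t-15]) + W[t-16]
module w_expand (
    input  [31:0] w2, w7, w15, w16,   // W[t-2], W[t-7], W[t-15], W[t-16]
    output [31:0] new_w
);
    // s0 = ROTR7 ^ ROTR18 ^ SHR3, s1 = ROTR17 ^ ROTR19 ^ SHR10
    wire [31:0] s0 = {w15[6:0],  w15[31:7]}  ^ {w15[17:0], w15[31:18]} ^ (w15 >> 3);
    wire [31:0] s1 = {w2[16:0],  w2[31:17]}  ^ {w2[18:0],  w2[31:19]}  ^ (w2 >> 10);

    // 3-way add first (one pass through the carry chain, 8 slices at
    // 4 bits/slice), then a 2-way add with the remaining component.
    wire [31:0] sum3 = s1 + w7 + s0;
    assign new_w = sum3 + w16;
endmodule
```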
So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is the lowest form of slice life. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof) and 8 flip-flops, but the mining algorithm has little use for it.
The next bad news is that only half of the "good" slices can be used as RAM or shift registers. That's not a terrible thing, since most will be consumed as adders anyway.
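To make that concrete: the RAM/shift-register-capable slices are what the UGs call SLICEMs, and they're what a delay line like shifter_32b gets inferred onto. A hedged sketch of the kind of code I mean (my own illustrative module, parameterization assumed, not the project's actual shifter_32b):

```verilog
// Hypothetical sketch of a wide delay line in the style of shifter_32b.
// XST's shift register extraction maps this onto SRL primitives, which only
// the RAM-capable half of the "good" slices (SLICEMs) can host.
module shifter_sketch #(
    parameter DEPTH = 16        // assumed depth, for illustration only
) (
    input             clk,
    input      [31:0] d,
    output     [31:0] q
);
    reg [31:0] mem [0:DEPTH-1];
    integer i;
    always @(posedge clk) begin
        mem[0] <= d;
        for (i = 1; i < DEPTH; i = i + 1)
            mem[i] <= mem[i-1];
    end
    assign q = mem[DEPTH-1];
endmodule
```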
And that's about all I could find that's particularly good or bad with the S6 slices. Since the good slices are all in columns, and spaced evenly, the impact of the useless slices should actually be far less severe than I thought.
For the S6's routing architecture, the quick overview basically said routing cost scales roughly with the Manhattan distance between CLBs. I haven't dug into the details more than that at this point.
With that knowledge in hand, and some beginner's experience with FPGA Editor, I dived in and found what appears to be the largest bottleneck in the current code:
Slack (setup path): 0.264ns (requirement - (data path - clock path skew + uncertainty))
Source: uut2/HASHERS[41].shift_w1/Mram_m (RAM)
Destination: uut2/HASHERS[41].upd_w/tx_w15_30 (FF)
Requirement: 12.500ns
Data Path Delay: 11.716ns (Levels of Logic = 6)
Clock Path Skew: -0.260ns (0.780 - 1.040)
Source Clock: hash_clk rising at 0.000ns
Destination Clock: hash_clk rising at 12.500ns
Clock Uncertainty: 0.260ns
Clock Uncertainty: 0.260ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.450ns
Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: uut2/HASHERS[41].shift_w1/Mram_m to uut2/HASHERS[41].upd_w/tx_w15_30
Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
RAMB16_X2Y46.DOA2 Trcko_DOA 1.850 uut2/HASHERS[41].shift_w1/Mram_m
uut2/HASHERS[41].shift_w1/Mram_m
SLICE_X60Y126.A2 net (fanout=4) 5.845 uut2/HASHERS[41].cur_w1<2>
SLICE_X60Y126.COUT Topcya 0.379 uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_lut<16>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
SLICE_X60Y127.CIN net (fanout=1) 0.003 uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
SLICE_X60Y127.BMUX Tcinb 0.292 uut2/HASHERS[41].shift_w0/r<27>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<23>
SLICE_X78Y122.B3 net (fanout=1) 1.995 uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd_211
SLICE_X78Y122.BMUX Tilo 0.251 uut2/HASHERS[41].upd_w/tx_w15<23>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
SLICE_X78Y122.C5 net (fanout=2) 0.383 uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
SLICE_X78Y122.COUT Topcyc 0.295 uut2/HASHERS[41].upd_w/tx_w15<23>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_lut<0>22
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_22
SLICE_X78Y123.CIN net (fanout=1) 0.003 uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>23
SLICE_X78Y123.COUT Tbyp 0.076 uut2/HASHERS[41].upd_w/tx_w15<27>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_26
SLICE_X78Y124.CIN net (fanout=1) 0.003 uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>27
SLICE_X78Y124.CLK Tcinck 0.341 uut2/HASHERS[41].upd_w/tx_w15<31>
uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_xor<0>_30
uut2/HASHERS[41].upd_w/tx_w15_30
------------------------------------------------- ---------------------------
Total 11.716ns (3.484ns logic, 8.232ns route)
(29.7% logic, 70.3% route)
It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, about the furthest possible position from any RAM location. Here, check this image out; it will make you go insane, so don't stare too long:
http://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon-rotting brain fleshes. After exploding the Universe, of course.

Oh, you looked at it anyway and are wondering about that little path heading downward? Yeah, it keeps going ... and going ... (into my damned soul).
And as you can read from the timing report above, routing accounts for *drum roll* 70.3%! Yay! That's 8ns of routing, and only 3.4ns of logic! Imagine if we got rid of all the routing...
I see four solutions at the moment, and will investigate as time allows:
1) Get rid of the RAMB16BWER to some extent.
2) Add an extra register to the output of shifter_32b when inferring RAM logic. Flip-flops should route close to the logic and mask RAM routing delay.
3) Add two, duplicate registers to the output of shifter_32b when inferring RAM logic.
4) Ditch the RAM infer completely and try to coax ISE into using all those flip-flops in the useless slices (which are peppered throughout the routed design at the moment).
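For options 2 and 3, the idea is just to capture the shifter output on a flip-flop near the source before it crosses the fabric. A hedged sketch (names are mine, and the KEEP note would need checking against the XST manual):

```verilog
// Sketch of options 2/3 (illustrative names, not the project's actual code):
// register the shifter output so the RAM-to-fabric hop lands on a nearby
// flip-flop instead of driving logic ~18 columns away.  Note this adds a
// cycle of latency, which the shifter depth would have to absorb.
module shifter_out_reg (
    input             clk,
    input      [31:0] shifter_q,   // output of the inferred RAM/SRL shifter
    output reg [31:0] q_r_a,       // option 2: use just this one
    output reg [31:0] q_r_b        // option 3: duplicate copy
);
    // XST merges equivalent registers by default, so keeping both copies
    // likely needs a KEEP attribute or equivalent_register_removal turned
    // off -- check the XST manual for the exact incantation.
    always @(posedge clk) begin
        q_r_a <= shifter_q;
        q_r_b <= shifter_q;
    end
endmodule
```

With two copies, the placer is free to drop one register near each consumer, which is the whole point of option 3.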
I will try 3 first, and hope ISE does the intelligent thing. My hope is that flip-flops in the useless slices will get utilized, since they're mingled in with the useful logic and so should provide somewhat fast local routing.
The interesting thing is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.