Don't forget that running one card in one slot (100% dedication of bus-speed/calls/power) is not the same as running two cards in two slots. It will NEVER scale up to 2x what any single card gets, at any time.
The "controller" for the bus uses an internal "thread" to distribute the calls that the CPU is sending all at once. There is a slight buffer in there when more than one card is installed. (It has to send data through x dedicated lines and y shared lines to "simulate" full 16x support.)
Also note... Though you may have 3 (16x) slots... rarely will you ever have 16x/16x/16x dedicated speeds. At most, with one cheap controller, the slots reconfigure to something like...
(16x/8x/4x) or (16x/16x/4x) or (16x/8x/8x) or (8x/8x/8x) or (8x/8x/4x) etc... after adding a card.
Prior to adding three cards, the slots may progress like this...
(16x/xx/xx) -> (16x/16x/xx) -> (16x/8x/8x)
or
(16x/xx/xx) -> (16x/8x/xx) -> (16x/4x/4x)
or if only one card, in any slot...
(16x/xx/xx) or (xx/16x/xx) or (xx/xx/16x). They do this so you are not LIMITED to stuffing the card into only the first slot, like AGP. It is not so that all slots will run at 16x speeds, all at once.
If all slots are 16x, they can still be shared 16x... (Effectively it is only 16x for the most demanding card, across all slots, and the total bandwidth will never be 16x*3.)
Thus, one card has that full capacity, one runs slightly less, and one runs even less. But under an auto-balanced load, all cards run at the same speed, somewhere above 4x or 8x but below 16x.
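A toy calculation of what those splits mean in bandwidth terms. The lane pool and per-lane speed below are assumptions (roughly PCIe 3.0), not pulled from any datasheet; "xx" means the slot is electrically empty.

```python
# Toy lane math, not from any datasheet: assume the cheap controller only has
# a fixed pool of usable lanes to split between the slots. "xx" = empty slot.

PCIE3_LANE_GBPS = 0.985   # ~GB/s per lane (approx., after encoding overhead)

def describe(slots):
    lanes = [int(s.rstrip("x")) for s in slots if s != "xx"]
    total = sum(lanes)
    per_card = [round(l * PCIE3_LANE_GBPS, 1) for l in lanes]
    print(f"{'/'.join(slots):>12}: per-card ~{per_card} GB/s, "
          f"total ~{total * PCIE3_LANE_GBPS:.1f} GB/s ({total} lanes)")

describe(["16x", "xx", "xx"])   # one card, the full 16 lanes are its own
describe(["16x", "8x", "xx"])   # second card added, lanes get re-split
describe(["16x", "4x", "4x"])   # third card, the pool is still capped
describe(["8x", "8x", "8x"])    # auto-balanced split, nobody gets 16x
```

The point of the toy numbers: the total never gets anywhere near 16x*3, the controller just re-slices the pool it already has.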
To get more than 4 true (16x) slots, you would need two controllers and two CPUs. Controllers are physically limited to 7 slots max for PCIe, even with load-balancing, buffering, and sharing.
NOTE: Using all 1x connections actually yields higher numbers. There is NO buffering and no load-balanced shared lines; each card has a dedicated bus directly to the controller. (Thus, no "stalling" from that issue, and no "cross-talk" noise on the data lines.)
ALSO NOTE: SATA, USB, and any onboard sound or WiFi devices may ALSO be using that same controller as a directly wired PCIe connection. (Some use PCI, but PCI is all on that unseen 8th slot.) That means more "call events" and more buffering.
Call-event -> send-data -> process-data -> return-data. (If you spend more time processing harder data, you make fewer calls, and thus you see less "stalling". That is the drawback of a "small hash-difficulty"... you are processing faster than the call-times. With longer processing, the call-times are more like 1% of the time, not 50%. Using arrays and pre-sending work into the buffer while processing reduces this variance; sending each call one at a time is taxing. But that is the programmer's issue, and the pool-operator's issue for using low share difficulties. Solo-mine, and you never see this issue as badly.)
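A back-of-the-envelope version of that, in Python. The per-call cost, the "easy vs. hard" processing times, and the 50-unit batch are all invented just to show the shape, not measured from any miner:

```python
# Back-of-the-envelope only, numbers are made up: what fraction of wall time
# is eaten by the fixed per-call cost, for easy vs. hard work per call, and
# when work is pre-queued ("arrays") into the buffer before it is needed.

CALL_OVERHEAD_S = 0.002   # assumed fixed cost: call-event + send + return

def overhead_fraction(process_s, batch=1):
    """Fraction of total time spent on call overhead instead of hashing."""
    return CALL_OVERHEAD_S / (CALL_OVERHEAD_S + process_s * batch)

print("easy shares, one call at a time:", f"{overhead_fraction(0.002):.0%}")    # ~50%
print("easy shares, 50 pre-queued     :", f"{overhead_fraction(0.002, 50):.1%}")
print("hard work (solo-mining range)  :", f"{overhead_fraction(0.2):.1%}")      # ~1%
```

Same idea as the text: make each call carry more work (or pre-queue work), and the call overhead shrinks from roughly half the time to a rounding error.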
In short... the cards are just too damn fast for the program.