The only cgminer parameters I'm using are:
-I 9 --auto-fan --auto-gpu --gpu-engine 750-950 --gpu-memclock 300
so I'm not specifying any work sizes or vectors (unless -I 9 is the work size). I'm mining to P2Pool, and my local stale rate is ~8.6%.
The only way I know of making it hash faster is by raising the engine clock above 950 MHz, but at that point my system gets really unstable, and sometimes either locks up or bluescreens :/
You might want to see what --worksize 64 or --worksize 256 does for your hash rate.
The stale rate of p2pool isn't comparable to that of other pools, as p2pool uses a much higher difficulty for its shares...
Due to the high value of each high-difficulty share, I recommend you use --submit-stale when mining on p2pool.
This will make cgminer submit every share it finds, even if it thinks it is already stale. This won't make a huge difference but might net you an additional share every now and then.
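To make the comparison concrete, here is a rough sketch that just prints the command lines to try. It only reuses flags already mentioned in this thread; the helper name `cgminer_command` is made up for illustration, and nothing is launched, so you still run each line by hand and watch the hash rate.

```python
import shlex

# Flags copied from the original post; adjust them to your own card.
BASE_ARGS = ["-I", "9", "--auto-fan", "--auto-gpu",
             "--gpu-engine", "750-950", "--gpu-memclock", "300"]


def cgminer_command(worksize=None, submit_stale=True):
    """Build a cgminer command line as one string (hypothetical helper)."""
    args = ["cgminer"] + BASE_ARGS
    if worksize is not None:
        # Work size suggested above: try 64 and 256 and compare hash rates.
        args += ["--worksize", str(worksize)]
    if submit_stale:
        # Submit every share found, even if cgminer thinks it is stale.
        args.append("--submit-stale")
    return shlex.join(args)


if __name__ == "__main__":
    for ws in (64, 256):
        print(cgminer_command(worksize=ws))
```

Run each printed line for a few minutes and keep whichever work size gives the better average hash rate.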
Intensity is a fine tuning parameter and its optimal value is dependent on the card's hash rate.
Intensity influences the size of each uninterruptible batch of calculations the GPU performs - high intensity batches of work are larger, thus taking longer to finish.
This results in occasional situations where the GPU takes a couple of seconds too long and produces a valid but already stale result, wasting the computation time.
Try lowering your intensity to 8. If the hash rate doesn't budge, you made the right call and your card didn't benefit from the higher intensity.
My general observations have been that the best results are achieved as follows (using the default two threads per GPU):
+ <200 MHash/s : intensity 7
+ 200..400 MHash/s : intensity 8
+ >400 MHash/s : intensity 9
Intensities beyond 9 are meant only for the new 7xxx family of cards and won't do any good when used with the old, slower cards.
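The rule of thumb above can be written as a small lookup. This is only a sketch of my own observations, not anything cgminer does internally; the function name `suggested_intensity` is made up, and it assumes the default two threads per GPU and an older (pre-7xxx) card, so it never suggests more than 9.

```python
def suggested_intensity(mhash_per_s):
    """Map a card's hash rate (MHash/s) to a starting cgminer intensity.

    Thresholds follow the rule of thumb in this post; treat the result as a
    starting point and confirm it by watching the actual hash rate.
    """
    if mhash_per_s < 0:
        raise ValueError("hash rate must not be negative")
    if mhash_per_s < 200:
        return 7
    if mhash_per_s <= 400:
        return 8
    # Above 400 MHash/s; values beyond 9 are left to the 7xxx family.
    return 9


if __name__ == "__main__":
    for rate in (150, 300, 450):
        print(rate, "MHash/s -> -I", suggested_intensity(rate))
```

Start from the suggested value, then step the intensity down by one. If the hash rate stays the same, keep the lower setting.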