I can't help you choose parameters because I don't have any of those; I'm still trying to find a way to lease a GPU for bitcoin (turns out it's very hard to find someone who's not trying to sell you a miner). But the reason VanitySearch is much faster than this has nothing to do with the number of characters being compared. It's because Jean_Luc's code generates a "jump table" of secp256k1 points (see e.g. GPU/Group.h), so he avoids doing the full elliptic curve math at runtime, unlike brichard's code. This is despite both of their engines being written in CUDA.
In VanitySearch he also writes all his math in assembly code while Bitcrack uses C for its math functions.
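To make the jump table idea concrete, here is a rough Python sketch. It is not Jean_Luc's code and I have not reproduced the layout of GPU/Group.h; the names `build_table` and `scan_with_table` are made up for illustration. The point is only that once you have a table of small multiples of G, each new key in a batch costs one point addition instead of a full scalar multiplication.

```python
# Toy secp256k1 arithmetic in affine coordinates.
# Illustration only: slow, not constant-time, and not how the CUDA code is laid out.
P = 2**256 - 2**32 - 977  # secp256k1 field prime
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY)


def add(a, b):
    """Add two curve points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a + (-a)
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P  # tangent slope (doubling)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def multiply(k, point=G):
    """Double-and-add: roughly 256 doublings plus additions per key."""
    result, addend = None, point
    while k:
        if k & 1:
            result = add(result, addend)
        addend = add(addend, addend)
        k >>= 1
    return result


def build_table(size):
    """Hypothetical helper: precompute [1*G, 2*G, ..., size*G] once, up front."""
    table = [G]
    for _ in range(size - 1):
        table.append(add(table[-1], G))
    return table


def scan_with_table(start_key, table):
    """Yield (key, point) for start_key+1 .. start_key+len(table).

    One full multiplication for the batch, then a single addition per key.
    """
    base = multiply(start_key)
    for i, step in enumerate(table, start=1):
        yield start_key + i, add(base, step)


if __name__ == "__main__":
    table = build_table(8)
    for key, point in scan_with_table(1000, table):
        assert point == multiply(key)  # same result, far less work per key
    print("table-based scan matches full multiplication")
```

This needs Python 3.8 or newer for `pow(x, -1, P)`. A real GPU implementation would also want to share the modular inversions across the batch, which this sketch does not attempt.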
Several pages back somebody posted here that they used -b 112 -t 512 -p 512 for their 1080 and got speeds similar to yours.
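Are your Python files compiled? Try precompiling them to bytecode with python -m compileall YOUR_PY_FILES, so that the source doesn't have to be parsed and compiled every time the script starts. The bytecode is still run by the interpreter, so this mostly helps startup time.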
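If you'd rather do the same thing from inside Python, something like the sketch below should work. It uses the standard library's `py_compile` module, which is what `compileall` calls for each file. The script name `demo_script.py` and the helper `precompile` are placeholders I made up for the example.

```python
import pathlib
import py_compile
import tempfile


def precompile(source_path):
    """Compile one .py file to bytecode and return the path of the .pyc written."""
    # doraise=True makes a syntax error raise py_compile.PyCompileError
    # instead of only printing a message.
    return py_compile.compile(str(source_path), doraise=True)


# Placeholder script written to a temp directory, just so the example runs anywhere.
workdir = pathlib.Path(tempfile.mkdtemp())
script = workdir / "demo_script.py"
script.write_text("print('hello')\n")

pyc_path = precompile(script)
print(pyc_path)  # normally ends up under __pycache__/ next to the source
```

Keep in mind Python already caches bytecode for imported modules by itself, so this only makes a difference for the script you launch directly.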
What do you mean? Bitcrack has never printed how much CPU or memory it uses in the console. Do you mean in Task Manager? If so, you'll usually find it as a child process of Windows Command Processor. Task Manager will show Windows Command Processor using a lot of resources, but it's actually the child process running inside it that is using them.