Hello,
I would like to thanks arulbero who gave me by MP a great tip to improve speed by MP using some symmetries
![Wink](https://bitcointalk.org/Smileys/default/wink.gif)
I missed this, shame on me.
It will save few modular mult. But however, ~40% of cpu is used for modular mult, other 60% mainly go to SHA,RIPE,Base58,ModInv and byteswapping, so I don't know if I can reach the 2.0MKey/s (x 1.66)
For linux (cpu side), I have to work on code generation optimization but assembly using AT&T syntax makes me crazy.
Anyway, I managed to set-up CUDA sdk 8.0 on the old Ubuntu PC. I had to patch the nvidia driver, a nightmare.
But now CUDA works, I managed to compile sample code and make it work, so i will be able to develop the multi GPU release of vanitysearch.