It would be far easier and more effective to remove most of the transaction-handling code from p2pool, and make share objects contain only the merkle root plus the coinbase transaction (and the merkle path thereto). However, I have not had the time to do that, and nobody else seems to be interested in doing it. Even though it would be the easier route, it's still not a trivial change.
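To illustrate why the merkle root, the coinbase transaction, and the merkle path are enough for a share to commit to its payout, here is a minimal sketch of the check a peer could run. It is not p2pool's actual code: the function names and the sample byte strings are hypothetical, and hashes are assumed to be raw 32-byte digests in internal byte order. It relies only on the standard Bitcoin merkle construction (double SHA-256 of the concatenated pair) and on the fact that the coinbase is the first leaf, so it is always the left operand.

```python
import hashlib


def double_sha256(data):
    """Bitcoin's hash function: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root_from_coinbase(coinbase_tx, merkle_branch):
    """Fold the coinbase hash up the tree using its sibling hashes.

    The coinbase is leaf 0, so at every level the running hash is the
    left operand and the sibling from the branch is the right operand.
    """
    current = double_sha256(coinbase_tx)
    for sibling in merkle_branch:
        current = double_sha256(current + sibling)
    return current


def share_commits_to_coinbase(coinbase_tx, merkle_branch, merkle_root):
    """True if the merkle root in the header covers this coinbase."""
    return merkle_root_from_coinbase(coinbase_tx, merkle_branch) == merkle_root


# Hypothetical four-transaction block, used only to exercise the check.
coinbase = b"hypothetical serialized coinbase"
leaves = [double_sha256(coinbase)] + [
    double_sha256(tx) for tx in (b"tx1", b"tx2", b"tx3")
]
left = double_sha256(leaves[0] + leaves[1])
right = double_sha256(leaves[2] + leaves[3])
root = double_sha256(left + right)

# The path for leaf 0 is its sibling at each level, bottom to top.
branch = [leaves[1], right]
print(share_commits_to_coinbase(coinbase, branch, root))
```

With this shape, a peer validating a share never needs the other transactions themselves, only one 32-byte sibling hash per tree level.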
Note: Bitcoin Core is *also* mostly single-threaded, although not quite to the same extent as p2pool.
Wouldn't it be possible to rewrite the code and compile it with Cython? Cython can release the GIL around code that does not touch Python objects. It also makes it easy to mix Python with C/C++ code, which might be useful for some parts of the code, and it would avoid rewriting everything in another language. However, I don't think Twisted can be used by Cython (I might be wrong on this), so it would need to be replaced with something different to handle the asynchronous networking (which might be quite a lot of work).
I've been running the 1mb_segwit code for quite some time now. It seems that the optimizations done by jtoomim and the use of PyPy have reduced the single-core load quite considerably compared to the main code. So, although some people still have problems, in my opinion it has alleviated the single-thread bottleneck considerably.
@jtoomim: what parts of the transaction handling can be removed in your opinion?