Using arithmetic coding, for example.
You can even perform it at the word level, not the letter level.
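To illustrate what "word level" means here, a minimal sketch (with a tokenization rule I just made up, not the project's actual one) of the frequency model that would drive such a coder:

```python
# Illustrative only: build a word-level frequency model that an
# arithmetic coder could use instead of a per-letter model.
from collections import Counter
import re

def word_frequencies(text: str) -> Counter:
    # Treat runs of letters as "words"; anything else is passed
    # through as a single-symbol token. This split rule is an assumption.
    tokens = re.findall(r"[^\W\d_]+|\S", text)
    return Counter(tokens)

freqs = word_frequencies("the cat sat on the mat")
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
# These counts would define the cumulative-probability intervals
# fed to the arithmetic coder.
```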
For example, the first two bits of the "lexical token" would select one of the following (a minimal sketch follows the list):
a) the following 6 bits encode one of the 64 most popular words;
b) the following 14 bits encode one of 16000+ less popular ones;
c) the following is a UTF-8 encoded chunk whose length (in symbols, not bytes) is either 1..32 if the next bit is 0, or a VARINT-encoded length shifted by three bits if it's 1;
d) the same for UTF-16LE, where the length is the number of Unicode symbols.
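Here is a hedged sketch of how such a tagged token could be emitted. The bit order, helper names, and the restriction to cases (a), (b) and the short form of (c) are my own simplifications, not the actual format:

```python
# Sketch of the 2-bit tag scheme above (assumed bit ordering).
class BitWriter:
    def __init__(self):
        self.bits = []

    def write(self, value: int, width: int):
        # Most-significant bit first; the real format might differ.
        for i in reversed(range(width)):
            self.bits.append((value >> i) & 1)

    def to_bytes(self) -> bytes:
        padded = self.bits + [0] * (-len(self.bits) % 8)
        return bytes(
            int("".join(map(str, padded[i:i + 8])), 2)
            for i in range(0, len(padded), 8)
        )

def encode_token(w: BitWriter, kind: str, value):
    if kind == "top64":          # (a) one of the 64 most popular words
        w.write(0b00, 2)
        w.write(value, 6)
    elif kind == "common16k":    # (b) one of ~16000 less popular words
        w.write(0b01, 2)
        w.write(value, 14)
    elif kind == "utf8_short":   # (c) raw UTF-8 chunk, 1..32 symbols
        w.write(0b10, 2)
        w.write(0, 1)                     # 0 = short length form
        w.write(len(value) - 1, 5)        # 1..32 stored as 0..31 symbols
        for byte in value.encode("utf-8"):
            w.write(byte, 8)

w = BitWriter()
encode_token(w, "top64", 17)           # popular word #17
encode_token(w, "utf8_short", "ünïq")  # literal chunk: 4 symbols, 6 bytes
print(w.to_bytes().hex())
```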
The problem with this is that I want to provide it for at least half a dozen languages, and the project is likely to outgrow its hobbyist nature :=)
Another argument in favour of the 5-bit scheme is that it's very simple and can easily be translated into many programming languages if it becomes popular.
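To show why the 5-bit scheme is easy to port: packing fixed 5-bit codes into bytes is only a few lines in any language (the letter-to-code mapping is omitted here and would come from the real tables):

```python
# Illustrative packer for fixed 5-bit codes; not the project's code.
def pack5(codes):
    acc, nbits, out = 0, 0, bytearray()
    for c in codes:
        acc = (acc << 5) | (c & 0x1F)
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # pad the last byte with zeros
    return bytes(out)

print(pack5([1, 2, 3, 4]).hex())  # 4 codes -> 20 bits -> "088640" (3 bytes)
```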
For the "Morse telegraph" option I would need at least to create the frequency tables of letters in each language (or find them on the web).
P.S. In the original solution, codepage number can occupy more than 3 bits if it's in the end: the most popular ones will be numbered 0..7, in Little Endian, and any zeroes at the end can be dropped if do not fit into the "last byte".
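A rough sketch of my reading of this trailing-codepage trick (illustrative only, the real layout may differ): write the number LSB-first so its high zero bits land at the very end and can be dropped, and let the decoder assume the missing bits are zero.

```python
# Assumed behaviour, not the actual spec: the codepage number is written
# LSB-first at the tail of the bit stream, and trailing zero bits that
# would spill past the final byte are simply omitted.
def append_codepage(bits, codepage, width=8):
    tail = [(codepage >> i) & 1 for i in range(width)]   # LSB first
    # Drop high-order zero bits that would force an extra byte.
    while tail and tail[-1] == 0 and (len(bits) + len(tail)) % 8 != 0:
        tail.pop()
    return bits + tail

payload = [1, 0, 1, 1, 0]           # 5 bits already in the stream
print(append_codepage(payload, 3))  # codepage 3 keeps only 3 bits: one byte total
```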