A minimalist text encoding for OP_RETURN outputs

byteball

member

Activity: 266

Merit: 42

The rising tide lifts all boats

Quote from: QRC on May 25, 2018, 10:48:19 AM

Even using 5 bit alphabet you waste a lot of space, you need to compress message first,
Using Arithmetic coding for example.
You can even perform it on words level, not the letters level.

Right, I have thought about both options, first to assign different number of bits to symbols depending on their frequency (from 4 to 6 or 7 bits, but I would need to make some measurements of efficiency on English texts first) and then to use dictionaries for each language. Like https://github.com/bitcoin/bips/blob/master/bip-0039/english.txt

For example, the first two bits of the "lexical token" will mean either:
a) the following 6 bits encode one of 64 most popular words;
b) the following 14 bits encode one of 16000+ of less popular ones;
c) the following is UTF-8 encoded chunk of length (in symbols, not bytes) 1..32 if the following bit is 0, or VARINT shifted by three bits encoded length if it's 1;
d) same for UTF-16LE, where length is the number of Unicode symbols.

The problem with this is that I want to provide it for at least half a dozen languages, and the project is likely to outgrow it's hobbyist nature :=)
Another argument in favour of 5-bit scheme is that it's very simple, can be easily translated into many programming languages if becomes popular.
For the "Morse telegraph" option I would need at least to create the frequency tables of letters in each language (or find them on the web).

P.S. In the original solution, codepage number can occupy more than 3 bits if it's in the end: the most popular ones will be numbered 0..7, in Little Endian, and any zeroes at the end can be dropped if do not fit into the "last byte".

QRC

newbie

Activity: 10

Merit: 2

Even using 5 bit alphabet you waste a lot of space, you need to compress message first,
Using Arithmetic coding for example.
You can even perform it on words level, not the letters level.

byteball

member

Activity: 266

Merit: 42

The rising tide lifts all boats

I know the topic of using bitcoin-codebase-based blockchains for storing arbitrary data (e.g. IM chat) is controvercial, unless
developers or community of given blockchain invite and embrace such use, but here's an idea:

a lot of alphabets fit into 32 symbols, whether with punctuation or without. That is, without capital letters (or ALL CAPS).
We can reserve certain codes at the end of 0..31 sequence where the last will be the most common (space, comma, period).
For Latin without diacritics, that would allow 32-26=6 punctuation symbols.
For Hebrew without niqqudoth, that would allow even more, 32-22=10 symbols. This is assuming that we don't waste alephbeith positions on sophit nun, qaf etc. - the usual letter is decoded as sophit if used before space or as last one in chunk. Although sometimes we would want non-sophit at the end of the word, which will be marked by special "alternative space" punctuation, which can be represented with usual space+modifier. 9 punctuation marks apart from space seem plenty for Hebrew, but we need at least geresh, dagesh/mapiq/vav shuruq/sindot modifier of the previous letter as 1-fit-all mark in order to disambiguate between similarly written words, maybe shva for the same purpose, period, comma, hyphen, colon/semicolon, em(n)dash (could be double and triple hyphen) and so on.
For Cyrillics, we would need to drop some letters even to have a space in the alphabet. Alternatively, we could take 6 bits and represent almost any cyrillics variety whether from ex-USSR or outside. This extended CYR can contain all 10 Arabic/Indian digits in it, whereas for 5-bit Latin such digits would have to be chunked in a separate codepage. Absence of digits is not so much a problem for Hebrew where we can denote letter-coded numbers with a punctuation mark.
Greek would still have some punctuation available. Etc.

3 bits seem enough to represent the code page, but see my later post, in P.S.
So the grammar would be:
VARINT length of the message (1 or more full bytes)
5 bits first symbol
...
5 bits last symbol
3 or more bits code page
pad with zero bits until the last full byte, and/or append any arbitrary data like Int, Long, BigDecimal that would make the message complete.

If you have some latin letters in the middle of another alphabet, you can repeat the said grammar 3 times: LEN1-MSG1-CP1+LEN2-MSG2-CP2+LEN3-MSG3-CP3

We can save 3 bits for Latin as it will be the default.

Obviously, if the message doesn't fit into one output, another transaction can spend the change from the previous one and provide the next chunk.
Not very efficient, but would be tolerated by almost any bitcoin-derived blockchain in existence.

Thoughts?

Topic: A minimalist text encoding for OP_RETURN outputs (Read 159 times)