BIP39 foreign language wordlists not sorted

dabura667

sr. member

Activity: 475

Merit: 254

Quote from: johoe on July 05, 2015, 03:44:02 PM

So in the word list っ and つ are sorted as the same letter. This may be standard for Japanese localization, but if you don't have this localization installed you get a different order. The bad thing is that the binary-search method in the BIP39 mnemonic tool doesn't work if the list is not sorted. Thus, for example, the unit tests of python-mnemonic fail.

This is my fault:
I sorted the list BEFORE NFKD normalizing it.
Sorting again AFTER NFKD normalizing it will produce the results mentioned.

When I first made the pull request, I was under the mistaken impression that the lists were not to be NFKD normalized, and all NFKD normalization would occur in the apps... but someone corrected me, and I fixed the NFKD normalization of the list... however, it seems the order should have been changed too.
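A minimal sketch of how the two orders can diverge, assuming a plain codepoint sort. The two words are illustrative only (they are not claimed to be wordlist entries), and Latin script is used because the effect is easy to see there: a precomposed é sorts after z, while its NFKD form starts with a plain e.

Code:

```python
import unicodedata

# Illustrative words only; not claimed to be entries of any BIP39 wordlist.
# The second word is "été" written with precomposed é (U+00E9).
words = ["zone", "\u00e9t\u00e9"]


def nfkd(s):
    """Return the NFKD-normalized form of s."""
    return unicodedata.normalize("NFKD", s)


# Sort first, normalize afterwards (the order of operations described above).
sorted_then_normalized = [nfkd(w) for w in sorted(words)]

# Normalize first, then sort (what a codepoint-based lookup would expect).
normalized_then_sorted = sorted(nfkd(w) for w in words)

# Prints False: the same words end up at different positions.
print(sorted_then_normalized == normalized_then_sorted)
```

Since a word's position is its index, any such difference changes which bits the word encodes.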

Unfortunately, there are people using phrases generated with this word order.

BIP39's weakness: the checksum depends on the order of the wordlist, so verifying it requires the wordlist. But the BIP says "not require wordlist" while at the same time "require check the checksum" (which requires knowing the wordlist lol).
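To make the dependency concrete, here is a rough sketch of the checksum check, assuming the usual BIP39 encoding (11 bits per word, checksum taken from the first bits of SHA-256 of the entropy). The wordlist is a hypothetical placeholder of 2048 made-up words, and the function names are mine, not from any library.

Code:

```python
import hashlib


def entropy_to_indices(entropy):
    """Split entropy plus checksum bits into 11-bit wordlist indices."""
    ent_bits = len(entropy) * 8
    cs_len = ent_bits // 32  # 4 checksum bits for 128-bit entropy
    checksum = format(hashlib.sha256(entropy).digest()[0], "08b")[:cs_len]
    bits = format(int.from_bytes(entropy, "big"), "0{}b".format(ent_bits)) + checksum
    return [int(bits[i:i + 11], 2) for i in range(0, len(bits), 11)]


def checksum_ok(words, wordlist):
    """Recompute the checksum from each word's position in wordlist."""
    bits = "".join(format(wordlist.index(w), "011b") for w in words)
    ent_bits = len(bits) * 32 // 33
    entropy = int(bits[:ent_bits], 2).to_bytes(ent_bits // 8, "big")
    expected = format(hashlib.sha256(entropy).digest()[0], "08b")[:len(bits) - ent_bits]
    return bits[ent_bits:] == expected


# Hypothetical placeholder wordlist: w0000 ... w2047 (not a real BIP39 list).
wordlist = ["w{:04d}".format(i) for i in range(2048)]
words = [wordlist[i] for i in entropy_to_indices(bytes(16))]

# Same words, but two entries of the list have swapped positions.
reordered = wordlist.copy()
reordered[3], reordered[4] = reordered[4], reordered[3]

print(checksum_ok(words, wordlist))   # checked against the original order
print(checksum_ok(words, reordered))  # checked against the reordered list
```

The phrase itself is unchanged in both calls; only the position of one word in the list differs, and that is enough for the check to give a different answer.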

I can understand why ThomasV removed BIP39 functionality from Electrum now...

johoe

full member

Activity: 217

Merit: 263

Quote from: HeadsOrTails on July 07, 2015, 01:13:44 AM

Another issue with French (though the 2 languages for Chinese are worse).... The French wordlist shares 100 words with English. So the vector with entropy of 16 null bytes (32 hex zeros), e.g., can only be differentiated as French (English = abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about) by the checksum, since abandon is both French and English (different index too IIRC). But this isn't always possible, since the mnemonic can occasionally be all shared words. Not the same issue, but indexes will differ for each language.

For Chinese, the words that are in both lists have the same numeric ID. If I understand it correctly, they contain exactly the same words; it's just that they use different variants of the Chinese characters (simplified vs. traditional). This may be bad, because the traditional Chinese seed produces different keys than the simplified Chinese seed, even though they both match the checksum.

For French vs. English, the situation is not so bad. You have some seeds where you can't say whether they are French or English: sometimes the checksum is okay for French, sometimes for English and sometimes for both languages. But the keys produced are at least the same (unless the French and English seeds differ by accents). The 128/256-bit entropy from which the mnemonic was generated is not used for anything else, so it is not so bad if it differs.
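A short sketch of why the keys come out the same, assuming the seed derivation as I understand it from BIP39 (PBKDF2-HMAC-SHA512 over the NFKD-normalized sentence, salt "mnemonic" plus passphrase, 2048 iterations). The function name is my own; note that it takes no wordlist or language argument at all.

Code:

```python
import hashlib
import unicodedata


def mnemonic_to_seed(mnemonic, passphrase=""):
    """Derive the 64-byte seed from the sentence text only (no wordlist involved)."""
    sentence = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", sentence, salt, 2048)


# The same text gives the same seed, whichever language it is read as.
sentence = " ".join(["abandon"] * 11 + ["about"])
print(mnemonic_to_seed(sentence).hex())

# Precomposed and decomposed spellings normalize to the same seed,
# but a word typed without its accent is a different string and a different seed.
print(mnemonic_to_seed("b\u00e9lier") == mnemonic_to_seed("be\u0301lier"))
print(mnemonic_to_seed("b\u00e9lier") == mnemonic_to_seed("belier"))
```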

Changing the word lists opens new problems. We can't know how many French/Spanish/Chinese/Japanese seeds are already in use. I think the most practical solution is to just drop the requirement that the word lists are sorted and require a linear search. When the language is not unique, check all supported languages and accept the seed if it has the right checksum for at least one language. Auto-correction requires that the user chooses the language before entering the seed. In the end, the private keys are computed from the seed, ignoring the language for which it was computed. The checksum is only there to prevent people from doing stupid things like inventing their own seed with low entropy.
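Roughly what I have in mind, as a sketch only. The two wordlists are hypothetical placeholders (one deliberately unsorted), and the function names are mine. The lookup is a plain linear search, so no sort order is assumed.

Code:

```python
import hashlib


def checksum_ok(indices):
    """Check the BIP39-style checksum for a list of 11-bit word indices."""
    bits = "".join(format(i, "011b") for i in indices)
    ent_bits = len(bits) * 32 // 33
    entropy = int(bits[:ent_bits], 2).to_bytes(ent_bits // 8, "big")
    expected = format(hashlib.sha256(entropy).digest()[0], "08b")[:len(bits) - ent_bits]
    return bits[ent_bits:] == expected


def matching_languages(words, wordlists):
    """Return every language whose list contains all words with a valid checksum."""
    matches = []
    for name, wordlist in wordlists.items():
        try:
            indices = [wordlist.index(w) for w in words]  # linear search
        except ValueError:
            continue  # at least one word is not in this language's list
        if checksum_ok(indices):
            matches.append(name)
    return matches


# Hypothetical placeholder lists; "lang_a" is a permutation, so it is not sorted.
wordlists = {
    "lang_a": ["w{:04d}".format((i * 7) % 2048) for i in range(2048)],
    "lang_b": ["x{:04d}".format(i) for i in range(2048)],
}

# Indices of the all-zero 128-bit vector (eleven times index 0, then index 3).
words = [wordlists["lang_a"][i] for i in [0] * 11 + [3]]

print(matching_languages(words, wordlists))
```

A seed would be accepted if the returned list is non-empty; when more than one language matches, the keys are still the same as long as the words are spelled identically.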

HeadsOrTails

full member

Activity: 233

Merit: 102

Quote from: tspacepilot on July 06, 2015, 10:16:50 AM

So it looks to me like you wouldn't expect to sort them in the same order even in Japanese localizations. I wonder if there really is some sort of mistake that was done in the sorting of this text file.

I don't think there has been sorting. Or there was initially, and over time it's crept towards unsorted. Because here's the thing: the Japanese list only fails for ~50% of test vectors. So if you run the aforementioned JSON vectors, the first test passes, whereas the second does not.

Quote from: johoe on July 05, 2015, 03:44:02 PM

So in the word list っ and つ are sorted as the same letter. This may be standard for Japanese localization, but if you don't have this localization installed you get a different order.

Yes, but the end user never sees the list.

Word lists are created and utilised solely by developers, so it shouldn't matter what locale it is, because the code is looking to find a sorted (by codepoint) word list.

Perhaps a BIP39 failsafe should be checking that

Code:

wordlist = open("wordlist.txt", encoding="utf-8").read().split()
assert sorted(wordlist) == wordlist  # sorted by codepoint

Quote from: johoe on July 04, 2015, 01:18:34 PM

Looking at the official BIP39 wordlists, it seems other languages have the same problem, e.g., French:

Code:

belette
bélier
belote
bénéfice
berceau

I'm not sure what the right solution is. Sorting the word list with LANG=C would change the indices of the words and break compatibility. Requiring that the localization setting match the language of the word list makes automatic tests difficult, especially if you don't have support for the language installed.

Another issue with French (though the 2 languages for Chinese are worse).... The French wordlist shares 100 words with English. So the vector with entropy of 16 null bytes (32 hex zeros), e.g., can only be differentiated as French (English = abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about) by the checksum, since abandon is both French and English (different index too IIRC). But this isn't always possible, since the mnemonic can occasionally be all shared words. Not the same issue, but indexes will differ for each language.

Between Electrum doing its own thing, and each BIP39 language being unsorted, there are major issues that need ironing out sooner rather than later.

The weird thing is, only you 2 seem to agree there's an issue in having locale sorted words!

tspacepilot

legendary

Activity: 1456

Merit: 1083

I may write code in exchange for bitcoins.

That's very interesting, thanks for showing me that, johoe. I looked up those characters and the smaller one is used almost as a sort of diacritical mark to double the following consonant:

https://en.wikipedia.org/wiki/Sokuon

So it looks to me like you wouldn't expect to sort them in the same order even in Japanese localizations. I wonder if there really is some sort of mistake that was done in the sorting of this text file. (Also, I just learned something cool about Japanese orthography, so thanks at least for that, you guys!)

johoe

full member

Activity: 217

Merit: 263

Code:

>LANG=C sort japanese.txt | diff japanese.txt -
 あたる
+あっしゅく
 あつい
 あつかう
-あっしゅく
 あつまり
 あつめる
 あてな
 あてはまる
 あひる
+あふれる
 あぶら
 あぶる
-あふれる
 あまい
...

So in the word list っ and つ are sorted as the same letter. This may be standard for Japanese localization, but if you don't have this localization installed you get a different order. The bad thing is that the binary-search method in the BIP39 mnemonic tool doesn't work if the list is not sorted. Thus, for example, the unit tests of python-mnemonic fail.
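A small sketch of the failure, using seven consecutive entries in the order the diff above shows for japanese.txt. The helper binary_index is my own stand-in built on Python's bisect module, not the actual python-mnemonic code.

Code:

```python
from bisect import bisect_left

# File order taken from the diff above (context lines and "-" lines).
words = ["あたる", "あつい", "あつかう", "あっしゅく", "あつまり", "あつめる", "あてな"]


def binary_index(wordlist, word):
    """Binary search by codepoint order; returns None if the word is not found."""
    i = bisect_left(wordlist, word)
    return i if i < len(wordlist) and wordlist[i] == word else None


target = "あっしゅく"
print(words.index(target))          # linear search finds the word
print(binary_index(words, target))  # binary search misses it
```

The binary search compares by codepoint, where っ (U+3063) comes before つ (U+3064), so it looks for the word in front of あつい and never reaches its real position.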

tspacepilot

legendary

Activity: 1456

Merit: 1083

I may write code in exchange for bitcoins.

Looking at that japanese.txt wordlist file, the words seem sorted to me. What order did you want them to be sorted in?

johoe

full member

Activity: 217

Merit: 263

Could it be a problem of localization? Every language has its own sorting rules, and often this is implemented in the library. Maybe if you have Japanese localization the list is sorted correctly.

Looking at the official BIP39 wordlists, it seems other languages have the same problem, e.g., French:

Code:

belette
bélier
belote
bénéfice
berceau

I'm not sure what the right solution is. Sorting the word list with LANG=C would change the indices of the words and break compatibility. Requiring that the localization setting match the language of the word list makes automatic tests difficult, especially if you don't have support for the language installed.
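For automatic tests, a locale-independent check is possible. This is only a sketch: it assumes the file is NFKD-normalized and compares by codepoint, which is what Python's default string comparison does. The helper name is made up.

Code:

```python
import unicodedata


def first_unsorted_pair(wordlist):
    """Return the first adjacent pair that is out of codepoint order, or None."""
    for prev, cur in zip(wordlist, wordlist[1:]):
        if prev > cur:
            return prev, cur
    return None


# The five French entries quoted above, in file order.
french = ["belette", "bélier", "belote", "bénéfice", "berceau"]
nfkd = [unicodedata.normalize("NFKD", w) for w in french]

print(first_unsorted_pair(nfkd))          # reports the bélier / belote pair
print(first_unsorted_pair(sorted(nfkd)))  # None once sorted by codepoint
```

This does not depend on which locales are installed, but it also shows the compatibility problem: the codepoint-sorted list puts the accented words at different indices.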

HeadsOrTails

full member

Activity: 233

Merit: 102

http://stackoverflow.com/a/31156743/3936486

Note the Japanese word list: https://github.com/trezor/python-mnemonic/raw/master/mnemonic/wordlist/japanese.txt

It's been raised here that a binary search (instead of using a built-in list index in Python) relies on a sorted list of words. That's not an issue for English, but Japanese would fail for the test vectors.

BIP39 specifies wordlists should be sorted.

Any insights?
