So it looks to me like you wouldn't expect them to sort in the same order even in Japanese localizations. I wonder if a mistake really was made when this text file was sorted.
I don't think there has been sorting. Or there was initially, and over time it's crept towards unsorted. Because here's the thing: the Japanese list only fails for ~50% of test vectors. So if you run the aforementioned JSON vectors, the first test passes, whereas the second does not.
So in the word list, っ and つ are sorted as the same letter. This may be standard for Japanese localization, but if you don't have this localization installed you get a different order.
Yes, but the end user never sees the list. Word lists are created and utilised solely by developers, so it shouldn't matter what locale it is, because the code is looking to find a sorted (by codepoint) word list.
Perhaps a BIP39 failsafe should be checking that
sorted(wordlist) == wordlist.txt
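As a rough sketch of what such a failsafe might look like (the function names here are made up for illustration and are not part of any BIP39 library), Python's built-in `sorted` compares strings by Unicode code point regardless of the system locale, so the check needs no localization support:

```python
def is_codepoint_sorted(words):
    """Return True if the words are in ascending Unicode code point order."""
    words = list(words)
    return words == sorted(words)


def first_unsorted_pair(words):
    """Return the first adjacent (previous, current) pair that is out of
    code point order, or None if the whole list is sorted."""
    words = list(words)
    for prev, cur in zip(words, words[1:]):
        if prev > cur:
            return prev, cur
    return None
```

Run against the lines of a word list file, `first_unsorted_pair` would point at the first place where the file departs from code point order, which is probably more useful in a test failure message than a bare True/False.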
Looking into the official BIP39 wordlists, it seems other languages have the same problem, e.g.,
French:
belette
bélier
belote
bénéfice
berceau
I'm not sure what the right solution is. Sorting the word list with LANG=C would change the indices of the words and break compatibility. Requiring that the locale setting match the language of the word list makes automated tests difficult, especially if you don't have support for that language installed.
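To make the index problem concrete, here is a small sketch using only the five French words quoted above (so the indices are positions within this sample, not real BIP39 indices). It assumes the accented letters are stored precomposed; re-sorting by code point moves four of the five words:

```python
# The five words in the order they appear in the French list excerpt.
listed = ["belette", "b\u00e9lier", "belote", "b\u00e9n\u00e9fice", "berceau"]

# Python's sorted() orders by code point, which is what LANG=C would do.
by_codepoint = sorted(listed)
# → ['belette', 'belote', 'berceau', 'bélier', 'bénéfice']

# Map each word whose position changed to (old index, new index).
moved = {
    word: (listed.index(word), by_codepoint.index(word))
    for word in listed
    if listed.index(word) != by_codepoint.index(word)
}
# → {'bélier': (1, 3), 'belote': (2, 1), 'bénéfice': (3, 4), 'berceau': (4, 2)}
```

Since a mnemonic encodes word indices, any word that moves would decode to different bits, which is why re-sorting the published lists would break existing mnemonics.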
Another issue with French (though the two Chinese word lists are worse): the French wordlist shares 100 words with English. So the vector with all-zero entropy (32 hex digits, i.e. 16 null bytes), for example, can only be identified as French rather than English (English = abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about) by the checksum, since abandon is both a French and an English word (at a different index too, IIRC). But this isn't always possible, since the mnemonic can occasionally consist entirely of shared words. Not the same issue, but indexes will differ for each language.
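A minimal sketch of why shared words make language detection ambiguous. The word sets below are tiny hypothetical subsets built from words mentioned in this thread, not the real 2048-word lists, and `candidate_languages` is an illustrative name:

```python
def candidate_languages(mnemonic, wordlists):
    """Return the languages whose word list contains every word
    of the mnemonic, in alphabetical order."""
    words = mnemonic.split()
    return sorted(
        lang
        for lang, wordlist in wordlists.items()
        if all(word in wordlist for word in words)
    )


# Hypothetical toy subsets, for illustration only.
toy_wordlists = {
    "english": {"abandon", "about"},
    "french": {"abandon", "belette", "belote"},
}
```

With these toy sets, `candidate_languages("abandon abandon", toy_wordlists)` returns both languages, while `"abandon about"` matches only English. When more than one language matches, something beyond word membership (such as the checksum) is needed to decide.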
Between Electrum doing its own thing and each BIP39 language being unsorted, there are major issues that need ironing out sooner rather than later.
The weird thing is, only you two seem to agree there's an issue with having locale-sorted words!