[MERGED] BIP-39 List of words in Portuguese accepted!! - page 2.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: bitmover on September 14, 2020, 10:30:07 PM

Quote from: NotATether on September 14, 2020, 07:15:50 PM

What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.

You can add a field to upload the file directly or to enter the URL. It makes little difference, since whoever is working with the wordlist certainly has the txt file on their computer or can access GitHub directly.

It is more important to be able to upload the txt file imo

Where are likely sites where people would want to upload the wordlist though? At least for Github and other source control sites, it can already be tracked and committed with git, so I don't really see a point in adding such a feature.

bitmover

legendary

Activity: 2352

Merit: 6089

bitcoindata.science

Quote from: NotATether on September 14, 2020, 07:15:50 PM

What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.

You can add a field to upload the file directly or to enter the URL. It makes little difference, since whoever is working with the wordlist certainly has the txt file on their computer or can access GitHub directly.

It is more important to be able to upload the txt file imo

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Someone just made a pull request for a Romanian wordlist: https://github.com/bitcoin/bips/pull/993

That should be a wake-up call for me to speed up development. I have almost finished the programmatic API. Docs are easy to write, and I think I'll use Nose as my testing suite.

EDIT: Levenshtein distance section of the API completed, initial unique characters and maximum length are almost finished
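A rough sketch of what the initial-unique-characters and maximum-length checks could look like. The function names are made up for illustration, the prefix length of 4 is taken from the Spanish wordlist rules quoted in this thread, and the maximum length of 8 is only an assumed default:

```python
from collections import defaultdict

def find_prefix_collisions(words, prefix_len=4):
    """Group words that share the same first `prefix_len` characters."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:prefix_len]].append(word)
    # Keep only the prefixes claimed by more than one word.
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}

def find_too_long(words, max_len=8):
    """Return the words longer than `max_len` characters (assumed limit)."""
    return [w for w in words if len(w) > max_len]

print(find_prefix_collisions(["abaixo", "abater", "abatido", "sol"]))
# {'abat': ['abater', 'abatido']}
print(find_too_long(["carteiro", "caldeirada"]))
# ['caldeirada']
```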

What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.
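A minimal sketch of such a method, using only the standard library. `fetch_wordlist` and `parse_wordlist` are hypothetical names, not part of the actual program, and the sketch assumes the file is UTF-8 with one word per line:

```python
from urllib.request import urlopen

def parse_wordlist(text):
    """Split raw wordlist text into words, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def fetch_wordlist(url, timeout=10):
    """Download a plain-text wordlist (e.g. a 'View Raw' link) and return its words."""
    with urlopen(url, timeout=timeout) as response:
        return parse_wordlist(response.read().decode("utf-8"))

# Parsing works the same whether the text came from a URL or a local file.
print(parse_wordlist("abaco\nabdomen\n\nabeja\n"))
# ['abaco', 'abdomen', 'abeja']
```

Keeping the parsing separate from the download means an uploaded or local file can go through exactly the same checks.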

bitmover

legendary

Activity: 2352

Merit: 6089

bitcoindata.science

Quote from: NotATether on September 12, 2020, 05:45:15 PM

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:

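chars = list(word)  # strings are immutable, so work on a list of characters
for i in range(len(chars)):
  char = chars[i]
  try:
    chars[i] = bitmover_table[char]
  except KeyError:
    pass  # character is not in the table; leave it unchanged
word = "".join(chars)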

So the letters in your dictionary, are they only from the Spanish alphabet or did you include letters used in other European languages?

You can add more letters if you think it is necessary; I am not sure this dictionary covers all possibilities.
About your code, I don't like to use loops unless they are really necessary. Explicit loops in Python are computationally costly and make your code slow.

I did this in my code:

Code:

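import pandas as pd

# Maps accented characters to their plain equivalents (contents elided here).
accent_dict = {...}
# One word per line, no header row.
spanish = pd.read_csv('spanish.txt', header=None)
# With regex=True the dictionary keys are replaced inside each word.
spanish = spanish.replace(accent_dict, regex=True)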

The code will be cleaner.
One line and faster processing instead of a loop.

Quote

In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't on the list, and those pass through without getting de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead. It checks if the character's Unicode name contains "WITH" (WITH ACUTE, WITH CIRCUMFLEX, etc.), removes that part from the name, and then does a reverse lookup of the (plain) character from the shortened name.

As a side effect, this not only removes diacritics from Latin characters, it also removes diacritics from non-Latin characters such as Greek or Arabic, as long as the character's Unicode name contains the word WITH; if it doesn't, the character is left alone. E.g.:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable but I don't need such support for this project. This stack overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.

A dictionary is a better approach than using code, in my opinion.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: Coding Enthusiast on September 12, 2020, 10:09:01 PM

Sometimes the implementations want the option to perform a binary search on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`

Good catch, reminding me of the exact number of words that should be in a wordlist. I will also implement a check which validates that there are exactly 2048 words in the list.
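Something along these lines would do for the count check. This is only a sketch: `check_word_count` is a hypothetical name, and the duplicate check is an extra assumption on top of what was discussed:

```python
EXPECTED_WORDS = 2048  # 2**11, so each word encodes 11 bits

def check_word_count(words, expected=EXPECTED_WORDS):
    """Return (passed, message) for the word-count test."""
    if len(words) != expected:
        return False, f"expected {expected} words, found {len(words)}"
    unique = len(set(words))
    if unique != len(words):
        return False, f"{len(words) - unique} duplicate word(s) found"
    return True, "word count OK"

sample = [f"w{i}" for i in range(2048)]
print(check_word_count(sample))         # (True, 'word count OK')
print(check_word_count(sample[:2005]))  # (False, 'expected 2048 words, found 2005')
```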

Coding Enthusiast

legendary

Activity: 1042

Merit: 2805

Bitcoin and C♯ Enthusiast

Quote from: bitmover on September 12, 2020, 03:18:07 PM

Sorting is not required

Sometimes the implementations want the option to perform a binary search on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`
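To illustrate why sorting matters, here is a small Python sketch of such a lookup using the standard `bisect` module. `word_index` is a made-up helper, and it assumes the list is sorted by plain code point order, which is what Python's default string comparison uses:

```python
from bisect import bisect_left

def word_index(sorted_words, word):
    """Return the index of `word` in a sorted list, or -1 if it is absent."""
    i = bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return i
    return -1

words = ["able", "cable", "table", "unable", "viable"]
print(word_index(words, "table"))  # 2
print(word_index(words, "fable"))  # -1
```

On an unsorted list the same call can miss words that are actually present, which is why the validator needs to know whether the list is in order.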

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: bitmover on September 12, 2020, 03:18:07 PM

Sorting is not required, but as it is extremely easy to do in any software or language (even in Excel), I think it is basic and elegant to submit your list sorted. In no way would I submit my list in a random order, unless there were a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

I designed my program to have as little user interaction as possible, because it's easier to show some sort of report card that shows you the status of each test, and where specifically each test went wrong, so you can immediately go to that part of the file and fix it. In simple terms, my program lets you control which tests to enable from the command line, it prints progress and status messages as it performs each test, and it tells you which tests passed and failed.

I am not comfortable with modifying the wordlist file in place, because the code could have bugs that mistakenly mess up the wordlist. I've made that mistake too many times in other projects and I don't want to take any chances here. So I think I will just print a warning if it detects the list isn't sorted.
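Roughly, the warning-only check could look like the sketch below. `find_unsorted` is a hypothetical name and the message format is just an example; the point is that the input list is only read, never rewritten:

```python
def find_unsorted(words):
    """Return (line_number, word, previous_word) for each word that sorts before its predecessor."""
    problems = []
    for i in range(1, len(words)):
        if words[i] < words[i - 1]:
            # Line numbers are 1-based so they match what an editor shows.
            problems.append((i + 1, words[i], words[i - 1]))
    return problems

for line, word, prev in find_unsorted(["abaixo", "abater", "abacaxi", "zebra"]):
    print(f"WARNING line {line}: '{word}' sorts before the previous word '{prev}'")
# WARNING line 3: 'abacaxi' sorts before the previous word 'abater'
```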

Quote from: bitmover on September 12, 2020, 03:18:07 PM

I made a dictionary with all special characters and replaced Spanish special characters using that dictionary. Then I checked for common words in my list and spanish. I can share the code if you wish.

My dictionary is here
https://bitcointalksearch.org/topic/m.55131643

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:

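chars = list(word)  # strings are immutable, so work on a list of characters
for i in range(len(chars)):
  char = chars[i]
  try:
    chars[i] = bitmover_table[char]
  except KeyError:
    pass  # character is not in the table; leave it unchanged
word = "".join(chars)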

So the letters in your dictionary, are they only from the Spanish alphabet or did you include letters used in other European languages? In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't on the list, and those pass through without getting de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead. It checks if the character's Unicode name contains "WITH" (WITH ACUTE, WITH CIRCUMFLEX, etc.), removes that part from the name, and then does a reverse lookup of the (plain) character from the shortened name.

As a side effect, this not only removes diacritics from Latin characters, it also removes diacritics from non-Latin characters such as Greek or Arabic, as long as the character's Unicode name contains the word WITH; if it doesn't, the character is left alone. E.g.:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable but I don't need such support for this project. This stack overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.
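For reference, the name-based idea can be sketched with the standard `unicodedata` module as below. This is my own reconstruction of the approach described above, not a copy of the Stack Overflow answer, and the function names are made up:

```python
import unicodedata

def strip_with_suffix(char):
    """Return the character named by `char`'s Unicode name up to ' WITH '."""
    name = unicodedata.name(char, "")  # "" for characters that have no name
    base_name, sep, _ = name.partition(" WITH ")
    if not sep:
        return char  # no WITH in the name: leave the character alone
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return char  # no character carries the shortened name

def remove_diacritics(word):
    return "".join(strip_with_suffix(c) for c in word)

print(remove_diacritics("\u00e1baco"))  # LATIN SMALL LETTER A WITH ACUTE -> "abaco"
print(remove_diacritics("\u03ac"))      # GREEK SMALL LETTER ALPHA WITH TONOS -> "α"
```

Note that this only handles precomposed characters. A letter followed by a separate combining mark has no WITH in either name, so it passes through unchanged.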

bitmover

legendary

Activity: 2352

Merit: 6089

bitcoindata.science

Quote from: NotATether on September 12, 2020, 12:21:12 PM

There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Sorting is not required, but as it is extremely easy to do in any software or language (even in Excel), I think it is basic and elegant to submit your list sorted. In no way would I submit my list in a random order, unless there were a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

Quote

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Quote from: https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md#spanish

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Personally, I think that accepting words with accents is a big mistake. I would reject it straight away, and I wouldn't use that word list.

Because "ábaco" and "abaco" are different words, and that could lead to problems in some software.

Portuguese list won't have words with special characters.

Quote

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?

I made a dictionary with all special characters and replaced Spanish special characters using that dictionary. Then I checked for common words in my list and spanish. I can share the code if you wish.

My dictionary is here
https://bitcointalksearch.org/topic/m.55131643
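In plain Python the idea looks roughly like this. The table below is only a short hypothetical excerpt (the full dictionary is in the linked post), and the word samples are made up for the example:

```python
# Hypothetical excerpt of the accent table; extend it as needed.
ACCENTS = str.maketrans({"á": "a", "é": "e", "í": "i", "ó": "o",
                         "ú": "u", "ü": "u", "ñ": "n"})

def plain(word):
    """Replace every accented character that has an entry in the table."""
    return word.translate(ACCENTS)

def common_words(list_a, list_b):
    """Words that appear in both lists once accents are replaced."""
    return sorted({plain(w) for w in list_a} & {plain(w) for w in list_b})

spanish = ["ábaco", "abdomen", "pingüino"]
portuguese = ["abaco", "abdomen", "abelha"]
print(common_words(spanish, portuguese))  # ['abaco', 'abdomen']
```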

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Quote from: https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md#spanish

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?

bitmover

legendary

Activity: 2352

Merit: 6089

bitcoindata.science

Quote from: Coding Enthusiast on September 08, 2020, 06:40:33 AM

Quote from: bitmover on September 08, 2020, 06:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

I think you should try to avoid it if possible. It's definitely beneficial to keep all the words as distinct as possible. For example, in the case above, simply bad handwriting could cause confusion between the letters 'c' and 't' in "cable" and "table", and having several of these mistakes in a mnemonic could potentially make recovery impossible.

Thanks for your suggestion. We decided to keep that restriction (removing all pairs with Levenshtein distance = 1). Our wordlist is going to be the one with the most restrictive rules.
The French wordlist also excluded pairs with Levenshtein distance = 1; however, they didn't worry about repeating words from other lists like we did.

https://github.com/bitcoin/bips/pull/152#issuecomment-412618598

I hope our list will be accepted quickly. We did nice work.

Quote from: NotATether on September 08, 2020, 05:24:29 PM

P.S. As for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along the validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing, and then it should be ready for publishing.

Share this code with us when you are done. There are still other languages that need a wordlist, and your program may also be used in other projects that we don't know of yet.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: Coding Enthusiast on September 08, 2020, 06:12:59 AM

Quote from: ?? on ??

Interesting find. What do you think about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and I think it'd be easier to determine the threshold.

The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
Damerau–Levenshtein adds transposition to the above 3.
Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I believe the results would lead to pretty much the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example, Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means more similar), while Levenshtein returns 1, which is a much clearer indication of the pair being bad.

I am against using Jaro-Winkler similarity for measuring distances because it is tainted by weighting earlier characters more heavily. I feel like it's trying to take on the task of both measuring distance and counting initial unique characters, but it is not effective at either, because it just adds the distance metric and a heavily scaled-down initial-character uniqueness measurement together. IMHO, adding two metrics together just ruins the measurement.

Jaro similarity is a little better, since it just measures distance. However, I notice that character swaps carry less weight in the metric than the presence of unique characters in either of the two words. That makes sense if you are feeding a program input that has several similar words in it, but swaps between adjacent characters and deletions are the most common mistakes people make when copying from a wordlist. Plus, a fractional metric doesn't lend itself well to quantifying the number of character replacements you need to get from one distance to another, at least for the human brain. I can't just say, "for a Jaro distance of 0.7 I need to change characters to make it 0.6".

For typing, there are also typos made not by swapping but by hitting the adjacent key on a QWERTY keyboard. Such a typo is just a substitution, which can be modeled as a deletion and insertion pair; additional typos can replace one of the deletions and one of the insertions with a swap*. There may already be production algorithms that take the proximity of neighboring QWERTY keys into account when measuring similarity, penalizing insert/delete pairs of nearby keys more harshly than distant ones; they must exist in some form, since search engines can detect typos. And I think that should be the goal when making a wordlist: to filter out as many opportunities to make typos as possible. The guidelines for similarity checking were only created because a small group can't be expected to make such a sophisticated checker.

*That's why I think counting swaps with one point instead of two points of insert/delete is bad for distance measurements, as we are falsely making the word pair look more unique. Hence my argument against using Damerau-Levenshtein.

So for simpletons like us, none of the alternative algorithms are good for our needs, and Levenshtein is our best measuring ruler that is not complex to implement.
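To show how little code it takes, here is a minimal pure-Python sketch of the textbook dynamic-programming version. It is an illustration only, not the code from my validator:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("cable", "table"))    # 1
print(levenshtein("unable", "viable"))  # 2
print(levenshtein("ab", "ba"))          # 2
```

The last line shows that plain Levenshtein counts a swap of two adjacent characters as two edits.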

Quote from: bitmover on September 08, 2020, 06:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

A Levenshtein distance of one can mean that only one substitution needs to be made, like "fish" --> "fist", which carries a risk of the word being misspelt, as Coding Enthusiast mentioned. Insertions and deletions are harder to get wrong, though, because a user has to subconsciously type an extra character or omit one. So if you absolutely must use distance-1 word pairs, then use ones with an extra or missing letter. It is unlikely, however, that users will get two substituted characters wrong.
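Telling the two kinds of distance-1 pairs apart does not need the full distance computation. A small sketch, where `one_edit_kind` is a hypothetical helper:

```python
def one_edit_kind(a, b):
    """Return 'substitution' or 'insertion/deletion' for a distance-1 pair, else None."""
    if len(a) == len(b):
        differing = sum(x != y for x, y in zip(a, b))
        return "substitution" if differing == 1 else None
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        # Dropping some single character of the longer word must give the shorter one.
        if any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer))):
            return "insertion/deletion"
    return None

print(one_edit_kind("fish", "fist"))      # substitution
print(one_edit_kind("able", "cable"))     # insertion/deletion
print(one_edit_kind("unable", "viable"))  # None
```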

P.S. As for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along the validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing, and then it should be ready for publishing.

Coding Enthusiast

legendary

Activity: 1042

Merit: 2805

Bitcoin and C♯ Enthusiast

Quote from: bitmover on September 08, 2020, 06:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

I think you should try to avoid it if possible. It's definitely beneficial to keep all the words as distinct as possible. For example, in the case above, simply bad handwriting could cause confusion between the letters 'c' and 't' in "cable" and "table", and having several of these mistakes in a mnemonic could potentially make recovery impossible.

bitmover

legendary

Activity: 2352

Merit: 6089

bitcoindata.science

Quote from: Coding Enthusiast on September 08, 2020, 01:47:42 AM

I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).

This is a great find.
A distance of 2 is certainly OK, because forbidding it would be too restrictive, but I didn't know whether a distance of 1 would be acceptable.

Looking carefully at https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md, I found that only the French list is concerned with Levenshtein distance.

Quote

French
10. No very similar words with 1 letter of difference.

https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md#french

This is the same as requiring a Levenshtein distance > 1.

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

Coding Enthusiast

legendary

Activity: 1042

Merit: 2805

Bitcoin and C♯ Enthusiast

Quote from: ?? on ??

Interesting find. What do you think about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and I think it'd be easier to determine the threshold.

The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
Damerau–Levenshtein adds transposition to the above 3.
Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I believe the results would lead to pretty much the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example, Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means more similar), while Levenshtein returns 1, which is a much clearer indication of the pair being bad.

P.S. You can run the Jaro-Winkler algorithm here on SharpLab; just change s1 and s2 in the Main() method.
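For anyone who prefers Python, below is a rough sketch of both similarity measures as they are usually defined. It is not the SharpLab code; the 0.1 scaling factor and the 4-character prefix cap are the commonly used defaults and are assumptions here. In this sketch 1.0 means the strings are identical.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return sim + prefix * scaling * (1 - sim)

print(round(jaro_winkler("cable", "table"), 3))  # 0.867
print(jaro_winkler("AAAXYZ", "AAAUVW") > jaro_winkler("XZYAAA", "UVWAAA"))  # True
```

"cable" and "table" share no prefix, so Jaro and Jaro-Winkler give the same value for that pair.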

Coding Enthusiast

legendary

Activity: 1042

Merit: 2805

Bitcoin and C♯ Enthusiast

I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).
Other languages don't seem to be any better. Here are just some example words, not all of them:
The first word in the Italian list is "abaco", which has the similar word "baco"; there are also "sino" and "asino".
French seems better but it has words with distance=2 like "apaiser", "abaisser"
Spanish has "bono", "abono" and "abrazo", "brazo"
Czech has words with distance=2 like "abeceda", "beseda" and "adresa", "agrese"
Japanese has "あいさつ", "かいさつ" and "あきる", "あける"
Korean has "가격", "간격"
Chinese results don't make much sense, but since the last 3 are complicated languages I'm not sure the Levenshtein distance is even valid for them.

bitmover

legendary

Activity: 2352

Merit: 6089

bitcoindata.science

@ETFBitcoin and @NotATether

I wanna share this with you both. I did it thanks to your suggestions.

Following ETFBitcoin's library suggestion (which is pretty fast, by the way), I wrote code that generated this matrix (we had to delete some words and now we only have 2005).

As you can see, this matrix shows the Levenshtein distance of each of the 2005 words to every other word, computed in a loop. Word 0 compared with word 0 is the same word, so its distance is 0. The same holds for word 1 compared with word 1, and so on, so there is a diagonal line of zeros running to the end of the last row.

I was able to identify all the values where distance was 1 and I generated a dictionary with those coordinates in the matrix:

Code:

{1: 164, 4: 182, 16: 1521, 23: 516, 31: 567, 32: 33, 33: 32, 35: 67, 39: 677, 51: 1305, 57: 126, 60: 51, 67: 35, 75: 78, 76: 1261, 78: 75, 83: 1655, 103: 107, 104: 140, 105: 106, 106: 105, 107: 103, 117: 1376, 126: 57, 128: 690, 140: 1928, 148: 176, 158: 178, 161: 1767, 164: 1, 166: 1910, 169: 1914, 175: 181, 176: 148, 178: 158, 181: 175, 182: 4, 183: 1019, 187: 221, 188: 205, 190: 697, 194: 234, 195: 1681, 200: 708, 205: 188, 220: 730, 221: 187, 228: 247, 231: 236, 233: 1610, 234: 194, 235: 236, 236: 235, 238: 228, 244: 750, 245: 1617, 247: 228, 252: 255, 254: 1869, 255: 252, 266: 471, 270: 292, 272: 1237, 274: 672, 280: 1102, 281: 1642, 283: 678, 284: 286, 286: 1528, 287: 280, 292: 270, 313: 1135, 314: 318, 315: 1538, 317: 1373, 318: 347, 321: 1014, 329: 1384, 338: 1928, 340: 341, 341: 340, 343: 703, 345: 348, 346: 621, 347: 318, 348: 345, 371: 886, 382: 1598, 395: 408, 399: 976, 402: 737, 403: 405, 405: 403, 408: 1202, 426: 446, 427: 1833, 432: 438, 433: 1214, 438: 432, 439: 433, 443: 1844, 446: 426, 461: 338, 466: 1497, 471: 266, 474: 795, 493: 1573, 516: 23, 517: 1590, 523: 1429, 538: 1976, 539: 550, 540: 544, 542: 538, 544: 1076, 549: 1082, 550: 539, 554: 1279, 557: 857, 567: 31, 568: 726, 613: 659, 619: 1585, 621: 346, 627: 635, 635: 627, 659: 1947, 672: 274, 677: 39, 678: 283, 687: 705, 690: 128, 691: 725, 694: 1027, 696: 1677, 697: 703, 699: 1394, 702: 710, 703: 705, 705: 703, 708: 760, 710: 1811, 715: 1957, 724: 715, 725: 691, 726: 568, 730: 220, 733: 1095, 735: 1073, 737: 402, 740: 1451, 742: 1207, 746: 1214, 747: 1838, 748: 742, 750: 244, 758: 761, 759: 1573, 760: 708, 761: 1862, 764: 759, 765: 1615, 768: 769, 769: 768, 770: 773, 771: 991, 773: 770, 782: 51, 785: 1002, 790: 792, 792: 790, 794: 1923, 795: 785, 798: 802, 801: 1681, 802: 798, 803: 1799, 825: 1196, 827: 1450, 834: 1614, 839: 1851, 842: 1468, 846: 1321, 853: 1878, 857: 557, 859: 1090, 882: 1458, 886: 371, 896: 1092, 912: 915, 915: 912, 919: 945, 945: 919, 955: 1717, 958: 971, 961: 1361, 969: 
1381, 971: 958, 973: 1393, 976: 399, 978: 1073, 979: 978, 987: 989, 989: 987, 991: 771, 993: 1334, 997: 1102, 1002: 785, 1010: 997, 1014: 321, 1015: 1025, 1016: 1143, 1018: 1379, 1019: 183, 1025: 1015, 1027: 694, 1028: 1053, 1031: 1155, 1036: 1801, 1038: 1573, 1042: 1414, 1044: 1038, 1046: 1042, 1048: 1065, 1051: 1055, 1053: 1028, 1054: 1070, 1055: 1051, 1057: 1307, 1058: 1962, 1063: 1069, 1065: 1066, 1066: 1065, 1069: 1600, 1070: 1054, 1073: 978, 1074: 1716, 1075: 1829, 1076: 1074, 1077: 1205, 1082: 549, 1087: 1738, 1090: 1094, 1092: 896, 1094: 1090, 1095: 733, 1098: 1112, 1102: 997, 1112: 1113, 1113: 1112, 1122: 1654, 1131: 1166, 1134: 1920, 1135: 313, 1143: 1016, 1147: 1381, 1154: 1928, 1155: 1031, 1166: 1131, 1169: 1154, 1177: 1184, 1184: 1177, 1195: 1607, 1196: 1200, 1200: 1196, 1202: 408, 1203: 1305, 1205: 1207, 1207: 1205, 1214: 746, 1221: 1147, 1225: 1624, 1231: 1225, 1237: 1248, 1248: 1381, 1252: 1385, 1261: 76, 1268: 1277, 1272: 1332, 1277: 1268, 1279: 554, 1305: 1203, 1307: 1057, 1321: 1339, 1332: 1272, 1334: 1844, 1339: 1321, 1346: 1360, 1358: 1775, 1360: 1346, 1361: 961, 1367: 1381, 1373: 317, 1374: 1413, 1376: 117, 1379: 1461, 1381: 1468, 1384: 329, 1385: 1252, 1392: 1395, 1393: 1413, 1394: 699, 1395: 1392, 1398: 1931, 1403: 1686, 1405: 1573, 1409: 1413, 1413: 1409, 1414: 1415, 1415: 1414, 1429: 523, 1448: 2001, 1450: 827, 1451: 1452, 1452: 1451, 1454: 1829, 1456: 1458, 1457: 1462, 1458: 1456, 1461: 1379, 1462: 1620, 1464: 1462, 1468: 1477, 1470: 1468, 1473: 1478, 1477: 1585, 1478: 1473, 1485: 1491, 1491: 1485, 1495: 1501, 1497: 466, 1501: 1495, 1519: 1527, 1521: 16, 1523: 1908, 1524: 1767, 1525: 1541, 1527: 1519, 1528: 286, 1529: 1771, 1538: 1541, 1541: 1538, 1548: 1568, 1551: 1574, 1556: 1559, 1559: 1673, 1564: 1574, 1568: 1548, 1571: 1801, 1573: 1405, 1574: 1564, 1585: 1811, 1590: 1600, 1596: 1600, 1598: 382, 1600: 1596, 1607: 1195, 1608: 1980, 1610: 233, 1612: 1829, 1614: 834, 1615: 765, 1617: 245, 1620: 1462, 1624: 1225, 1631: 1644, 1635: 1905, 
1637: 1638, 1638: 1637, 1642: 281, 1644: 1631, 1654: 1122, 1655: 83, 1659: 1662, 1662: 1659, 1667: 1681, 1668: 1713, 1673: 1559, 1677: 1681, 1679: 1795, 1681: 1700, 1686: 1403, 1687: 1804, 1695: 1946, 1700: 1681, 1713: 1668, 1715: 1730, 1716: 1721, 1717: 1736, 1721: 1716, 1727: 1730, 1730: 1727, 1736: 1717, 1738: 1754, 1749: 1738, 1754: 1738, 1767: 1524, 1768: 1775, 1769: 1910, 1771: 1769, 1775: 1768, 1788: 1795, 1790: 1798, 1791: 1799, 1793: 1928, 1795: 1788, 1798: 1790, 1799: 1791, 1801: 1571, 1804: 1687, 1811: 1585, 1822: 1851, 1824: 1872, 1829: 1612, 1833: 1824, 1834: 1840, 1838: 747, 1840: 1834, 1844: 1334, 1851: 1822, 1859: 1872, 1862: 1859, 1869: 254, 1872: 1859, 1878: 853, 1884: 1888, 1885: 1888, 1888: 1885, 1905: 1925, 1906: 1991, 1907: 1949, 1908: 1924, 1909: 1912, 1910: 1914, 1912: 1909, 1914: 1910, 1920: 1966, 1923: 1972, 1924: 1908, 1925: 1905, 1928: 1793, 1931: 1398, 1932: 1933, 1933: 1932, 1946: 1695, 1947: 659, 1949: 1966, 1953: 1970, 1955: 1970, 1957: 1959, 1959: 1957, 1962: 1058, 1966: 1949, 1970: 1955, 1972: 1923, 1976: 538, 1977: 2000, 1980: 1984, 1984: 1980, 1985: 1986, 1986: 1985, 1991: 1906, 2000: 1977, 2001: 1448}
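For anyone who wants to reproduce this without an extra library, here is a minimal pure-Python sketch of the matrix and the pair extraction. It is not the code I actually ran: the `levenshtein` helper stands in for the library function, the other names are made up, and the four sample words are just for illustration.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_matrix(words):
    """Square matrix of distances; the diagonal stays 0 (word vs itself)."""
    n = len(words)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(words[i], words[j])
    return matrix

def close_pairs(words, matrix, distance=1):
    """Every (row, column) whose entry equals `distance`, as 'word - word' strings."""
    n = len(words)
    return [f"{words[i]} - {words[j]}"
            for i in range(n) for j in range(n)
            if matrix[i][j] == distance]

words = ["alho", "alvo", "olho", "zebra"]
print(close_pairs(words, distance_matrix(words)))
# ['alho - alvo', 'alho - olho', 'alvo - alho', 'olho - alho']
```

One thing to keep in mind: a dictionary keyed by row index can hold only one column per row, so a word with several distance-1 neighbours appears to show just one of them. Collecting the pairs in a list, as in this sketch, keeps all of them.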

Now it was easy. With the coordinates in the matrix, I just generated an array with all collided pairs:

Code:

['abaixo - baixo',
 'abater - bater',
 'achar - rachar',
 'adiante - diante',
 'afetivo - efetivo',
 'aflito - afoito',
 'afoito - aflito',
 'agora - amora',
 'agulha - fagulha',
 'alho - olho',
 'altitude - atitude',
 'alvo - alho',
 'amora - agora',
 'anel - anil',
 'anexo - nexo',
 'anil - anel',
 'anta - santa',
 'arca - arma',
 'areia - aveia',
 'argila - argola',
 'argola - argila',
 'arma - arca',
 'assado - passado',
 'atitude - altitude',
 'ator - fator',
 'aveia - veia',
 'babado - barbado',
 'bagulho - barulho',
 'bainha - tainha',
 'baixo - abaixo',
 'bala - vala',
 'balsa - valsa',
 'barata - batata',
 'barbado - babado',
 'barulho - bagulho',
 'batata - barata',
 'bater - abater',
 'batido - latido',
 'beato - boato',
 'beco - bico',
 'beira - feira',
 'beliche - boliche',
 'belo - selo',
 'besta - festa',
 'bico - beco',
 'bloco - floco',
 'boato - beato',
 'bode - boxe',
 'boldo - bolso',
 'bolha - rolha',
 'boliche - beliche',
 'bolo - bolso',
 'bolso - bolo',
 'bonde - bode',
 'bossa - fossa',
 'botina - rotina',
 'boxe - bode',
 'briga - brita',
 'brincar - trincar',
 'brita - briga',
 'busto - custo',
 'cabelo - camelo',
 'cabo - nabo',
 'cabuloso - fabuloso',
 'cadeira - madeira',
 'caibro - saibro',
 'caixa - faixa',
 'cajado - calado',
 'calado - ralado',
 'caldeira - cadeira',
 'camelo - cabelo',
 'carinho - marinho',
 'carneiro - carteiro',
 'caro - raro',
 'carreira - parreira',
 'carteiro - certeiro',
 'casca - lasca',
 'causar - pausar',
 'ceia - veia',
 'cenoura - censura',
 'censura - cenoura',
 'cera - fera',
 'cereja - cerveja',
 'cerrado - errado',
 'certeiro - carteiro',
 'cerveja - cereja',
 'cidade - idade',
 'cisco - risco',
 'coceira - coleira',
 'coelho - joelho',
 'coice - foice',
 'coifa - coisa',
 'coisa - coifa',
 'coleira - moleira',
 'copeiro - coveiro',
 'copo - topo',
 'corja - coruja',
 'corno - morno',
 'coruja - corja',
 'corvo - corno',
 'couro - touro',
 'coveiro - copeiro',
 'cuia - ceia',
 'cunhado - punhado',
 'custo - busto',
 'data - gata',
 'dente - rente',
 'diante - adiante',
 'dica - rica',
 'dinheiro - pinheiro',
 'doador - voador',
 'dobrado - dourado',
 'doca - dona',
 'domador - doador',
 'dona - lona',
 'dotado - lotado',
 'dourado - dobrado',
 'dublado - nublado',
 'dueto - gueto',
 'efetivo - afetivo',
 'eixo - fixo',
 'enxame - exame',
 'ereto - reto',
 'errado - cerrado',
 'escola - esmola',
 'esmola - escola',
 'exame - vexame',
 'fabuloso - cabuloso',
 'fagulha - agulha',
 'faixa - caixa',
 'farpa - ferpa',
 'fator - ator',
 'favela - fivela',
 'febre - lebre',
 'feio - seio',
 'feira - fera',
 'feixe - peixe',
 'feno - feto',
 'fera - ferpa',
 'ferpa - fera',
 'festa - fresta',
 'feto - teto',
 'figa - viga',
 'fita - figa',
 'fivela - favela',
 'fixo - eixo',
 'floco - bloco',
 'fluxo - luxo',
 'fogo - logo',
 'foice - coice',
 'folia - polia',
 'fonte - monte',
 'forno - morno',
 'forrar - torrar',
 'forte - fonte',
 'fossa - bossa',
 'freio - frevo',
 'frente - rente',
 'fresta - festa',
 'frevo - trevo',
 'fronte - frente',
 'frota - rota',
 'fundo - fungo',
 'fungo - fundo',
 'funil - fuzil',
 'furado - jurado',
 'fuzil - funil',
 'galho - alho',
 'gama - lama',
 'garoupa - garupa',
 'garupa - garoupa',
 'gasto - vasto',
 'gata - gama',
 'geada - gemada',
 'gelo - selo',
 'gemada - geada',
 'gemido - temido',
 'goela - moela',
 'goleiro - poleiro',
 'gosto - rosto',
 'gralha - tralha',
 'grato - prato',
 'grelha - orelha',
 'gruta - truta',
 'gueto - dueto',
 'gula - lula',
 'horta - porta',
 'idade - cidade',
 'ilustre - lustre',
 'incolor - indolor',
 'indolor - incolor',
 'inferno - inverno',
 'inverno - inferno',
 'isolado - solado',
 'jaca - jeca',
 'janela - panela',
 'jato - pato',
 'jeca - jaca',
 'jeito - peito',
 'joelho - coelho',
 'jogo - logo',
 'joio - jogo',
 'julho - junho',
 'junho - julho',
 'jurado - furado',
 'juro - ouro',
 'ladeira - madeira',
 'lama - gama',
 'lareira - ladeira',
 'lasca - casca',
 'laser - lazer',
 'lastro - mastro',
 'latente - patente',
 'latido - batido',
 'lazer - laser',
 'lebre - febre',
 'legado - ligado',
 'leigo - meigo',
 'lenda - tenda',
 'lente - rente',
 'lesado - pesado',
 'leste - lente',
 'levado - lesado',
 'liberal - literal',
 'licitar - limitar',
 'ligado - legado',
 'ligeiro - lixeiro',
 'limitar - licitar',
 'limpo - olimpo',
 'linda - vinda',
 'lisa - lixa',
 'literal - litoral',
 'litoral - literal',
 'lixa - rixa',
 'lixeiro - ligeiro',
 'logo - jogo',
 'loja - soja',
 'lombo - tombo',
 'lona - loja',
 'longe - monge',
 'lotado - dotado',
 'luar - suar',
 'lula - luva',
 'lustre - ilustre',
 'luva - lula',
 'luxo - fluxo',
 'machado - malhado',
 'madeira - ladeira',
 'malhado - malvado',
 'malvado - malhado',
 'mangue - sangue',
 'marcador - mercador',
 'margem - vargem',
 'marinho - carinho',
 'mastro - lastro',
 'mato - pato',
 'meia - veia',
 'meigo - leigo',
 'mercador - marcador',
 'mesa - meia',
 'miado - mimado',
 'mimado - miado',
 'moedor - roedor',
 'moela - mola',
 'mola - moela',
 'moleira - coleira',
 'molho - olho',
 'monge - monte',
 'monte - monge',
 'morno - forno',
 'moto - mato',
 'mugido - rugido',
 'munido - mugido',
 'nabo - nato',
 'nato - pato',
 'navio - pavio',
 'nexo - anexo',
 'noivo - novo',
 'nosso - osso',
 'novo - noivo',
 'nublado - dublado',
 'olho - molho',
 'olimpo - limpo',
 'orelha - ovelha',
 'osso - nosso',
 'ouro - touro',
 'ovelha - orelha',
 'padeiro - pandeiro',
 'pampa - tampa',
 'pandeiro - padeiro',
 'panela - janela',
 'papo - pato',
 'parreira - carreira',
 'parto - perto',
 'passado - assado',
 'patente - potente',
 'pato - prato',
 'pausar - causar',
 'pavio - navio',
 'pegada - pelada',
 'peito - perto',
 'peixe - feixe',
 'pelada - pegada',
 'peludo - veludo',
 'penhor - senhor',
 'pente - rente',
 'perito - perto',
 'perto - perito',
 'pesado - pescado',
 'pescado - pesado',
 'pinheiro - dinheiro',
 'poeira - zoeira',
 'poleiro - goleiro',
 'polia - polpa',
 'polpa - polia',
 'pombo - tombo',
 'ponta - porta',
 'porco - pouco',
 'porta - ponta',
 'potente - patente',
 'pouco - rouco',
 'pouso - pouco',
 'prato - preto',
 'prazo - prato',
 'pregar - prezar',
 'preto - reto',
 'prezar - pregar',
 'profeta - proveta',
 'proveta - profeta',
 'pular - puxar',
 'punhado - cunhado',
 'puxar - pular',
 'rabada - rajada',
 'rachar - achar',
 'raiar - vaiar',
 'rainha - tainha',
 'raio - raso',
 'rajada - rabada',
 'ralado - calado',
 'ralo - talo',
 'raro - raso',
 'raso - raro',
 'reator - reitor',
 'recente - repente',
 'redator - redutor',
 'redutor - sedutor',
 'regente - repente',
 'reitor - reator',
 'renda - tenda',
 'rente - pente',
 'repente - regente',
 'reto - teto',
 'rica - rixa',
 'ripa - rixa',
 'risco - cisco',
 'rixa - ripa',
 'roedor - moedor',
 'rolante - volante',
 'rolha - bolha',
 'rombo - tombo',
 'rosto - gosto',
 'rota - frota',
 'rotina - botina',
 'rouco - pouco',
 'rugido - mugido',
 'sacada - salada',
 'sadio - vadio',
 'safira - safra',
 'safra - safira',
 'saibro - caibro',
 'salada - sacada',
 'sangue - mangue',
 'santa - anta',
 'sarda - sarna',
 'sarna - sarda',
 'sebo - selo',
 'secar - socar',
 'sedutor - redutor',
 'seio - selo',
 'selar - telar',
 'selo - silo',
 'senhor - penhor',
 'sentar - tentar',
 'setor - vetor',
 'silo - selo',
 'socar - secar',
 'sogro - soro',
 'soja - soma',
 'solado - sovado',
 'soma - soja',
 'sono - soro',
 'soro - sono',
 'sovado - solado',
 'suar - suor',
 'sujar - suar',
 'suor - suar',
 'tainha - rainha',
 'taipa - tampa',
 'tala - vala',
 'talo - tala',
 'tampa - taipa',
 'tear - telar',
 'tecer - temer',
 'tecido - temido',
 'teia - veia',
 'telar - tear',
 'temer - tecer',
 'temido - tecido',
 'tenda - renda',
 'tentar - sentar',
 'teto - reto',
 'toalha - tralha',
 'toco - troco',
 'tombo - rombo',
 'topo - toco',
 'tora - tosa',
 'torrar - forrar',
 'tosa - tora',
 'touro - ouro',
 'tralha - toalha',
 'treco - troco',
 'trevo - treco',
 'trincar - brincar',
 'troco - treco',
 'truta - gruta',
 'turbo - turvo',
 'turco - turvo',
 'turvo - turco',
 'vadio - vazio',
 'vaga - zaga',
 'vagem - viagem',
 'vaiar - vazar',
 'vaidade - validade',
 'vala - valsa',
 'validade - vaidade',
 'valsa - vala',
 'vargem - virgem',
 'vasto - visto',
 'vazar - vaiar',
 'vazio - vadio',
 'veia - teia',
 'veludo - peludo',
 'vencedor - vendedor',
 'vendedor - vencedor',
 'vetor - setor',
 'vexame - exame',
 'viagem - virgem',
 'videira - viseira',
 'vieira - viseira',
 'viga - vigia',
 'vigia - viga',
 'vinda - linda',
 'virgem - viagem',
 'viseira - vieira',
 'visto - vasto',
 'voador - doador',
 'voar - zoar',
 'volante - votante',
 'votante - volante',
 'vulgo - vulto',
 'vulto - vulgo',
 'zaga - vaga',
 'zoar - voar',
 'zoeira - poeira']

The results are not as bad as they look. The pairs are doubled: just as "abaixo" is at distance 1 from "baixo", "baixo" is also at distance 1 from "abaixo". So we see everything twice here.

I learned a lot along the way. I never thought that generating these words would get so complicated. We will go back to work in the local board generating more words, as I have deleted a lot now.

I hope this approach can help your library/program, NotATether.
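A minimal sketch of how the doubling could be avoided, assuming the words are available as a plain Python list. The function names `levenshtein` and `close_pairs` are hypothetical, and this is a straightforward pairwise comparison, not the matrix-based code used above. Each unordered pair is compared only once, so it can only be reported once:

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_pairs(words, max_distance=1):
    """Return each unordered pair within max_distance exactly once."""
    return [(a, b) for a, b in combinations(words, 2)
            if levenshtein(a, b) <= max_distance]


# Small sample taken from the list above.
sample = ["abaixo", "baixo", "bala", "vala", "valsa", "zoeira"]
pairs = close_pairs(sample)
print(pairs)
```

Comparing all pairs is quadratic in the number of words, which is still manageable for a 2048-word list.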

Thank you both again.

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: bitmover on September 06, 2020, 07:55:05 AM

NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

Good question. I haven't written that part of the program yet but I want it to look something along the lines of this:

Code:

$wordlistvalidator -i wordlist.txt # Name subject to change
# Generic copyright notice here
<N> words read.

Performing Levenshtein distance test
Evaluating Levenshtein distances between <M> pairs of words...
-----
Pairs with Levenshtein distance 1:    # Omitted if there are no pairs with such distance
<word1>, line <line1>, and <word2>, line <line2>    # EDIT: This is how I want all the words to be printed, with their lines, but I am editing on mobile right now so changing the other lines like this takes too long.
 
...

Pairs with Levenshtein distance 2:    # And so on, up until and including a maximum configurable by command line argument.
...

# Or don't display this part of output if no pairs with distances up to that much are found

Finished performing Levenshtein distance test
Performing matching initial characters test
Comparing first <K> characters between <M> pairs of words...
----
Pairs with matching first <K> identical characters:    # Omitted if there are no pairs with such identical characters
 
 
...

No pairs found with matching first <K> identical characters  # Displayed if there are no such pairs
Finished matching initial characters test
Performing word length test
Checking length of <N> words...
----
Words longer than <L> characters:    # Omitted if no such words


...
No words found longer than <L> characters  # Displayed if no such words
Finished word length test

<0/1/2/3>/3 tests passed

The output is meant to be human-readable, so that problems are easy to find and address. I'm not designing it to be parsed by a second program, but I'll add an API to compensate for that.
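As a rough illustration of the two simpler tests in that output, here is a minimal Python sketch. It is not the actual program: the function names are hypothetical, and the defaults (4 leading characters, 8 characters maximum) are assumptions borrowed from the conventions of the English BIP-39 list, since the thresholds are meant to be configurable.

```python
from collections import defaultdict


def shared_prefix_groups(words, prefix_len=4):
    """Map each prefix shared by two or more words to its (word, line) entries."""
    groups = defaultdict(list)
    for line_no, word in enumerate(words, start=1):
        groups[word[:prefix_len]].append((word, line_no))
    return {prefix: entries for prefix, entries in groups.items()
            if len(entries) > 1}


def overlong_words(words, max_len=8):
    """Return (word, line) for every word longer than max_len characters."""
    return [(word, line_no)
            for line_no, word in enumerate(words, start=1)
            if len(word) > max_len]


# Illustrative sample, not taken from the real wordlist.
sample = ["cadeira", "cadeado", "olho", "vencedor", "extraordinario"]

prefix_hits = shared_prefix_groups(sample)
length_hits = overlong_words(sample)

# Print words together with their line numbers, as in the planned output.
for entries in prefix_hits.values():
    print(" and ".join(f"{word}, line {line_no}" for word, line_no in entries))
for word, line_no in length_hits:
    print(f"{word}, line {line_no}")
```

Grouping by prefix in a dictionary avoids comparing every pair of words, so this test needs only a single pass over the list.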

Update: the backends for the unique initial characters test and the word length test are complete; still working on the Levenshtein distance test.
Update 2: finished the Levenshtein distance test; now working on the front end and the command-line argument parser.
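For the argument parser, one possible shape is sketched below using Python's `argparse`. Only the `-i` flag and the program name come from the example invocation above; the long option names `--input`, `--max-distance`, `--prefix-length` and `--max-word-length`, and their defaults, are hypothetical placeholders for the configurable limits.

```python
import argparse


def build_parser():
    """Build a parser for the validator's command line (option names are assumed)."""
    parser = argparse.ArgumentParser(
        prog="wordlistvalidator",
        description="Check a BIP-39 wordlist for similar or overlong words.")
    parser.add_argument("-i", "--input", required=True,
                        help="path to the wordlist text file, one word per line")
    parser.add_argument("--max-distance", type=int, default=1,
                        help="report pairs up to and including this Levenshtein distance")
    parser.add_argument("--prefix-length", type=int, default=4,
                        help="number of leading characters that must be unique")
    parser.add_argument("--max-word-length", type=int, default=8,
                        help="report words longer than this many characters")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-i", "wordlist.txt", "--max-distance", "2"])
print(args.input, args.max_distance, args.prefix_length, args.max_word_length)
```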
