Pages:
Author

Topic: [MERGED] BIP-39 List of words in Portuguese accepted!! - page 2. (Read 1025 times)

legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.

You can add a field to upload the file directly or to add the URL. This makes little difference, as the person who is working with the wordlist certainly have the txt file in his computer or he can access github directly.

It is more important to be able to upload the txt file imo

Where are likely sites where people would want to upload the wordlist though? At least for Github and other source control sites, it can already be tracked and committed with git, so I don't really see a point in adding such a feature.
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.

You can add a field to upload the file directly or to add the URL. This makes little difference, as the person who is working with the wordlist certainly have the txt file in his computer or he can access github directly.

It is more important to be able to upload the txt file imo
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Someone just made a pull request for a Romanian wordlist: https://github.com/bitcoin/bips/pull/993

That should be a wake up call for me to speed up development. I almost finished the programmatic API. Docs are easy to write, and I think I'll use Nose as my testing suite.

EDIT: Levenshtein distance section of the API completed, initial unique characters and maximum length are almost finished



What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:
for i in range(0,len(word)):
  char = word[i]
  try:
    word[i] = bitmover_table[char]
  except KeyError as e:
    pass

*So the letters in your dictionary, are they only from the spanish alphabet or did you include letters used in other European languages?

You can add more letters if you think it is necessary, I am not sure this dictionary will cover all possibilities.
About your code, I don't like to use Loops unless it is extremely necessary. Loops are computational costly and makes your code slow.

I did this in my code:
Code:
import pandas as pd
accent_dict = {...}
spanish = pd.read_csv('spanish.txt', header = None)
spanish=spanish.replace(accent_dict , regex=True)

Code will be cleaner.
1 line and faster processing instead of a loop

Quote
In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't on the list and those pass through without getting de-accented.  So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead, it checks if the character's Unicode name has "WITH" (WITH ACCENT, WITH CIRCUMFLEX, etc.), remove that part from the name, and then reverse look up the (normal) character from the name.

As a side effect, this actually not only remove diacritics from Latin characters, it also returns diacritics from non-Latin characters like greek, arabic, as long as its unicode name has the word WITH in it, and if it doesn't, it just leaves it alone. e.g:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable but I don't need such support for this project. This stack overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.

A dictionary is a better approach than  using coding in my opinion
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Sometimes the implementations want the option to perform a binary search on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`

Good catch, reminding me of the exact number of words that should be in a wordlist. I will also implement a check which validates that there are exactly 2048 words in the list.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
Sorting is not required

Sometimes the implementations want the option to perform a binary search on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Sorting is not required,  but as it is extremely easy to do in every software of language (even in excel), I think it is very basic and elegant to submit your list sorted out. In no way I would submit my list in a random order, unless there would be a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

I designed my program to have as little user-interaction as possible, because it's easier to show some sort of report card that shows you the status of each test, and where specifically in each test is wrong so you can immediately go to that part of the file and fix it. In simple terms, my program lets you can control which tests to enable from the command line, it prints progresses and status messages as it performs each tests, and it tells you which tests passed and failed.

I am not comfortable with modifying the wordlist file in-place because it could have bugs that mistakenly mess up the wordlist. I've done that error too many times in other projects and I don't want to take any chances here. So I think I will just print a warning if it detects the list isn't sorted.

I made a dictionary with all special characters and replaced Spanish special characters using that dictionary.  Then I checked for common words in my list and spanish. I can share the code if you wish.

My dictionary is here
https://bitcointalksearch.org/topic/m.55131643

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:
for i in range(0,len(word)):
  char = word[i]
  try:
    word[i] = bitmover_table[char]
  except KeyError as e:
    pass

So the letters in your dictionary, are they only from the spanish alphabet or did you include letters used in other European languages? In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't on the list and those pass through without getting de-accented.  So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead, it checks if the character's Unicode name has "WITH" (WITH ACCENT, WITH CIRCUMFLEX, etc.), remove that part from the name, and then reverse look up the (normal) character from the name.

As a side effect, this actually not only remove diacritics from Latin characters, it also returns diacritics from non-Latin characters like greek, arabic, as long as its unicode name has the word WITH in it, and if it doesn't, it just leaves it alone. e.g:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable but I don't need such support for this project. This stack overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.
Sorting is not required,  but as it is extremely easy to do in every software of language (even in excel), I think it is very basic and elegant to submit your list sorted out. In no way I would submit my list in a random order, unless there would be a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

Quote

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

personally, I think that accepting words with accent a big mistake. I would reject it straight away. And I wouldn't use that word list

Because " àbaco" and "abaco" are different words and it could lead to some problem in some software.

Portuguese list won't have words with special characters.

Quote

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?

I made a dictionary with all special characters and replaced Spanish special characters using that dictionary.  Then I checked for common words in my list and spanish. I can share the code if you wish.

My dictionary is here
https://bitcointalksearch.org/topic/m.55131643
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
What do you think Coding Enthusiast, should we try to keep Levenshtein  distance > 1? That is not an easy tasks but certainly doable.
I think you should try to avoid it if possible. It's definitely beneficial to keep all the words as distinct as possible. For example in the case above simply a bad handwriting could cause issues between letter 'c' and 't' in "cable" and "table" and having multiple one of these mistakes in a mnemonic could potentially make recovery impossible.

Thanks for your suggestion. We decided to keep with that restriction (removing all Levenshtein distance =1). Our wordlist is going to be the one with the most restricted rules.
French wordlist followed Levenshtein distance =1 rule, however they didn't worry about repeting words from others lists like we did.

https://github.com/bitcoin/bips/pull/152#issuecomment-412618598


I hope our list will be quickly accepted. We did a nice work.

P.S. as for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can know how far in validation it's in. When that's finished the program will be complete, I'll just have to write a stable API, documentation and unit testing for the whole thing and then it should be ready for publishing.

Share with us this code when you are done. There are still other languages to make a wordlist. and your program may also be used in other projects that we don't know of yet.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Interesting find. What do you think about about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and i think it'd be easier to determine the threshold.
  • The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
  • Damerau–Levenshtein adds transposition to the above 3.
  • Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
  • Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I believe the results could pretty much give the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means less similar) while Levenshtein returns 1 which is a much better indication of it being bad.

I am against using Jaro-Winkler similarity for measuring distances because it is tainted by its weighing earlier characters more. I feel like it's trying to take on the task of both measuring distance and counting initial unique characters, but it is not effective for measuring either of them because it just adds the distance metric and a very scaled down initial character uniqueness measurement together. IMHO adding two metrics together just ruins the measurement.

Jaro similarity is a little better, it just measures distance but I notice that character swaps have less weighting on the metric than the presence of unique characters in either two words, which makes sense if you are feeding a program with input that has several similar words in it, but interchanges between adjacent characters and deletions are the most common mistakes people make when writing from a wordlist. Plus the percentage metric doesn't lend itself well to quantifying the number of character replacements you need to get from one distance to another, at least by the human brain. I can't just say, "for a Jaro distance of 0.7 I need to change characters to make it 0.6".

For typing there are also typos made not by swapping but by typing the adjacent character on the Qwerty keyboard. A typo is just a substitution, which can be modeled by deletion and insert pair, additional typos can replace one of the deletes and one of the inserts with a swap*. There may already be production algorithms that take proximity of the neighboring Qwerty characters into account when measuring similarity, measuring insert/delete pairs of nearby keyboard characters more harshly than distant characters, and they do exist since search engines can detect typos. And I think that should be the goal when making a wordlist, to filter out as many opportunities to make typos as possible. The guidelines for similarity checking were only created because a small group can't be expected to make such a sophisticated checker Smiley

*That's why I think counting swaps with one point instead of two points of insert/delete is bad for distance measurements, as we are falsely making the word pair look more unique. Hence my argument against using Damerau-Levenshtein.

So for simpletons like us, none of the alternative algorithms are good for our needs, and Levenshtein is out best measuring ruler that is not complex to implement.

What do you think Coding Enthusiast, should we try to keep Levenshtein  distance > 1? That is not an easy task but certainly doable.

Levenshtein distance of one means only one substitution needs to be made, like "fish" --> "fist", which has a risk of being spelt incorrectly like Coding Enthusiast mentioned. Insertions and deletions are harder to get wrong though, because a user has to subconsciously type an extra character or omit one. so if you absolutely must use distance 1 word pairs then use ones with an extra or missing letter. But it is unlikely that users will miswrite 2 substituted characters wrong.

P.S. as for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can know how far in validation it's in. When that's finished the program will be complete, I'll just have to write a stable API, documentation and unit testing for the whole thing and then it should be ready for publishing.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
What do you think Coding Enthusiast, should we try to keep Levenshtein  distance > 1? That is not an easy tasks but certainly doable.
I think you should try to avoid it if possible. It's definitely beneficial to keep all the words as distinct as possible. For example in the case above simply a bad handwriting could cause issues between letter 'c' and 't' in "cable" and "table" and having multiple one of these mistakes in a mnemonic could potentially make recovery impossible.
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).

This is a great find.
distance 2 certainly is ok because it would be too restrictive, but I didn't know if a distance of 1 would be acceptable.

Looking carefully at the https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md I found that only French is worried about Levenshtein distance

Quote
French
10. No very similar words with 1 letter of difference.
https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md#french


This is same as Levenshtein  distance > 1.

What do you think Coding Enthusiast, should we try to keep Levenshtein  distance > 1? That is not an easy task but certainly doable.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
Interesting find. What do you think about about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and i think it'd be easier to determine the threshold.
  • The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
  • Damerau–Levenshtein adds transposition to the above 3.
  • Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
  • Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I believe the results could pretty much give the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means less similar) while Levenshtein returns 1 which is a much better indication of it being bad.

PS. You can run Jaro-Winkler algorithm here on sharplab just change the s1 and s2 in Main() method.
legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).
Other languages don't seem to be any better. Here are only some of the example words not all:
Italian first word in the list is "abaco" which has a similar one "baco" or "sino", "asino".
French seems better but it has words with distance=2 like "apaiser", "abaisser"
Spanish has "bono", "abono" and "abrazo", "brazo"
Czech has words with distance=2 like "abeceda", "beseda" and "adresa", "agrese"
Japanese has "あいさつ", "かいさつ" and "あきる", "あける"
Korean has "가격", "간격"
Chinese results don't make much sense but since the last 3 are complicated languages I'm not sure if the Levenshtein distance is even valid for them.
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
@ETFBitcoin and @NotATether

I wanna share this with you both. I did it thanks to your suggestions.

Following ETFBitcoin library suggestion (which is pretty fast btw) I made a code that generated this matrix (we had to delete some words and now we only have 2005).



as you can see, this matrix shows levenshtein distance of all 2005 words with each other in a loop. As 0 compared to 0 is the same word, it is distance is 0. As 1 compared to 1 is also 0, you can see a diagonal line comparing the same words as zero until the end of the last row.

I was able to identify all the values where distance was 1 and I generated a dictionary with those coordinates in the matrix:
Code:
{1: 164, 4: 182, 16: 1521, 23: 516, 31: 567, 32: 33, 33: 32, 35: 67, 39: 677, 51: 1305, 57: 126, 60: 51, 67: 35, 75: 78, 76: 1261, 78: 75, 83: 1655, 103: 107, 104: 140, 105: 106, 106: 105, 107: 103, 117: 1376, 126: 57, 128: 690, 140: 1928, 148: 176, 158: 178, 161: 1767, 164: 1, 166: 1910, 169: 1914, 175: 181, 176: 148, 178: 158, 181: 175, 182: 4, 183: 1019, 187: 221, 188: 205, 190: 697, 194: 234, 195: 1681, 200: 708, 205: 188, 220: 730, 221: 187, 228: 247, 231: 236, 233: 1610, 234: 194, 235: 236, 236: 235, 238: 228, 244: 750, 245: 1617, 247: 228, 252: 255, 254: 1869, 255: 252, 266: 471, 270: 292, 272: 1237, 274: 672, 280: 1102, 281: 1642, 283: 678, 284: 286, 286: 1528, 287: 280, 292: 270, 313: 1135, 314: 318, 315: 1538, 317: 1373, 318: 347, 321: 1014, 329: 1384, 338: 1928, 340: 341, 341: 340, 343: 703, 345: 348, 346: 621, 347: 318, 348: 345, 371: 886, 382: 1598, 395: 408, 399: 976, 402: 737, 403: 405, 405: 403, 408: 1202, 426: 446, 427: 1833, 432: 438, 433: 1214, 438: 432, 439: 433, 443: 1844, 446: 426, 461: 338, 466: 1497, 471: 266, 474: 795, 493: 1573, 516: 23, 517: 1590, 523: 1429, 538: 1976, 539: 550, 540: 544, 542: 538, 544: 1076, 549: 1082, 550: 539, 554: 1279, 557: 857, 567: 31, 568: 726, 613: 659, 619: 1585, 621: 346, 627: 635, 635: 627, 659: 1947, 672: 274, 677: 39, 678: 283, 687: 705, 690: 128, 691: 725, 694: 1027, 696: 1677, 697: 703, 699: 1394, 702: 710, 703: 705, 705: 703, 708: 760, 710: 1811, 715: 1957, 724: 715, 725: 691, 726: 568, 730: 220, 733: 1095, 735: 1073, 737: 402, 740: 1451, 742: 1207, 746: 1214, 747: 1838, 748: 742, 750: 244, 758: 761, 759: 1573, 760: 708, 761: 1862, 764: 759, 765: 1615, 768: 769, 769: 768, 770: 773, 771: 991, 773: 770, 782: 51, 785: 1002, 790: 792, 792: 790, 794: 1923, 795: 785, 798: 802, 801: 1681, 802: 798, 803: 1799, 825: 1196, 827: 1450, 834: 1614, 839: 1851, 842: 1468, 846: 1321, 853: 1878, 857: 557, 859: 1090, 882: 1458, 886: 371, 896: 1092, 912: 915, 915: 912, 919: 945, 945: 919, 955: 1717, 958: 971, 961: 1361, 969: 1381, 971: 958, 973: 1393, 976: 399, 978: 1073, 979: 978, 987: 989, 989: 987, 991: 771, 993: 1334, 997: 1102, 1002: 785, 1010: 997, 1014: 321, 1015: 1025, 1016: 1143, 1018: 1379, 1019: 183, 1025: 1015, 1027: 694, 1028: 1053, 1031: 1155, 1036: 1801, 1038: 1573, 1042: 1414, 1044: 1038, 1046: 1042, 1048: 1065, 1051: 1055, 1053: 1028, 1054: 1070, 1055: 1051, 1057: 1307, 1058: 1962, 1063: 1069, 1065: 1066, 1066: 1065, 1069: 1600, 1070: 1054, 1073: 978, 1074: 1716, 1075: 1829, 1076: 1074, 1077: 1205, 1082: 549, 1087: 1738, 1090: 1094, 1092: 896, 1094: 1090, 1095: 733, 1098: 1112, 1102: 997, 1112: 1113, 1113: 1112, 1122: 1654, 1131: 1166, 1134: 1920, 1135: 313, 1143: 1016, 1147: 1381, 1154: 1928, 1155: 1031, 1166: 1131, 1169: 1154, 1177: 1184, 1184: 1177, 1195: 1607, 1196: 1200, 1200: 1196, 1202: 408, 1203: 1305, 1205: 1207, 1207: 1205, 1214: 746, 1221: 1147, 1225: 1624, 1231: 1225, 1237: 1248, 1248: 1381, 1252: 1385, 1261: 76, 1268: 1277, 1272: 1332, 1277: 1268, 1279: 554, 1305: 1203, 1307: 1057, 1321: 1339, 1332: 1272, 1334: 1844, 1339: 1321, 1346: 1360, 1358: 1775, 1360: 1346, 1361: 961, 1367: 1381, 1373: 317, 1374: 1413, 1376: 117, 1379: 1461, 1381: 1468, 1384: 329, 1385: 1252, 1392: 1395, 1393: 1413, 1394: 699, 1395: 1392, 1398: 1931, 1403: 1686, 1405: 1573, 1409: 1413, 1413: 1409, 1414: 1415, 1415: 1414, 1429: 523, 1448: 2001, 1450: 827, 1451: 1452, 1452: 1451, 1454: 1829, 1456: 1458, 1457: 1462, 1458: 1456, 1461: 1379, 1462: 1620, 1464: 1462, 1468: 1477, 1470: 1468, 1473: 1478, 1477: 1585, 1478: 1473, 1485: 1491, 1491: 1485, 1495: 1501, 1497: 466, 1501: 1495, 1519: 1527, 1521: 16, 1523: 1908, 1524: 1767, 1525: 1541, 1527: 1519, 1528: 286, 1529: 1771, 1538: 1541, 1541: 1538, 1548: 1568, 1551: 1574, 1556: 1559, 1559: 1673, 1564: 1574, 1568: 1548, 1571: 1801, 1573: 1405, 1574: 1564, 1585: 1811, 1590: 1600, 1596: 1600, 1598: 382, 1600: 1596, 1607: 1195, 1608: 1980, 1610: 233, 1612: 1829, 1614: 834, 1615: 765, 1617: 245, 1620: 1462, 1624: 1225, 1631: 1644, 1635: 1905, 1637: 1638, 1638: 1637, 1642: 281, 1644: 1631, 1654: 1122, 1655: 83, 1659: 1662, 1662: 1659, 1667: 1681, 1668: 1713, 1673: 1559, 1677: 1681, 1679: 1795, 1681: 1700, 1686: 1403, 1687: 1804, 1695: 1946, 1700: 1681, 1713: 1668, 1715: 1730, 1716: 1721, 1717: 1736, 1721: 1716, 1727: 1730, 1730: 1727, 1736: 1717, 1738: 1754, 1749: 1738, 1754: 1738, 1767: 1524, 1768: 1775, 1769: 1910, 1771: 1769, 1775: 1768, 1788: 1795, 1790: 1798, 1791: 1799, 1793: 1928, 1795: 1788, 1798: 1790, 1799: 1791, 1801: 1571, 1804: 1687, 1811: 1585, 1822: 1851, 1824: 1872, 1829: 1612, 1833: 1824, 1834: 1840, 1838: 747, 1840: 1834, 1844: 1334, 1851: 1822, 1859: 1872, 1862: 1859, 1869: 254, 1872: 1859, 1878: 853, 1884: 1888, 1885: 1888, 1888: 1885, 1905: 1925, 1906: 1991, 1907: 1949, 1908: 1924, 1909: 1912, 1910: 1914, 1912: 1909, 1914: 1910, 1920: 1966, 1923: 1972, 1924: 1908, 1925: 1905, 1928: 1793, 1931: 1398, 1932: 1933, 1933: 1932, 1946: 1695, 1947: 659, 1949: 1966, 1953: 1970, 1955: 1970, 1957: 1959, 1959: 1957, 1962: 1058, 1966: 1949, 1970: 1955, 1972: 1923, 1976: 538, 1977: 2000, 1980: 1984, 1984: 1980, 1985: 1986, 1986: 1985, 1991: 1906, 2000: 1977, 2001: 1448}

Now it was easy. With the coordinates in the matrix, I just generated an array with all collided pairs:
Code:
['abaixo - baixo',
 'abater - bater',
 'achar - rachar',
 'adiante - diante',
 'afetivo - efetivo',
 'aflito - afoito',
 'afoito - aflito',
 'agora - amora',
 'agulha - fagulha',
 'alho - olho',
 'altitude - atitude',
 'alvo - alho',
 'amora - agora',
 'anel - anil',
 'anexo - nexo',
 'anil - anel',
 'anta - santa',
 'arca - arma',
 'areia - aveia',
 'argila - argola',
 'argola - argila',
 'arma - arca',
 'assado - passado',
 'atitude - altitude',
 'ator - fator',
 'aveia - veia',
 'babado - barbado',
 'bagulho - barulho',
 'bainha - tainha',
 'baixo - abaixo',
 'bala - vala',
 'balsa - valsa',
 'barata - batata',
 'barbado - babado',
 'barulho - bagulho',
 'batata - barata',
 'bater - abater',
 'batido - latido',
 'beato - boato',
 'beco - bico',
 'beira - feira',
 'beliche - boliche',
 'belo - selo',
 'besta - festa',
 'bico - beco',
 'bloco - floco',
 'boato - beato',
 'bode - boxe',
 'boldo - bolso',
 'bolha - rolha',
 'boliche - beliche',
 'bolo - bolso',
 'bolso - bolo',
 'bonde - bode',
 'bossa - fossa',
 'botina - rotina',
 'boxe - bode',
 'briga - brita',
 'brincar - trincar',
 'brita - briga',
 'busto - custo',
 'cabelo - camelo',
 'cabo - nabo',
 'cabuloso - fabuloso',
 'cadeira - madeira',
 'caibro - saibro',
 'caixa - faixa',
 'cajado - calado',
 'calado - ralado',
 'caldeira - cadeira',
 'camelo - cabelo',
 'carinho - marinho',
 'carneiro - carteiro',
 'caro - raro',
 'carreira - parreira',
 'carteiro - certeiro',
 'casca - lasca',
 'causar - pausar',
 'ceia - veia',
 'cenoura - censura',
 'censura - cenoura',
 'cera - fera',
 'cereja - cerveja',
 'cerrado - errado',
 'certeiro - carteiro',
 'cerveja - cereja',
 'cidade - idade',
 'cisco - risco',
 'coceira - coleira',
 'coelho - joelho',
 'coice - foice',
 'coifa - coisa',
 'coisa - coifa',
 'coleira - moleira',
 'copeiro - coveiro',
 'copo - topo',
 'corja - coruja',
 'corno - morno',
 'coruja - corja',
 'corvo - corno',
 'couro - touro',
 'coveiro - copeiro',
 'cuia - ceia',
 'cunhado - punhado',
 'custo - busto',
 'data - gata',
 'dente - rente',
 'diante - adiante',
 'dica - rica',
 'dinheiro - pinheiro',
 'doador - voador',
 'dobrado - dourado',
 'doca - dona',
 'domador - doador',
 'dona - lona',
 'dotado - lotado',
 'dourado - dobrado',
 'dublado - nublado',
 'dueto - gueto',
 'efetivo - afetivo',
 'eixo - fixo',
 'enxame - exame',
 'ereto - reto',
 'errado - cerrado',
 'escola - esmola',
 'esmola - escola',
 'exame - vexame',
 'fabuloso - cabuloso',
 'fagulha - agulha',
 'faixa - caixa',
 'farpa - ferpa',
 'fator - ator',
 'favela - fivela',
 'febre - lebre',
 'feio - seio',
 'feira - fera',
 'feixe - peixe',
 'feno - feto',
 'fera - ferpa',
 'ferpa - fera',
 'festa - fresta',
 'feto - teto',
 'figa - viga',
 'fita - figa',
 'fivela - favela',
 'fixo - eixo',
 'floco - bloco',
 'fluxo - luxo',
 'fogo - logo',
 'foice - coice',
 'folia - polia',
 'fonte - monte',
 'forno - morno',
 'forrar - torrar',
 'forte - fonte',
 'fossa - bossa',
 'freio - frevo',
 'frente - rente',
 'fresta - festa',
 'frevo - trevo',
 'fronte - frente',
 'frota - rota',
 'fundo - fungo',
 'fungo - fundo',
 'funil - fuzil',
 'furado - jurado',
 'fuzil - funil',
 'galho - alho',
 'gama - lama',
 'garoupa - garupa',
 'garupa - garoupa',
 'gasto - vasto',
 'gata - gama',
 'geada - gemada',
 'gelo - selo',
 'gemada - geada',
 'gemido - temido',
 'goela - moela',
 'goleiro - poleiro',
 'gosto - rosto',
 'gralha - tralha',
 'grato - prato',
 'grelha - orelha',
 'gruta - truta',
 'gueto - dueto',
 'gula - lula',
 'horta - porta',
 'idade - cidade',
 'ilustre - lustre',
 'incolor - indolor',
 'indolor - incolor',
 'inferno - inverno',
 'inverno - inferno',
 'isolado - solado',
 'jaca - jeca',
 'janela - panela',
 'jato - pato',
 'jeca - jaca',
 'jeito - peito',
 'joelho - coelho',
 'jogo - logo',
 'joio - jogo',
 'julho - junho',
 'junho - julho',
 'jurado - furado',
 'juro - ouro',
 'ladeira - madeira',
 'lama - gama',
 'lareira - ladeira',
 'lasca - casca',
 'laser - lazer',
 'lastro - mastro',
 'latente - patente',
 'latido - batido',
 'lazer - laser',
 'lebre - febre',
 'legado - ligado',
 'leigo - meigo',
 'lenda - tenda',
 'lente - rente',
 'lesado - pesado',
 'leste - lente',
 'levado - lesado',
 'liberal - literal',
 'licitar - limitar',
 'ligado - legado',
 'ligeiro - lixeiro',
 'limitar - licitar',
 'limpo - olimpo',
 'linda - vinda',
 'lisa - lixa',
 'literal - litoral',
 'litoral - literal',
 'lixa - rixa',
 'lixeiro - ligeiro',
 'logo - jogo',
 'loja - soja',
 'lombo - tombo',
 'lona - loja',
 'longe - monge',
 'lotado - dotado',
 'luar - suar',
 'lula - luva',
 'lustre - ilustre',
 'luva - lula',
 'luxo - fluxo',
 'machado - malhado',
 'madeira - ladeira',
 'malhado - malvado',
 'malvado - malhado',
 'mangue - sangue',
 'marcador - mercador',
 'margem - vargem',
 'marinho - carinho',
 'mastro - lastro',
 'mato - pato',
 'meia - veia',
 'meigo - leigo',
 'mercador - marcador',
 'mesa - meia',
 'miado - mimado',
 'mimado - miado',
 'moedor - roedor',
 'moela - mola',
 'mola - moela',
 'moleira - coleira',
 'molho - olho',
 'monge - monte',
 'monte - monge',
 'morno - forno',
 'moto - mato',
 'mugido - rugido',
 'munido - mugido',
 'nabo - nato',
 'nato - pato',
 'navio - pavio',
 'nexo - anexo',
 'noivo - novo',
 'nosso - osso',
 'novo - noivo',
 'nublado - dublado',
 'olho - molho',
 'olimpo - limpo',
 'orelha - ovelha',
 'osso - nosso',
 'ouro - touro',
 'ovelha - orelha',
 'padeiro - pandeiro',
 'pampa - tampa',
 'pandeiro - padeiro',
 'panela - janela',
 'papo - pato',
 'parreira - carreira',
 'parto - perto',
 'passado - assado',
 'patente - potente',
 'pato - prato',
 'pausar - causar',
 'pavio - navio',
 'pegada - pelada',
 'peito - perto',
 'peixe - feixe',
 'pelada - pegada',
 'peludo - veludo',
 'penhor - senhor',
 'pente - rente',
 'perito - perto',
 'perto - perito',
 'pesado - pescado',
 'pescado - pesado',
 'pinheiro - dinheiro',
 'poeira - zoeira',
 'poleiro - goleiro',
 'polia - polpa',
 'polpa - polia',
 'pombo - tombo',
 'ponta - porta',
 'porco - pouco',
 'porta - ponta',
 'potente - patente',
 'pouco - rouco',
 'pouso - pouco',
 'prato - preto',
 'prazo - prato',
 'pregar - prezar',
 'preto - reto',
 'prezar - pregar',
 'profeta - proveta',
 'proveta - profeta',
 'pular - puxar',
 'punhado - cunhado',
 'puxar - pular',
 'rabada - rajada',
 'rachar - achar',
 'raiar - vaiar',
 'rainha - tainha',
 'raio - raso',
 'rajada - rabada',
 'ralado - calado',
 'ralo - talo',
 'raro - raso',
 'raso - raro',
 'reator - reitor',
 'recente - repente',
 'redator - redutor',
 'redutor - sedutor',
 'regente - repente',
 'reitor - reator',
 'renda - tenda',
 'rente - pente',
 'repente - regente',
 'reto - teto',
 'rica - rixa',
 'ripa - rixa',
 'risco - cisco',
 'rixa - ripa',
 'roedor - moedor',
 'rolante - volante',
 'rolha - bolha',
 'rombo - tombo',
 'rosto - gosto',
 'rota - frota',
 'rotina - botina',
 'rouco - pouco',
 'rugido - mugido',
 'sacada - salada',
 'sadio - vadio',
 'safira - safra',
 'safra - safira',
 'saibro - caibro',
 'salada - sacada',
 'sangue - mangue',
 'santa - anta',
 'sarda - sarna',
 'sarna - sarda',
 'sebo - selo',
 'secar - socar',
 'sedutor - redutor',
 'seio - selo',
 'selar - telar',
 'selo - silo',
 'senhor - penhor',
 'sentar - tentar',
 'setor - vetor',
 'silo - selo',
 'socar - secar',
 'sogro - soro',
 'soja - soma',
 'solado - sovado',
 'soma - soja',
 'sono - soro',
 'soro - sono',
 'sovado - solado',
 'suar - suor',
 'sujar - suar',
 'suor - suar',
 'tainha - rainha',
 'taipa - tampa',
 'tala - vala',
 'talo - tala',
 'tampa - taipa',
 'tear - telar',
 'tecer - temer',
 'tecido - temido',
 'teia - veia',
 'telar - tear',
 'temer - tecer',
 'temido - tecido',
 'tenda - renda',
 'tentar - sentar',
 'teto - reto',
 'toalha - tralha',
 'toco - troco',
 'tombo - rombo',
 'topo - toco',
 'tora - tosa',
 'torrar - forrar',
 'tosa - tora',
 'touro - ouro',
 'tralha - toalha',
 'treco - troco',
 'trevo - treco',
 'trincar - brincar',
 'troco - treco',
 'truta - gruta',
 'turbo - turvo',
 'turco - turvo',
 'turvo - turco',
 'vadio - vazio',
 'vaga - zaga',
 'vagem - viagem',
 'vaiar - vazar',
 'vaidade - validade',
 'vala - valsa',
 'validade - vaidade',
 'valsa - vala',
 'vargem - virgem',
 'vasto - visto',
 'vazar - vaiar',
 'vazio - vadio',
 'veia - teia',
 'veludo - peludo',
 'vencedor - vendedor',
 'vendedor - vencedor',
 'vetor - setor',
 'vexame - exame',
 'viagem - virgem',
 'videira - viseira',
 'vieira - viseira',
 'viga - vigia',
 'vigia - viga',
 'vinda - linda',
 'virgem - viagem',
 'viseira - vieira',
 'visto - vasto',
 'voador - doador',
 'voar - zoar',
 'volante - votante',
 'votante - volante',
 'vulgo - vulto',
 'vulto - vulgo',
 'zaga - vaga',
 'zoar - voar',
 'zoeira - poeira']

The results are not as bad as they look like. They are doubled, because just as like "abaixo" is 1 distance from "baixo", "baixo" is also 1 distance from "abaixo". So we will see everything doubled here.

I learned a lot along the way. I never thought that generating those words was going to get so complicated. We will go back to word in the local board generating more words, as I deleted a lot now.

I hope this mindset can help your library/program NotATether.

Thank you both again.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
NotATether , please send me your code when it's ready so I can check with it as well. What is the expected output's format of your program?

Good question. I haven't written that part of the program yet but I want it to look something along the lines of this:

Code:

It is meant to be human-readable output to easily find and address the problems. I'm not designing the output to be parsed by a second program, but I'll add an API to compromise for that.


Update: the backend of tests for unique initial characters and word length tests are complete, still working on levenshtein distance test
Update 2: finished the levenshtein distance test, working on the front-end and command-line argument parser
legendary
Activity: 2352
Merit: 6089
bitcoindata.science
Hello NotATether and ETFbitcoin

Thanks for your suggestions

I think I will make  a 2048x2048 matrix like this one. I think it is the best option.
https://stackoverflow.com/questions/47152344/how-to-calculate-levenshtein-ratio-distance-for-rows-in-my-column-in-python

In a matrix I can compare each 2048 words with each other.

NotATether , please send me your code when it's ready so I can check with it as well. What is the expected output's format of your program?
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
@bitmover

I am developing a python program that evaluates Latin wordlists for their word length, number of similar characters at the beginning and the Levenshtein distance between each two words. I have not finished it yet but I plan on putting the code on Github and PyPI soon. Hopefully it won't stall like some of my previous projects have.

If you are still looking for or writing a script for your task, keep doing that since I don't have an ETA for when it will be done, though my program is relatively simple and not complicated to code. I hope I can finish it within a few days.

The idea is that it can be used by future groups making wordlists to validate their words. so it will still be useful even if your group has finished checking the Portuguese list.
legendary
Activity: 2870
Merit: 7490
Crypto Swap Exchange
Good initiative, i hope it will be accepted quickly Smiley

I will try to make or find a Levenshtein distance script in python to check our list.

I would recommend Jellyfish library (https://pypi.org/project/jellyfish/) which have better performance (since it's wrapper of C library).
Pages:
Jump to: