
Topic: [Anti-plagiarism] The full list of homographs (Read 254 times)

legendary
Activity: 2240
Merit: 3150
September 05, 2018, 03:20:27 PM
#6
It's good to have more alternatives. Good work, OP, even though the problem is already fixed; theymos will probably add the ones that are missing from nkampala's list.
legendary
Activity: 2674
Merit: 2334
September 05, 2018, 10:27:07 AM
#5
This has recently been addressed by theymos following posts by iasenko over the last few months. See here: https://bitcointalksearch.org/topic/fixedhomographs-are-fixedthank-you-theymos-again-see-my-report-4967143

Yes, iasenko's posts encouraged me to create this table. It seems that I was a little late.

As far as I can see, nkampala's list does not contain these homographs:
0x03F2 (1010): ϲ -> c
0x03F9 (1017): Ϲ -> C
0x041A (1050): К -> K
0x04C0 (1216): Ӏ -> I
0x04CF (1231): ӏ -> l
0x051A (1306): Ԛ -> Q
0x051C (1308): Ԝ -> W

So I guess that my list can also be useful.

By the way, I think that the following symbols posted by nkampala are significantly different from ASCII characters:
Code:
Ғ -> F
Լ -> L
ε -> e
ι -> i
κ -> k
յ -> j
η -> n
ρ -> p
զ -> q
τ -> t
υ -> u
ν -> v
ω -> w
χ -> x
γ -> y



Essentially, all homographs that look the same as Latin characters are retroactively auto-replaced with the Latin characters in the English boards.

As I understand it, posted homographs are still stored in the forum database, but they are replaced with ASCII characters when displayed in non-local sections. In my opinion, since homographs are only abused in the other English boards, the Meta section should be allowed to display homographs for reporting or other administrative purposes.

Anyways, I will look for other homographs in the Unicode table later.
legendary
Activity: 2268
Merit: 18771
September 03, 2018, 07:40:19 AM
#4
This has recently been addressed by theymos following posts by iasenko over the last few months. See here: https://bitcointalksearch.org/topic/fixedhomographs-are-fixedthank-you-theymos-again-see-my-report-4967143

Essentially, all homographs that look the same as Latin characters are retroactively auto-replaced with the Latin characters in the English boards.
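
The replacement described above could be sketched roughly like this. It is only an illustration in Python, not the forum's actual code: the names HOMOGRAPHS and fold_homographs are made up for this example, and the mapping covers just a handful of the code points listed in this thread.

```python
# Hypothetical subset of a homograph table: code point -> ASCII look-alike.
# A real table would need every entry from the full list.
HOMOGRAPHS = {
    0x0391: "A", 0x0395: "E", 0x039F: "O",                # Greek capitals
    0x0405: "S", 0x0410: "A", 0x0415: "E", 0x041E: "O",   # Cyrillic capitals
    0x0430: "a", 0x0435: "e", 0x043E: "o", 0x0441: "c",   # Cyrillic small letters
}


def fold_homographs(text: str) -> str:
    """Replace every mapped homograph with its ASCII look-alike."""
    # str.translate accepts a dict keyed by code point (int).
    return text.translate(HOMOGRAPHS)


# U+0405, U+0395, U+041E look like "SEO" but are not ASCII.
spoofed = "\u0405\u0395\u041e"
print(fold_homographs(spoofed) == "SEO")  # True
```

Once both texts are folded this way, a plain string comparison works again, which is presumably why replacing the characters is enough to defeat the trick.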
legendary
Activity: 2674
Merit: 2334
September 03, 2018, 07:11:27 AM
#3
Reserved.
legendary
Activity: 2674
Merit: 2334
September 03, 2018, 07:11:00 AM
#2
The list of homographs for ASCII:

#   ASCII char    Unicode number   Comment      Verdana   Arial   Sans Serif   Courier New
1   A (65)        0x0391 (913)     Greek        A A       A A     A A          A A
2   B (66)        0x0392 (914)     Greek        B B       B B     B B          B B
3   E (69)        0x0395 (917)     Greek        E E       E E     E E          E E
4   Z (90)        0x0396 (918)     Greek        Z Z       Z Z     Z Z          Z Z
5   H (72)        0x0397 (919)     Greek        H H       H H     H H          H H
6   I (73)        0x0399 (921)     Greek        I I       I I     I I          I I
7   K (75)        0x039A (922)     Greek        K K       K K     K K          K K
8   M (77)        0x039C (924)     Greek        M M       M M     M M          M M
9   N (78)        0x039D (925)     Greek        N N       N N     N N          N N
10  O (79)        0x039F (927)     Greek        O O       O O     O O          O O
11  P (80)        0x03A1 (929)     Greek        P P       P P     P P          P P
12  T (84)        0x03A4 (932)     Greek        T T       T T     T T          T T
13  Y (89)        0x03A5 (933)     Greek        Y Y       Y Y     Y Y          Y Y
14  X (88)        0x03A7 (935)     Greek        X X       X X     X X          X X
15  o (111)       0x03BF (959)     Greek        o o       o o     o o          o o
16  c (99) [4]    0x03F2 (1010)    Greek        c ϲ       c ϲ     c ϲ          c ϲ
17  j (106) [2]   0x03F3 (1011)    Greek        j ϳ       j ϳ     j ϳ          j ϳ
18  C (67) [4]    0x03F9 (1017)                 C Ϲ       C Ϲ     C Ϲ          C Ϲ
19  S (83)        0x0405 (1029)    Macedonian   S S       S S     S S          S S
20  I (73)        0x0406 (1030)                 I I       I I     I I          I I
21  J (74)        0x0408 (1032)    Macedonian   J J       J J     J J          J J
22  A (65)        0x0410 (1040)    Russian      A A       A A     A A          A A
23  B (66)        0x0412 (1042)    Russian      B B       B B     B B          B B
24  E (69)        0x0415 (1045)    Russian      E E       E E     E E          E E
25  K (75) [1]    0x041A (1050)    Russian      K К       K К     K К          K К
26  M (77)        0x041C (1052)    Russian      M M       M M     M M          M M
27  H (72)        0x041D (1053)    Russian      H H       H H     H H          H H
28  O (79)        0x041E (1054)    Russian      O O       O O     O O          O O
29  P (80)        0x0420 (1056)    Russian      P P       P P     P P          P P
30  C (67)        0x0421 (1057)    Russian      C C       C C     C C          C C
31  T (84)        0x0422 (1058)    Russian      T T       T T     T T          T T
32  X (88)        0x0425 (1061)    Russian      X X       X X     X X          X X
33  a (97)        0x0430 (1072)    Russian      a a       a a     a a          a a
34  e (101)       0x0435 (1077)    Russian      e e       e e     e e          e e
35  o (111)       0x043E (1086)    Russian      o o       o o     o o          o o
36  p (112)       0x0440 (1088)    Russian      p p       p p     p p          p p
37  c (99)        0x0441 (1089)    Russian      c c       c c     c c          c c
38  y (121) [3]   0x0443 (1091)    Russian      y y       y y     y y          y y
39  x (120)       0x0445 (1093)    Russian      x x       x x     x x          x x
40  s (115)       0x0455 (1109)    Macedonian   s s       s s     s s          s s
41  i (105)       0x0456 (1110)                 i i       i i     i i          i i
42  j (106)       0x0458 (1112)    Macedonian   j j       j j     j j          j j
43  Y (89)        0x04AE (1198)                 Y Y       Y Y     Y Y          Y Y
44  h (104)       0x04BB (1211)                 h h       h h     h h          h h
45  I (73) [2]    0x04C0 (1216)                 I Ӏ       I Ӏ     I Ӏ          I Ӏ
46  l (108) [2]   0x04CF (1231)                 l ӏ       l ӏ     l ӏ          l ӏ
47  G (71) [1]    0x050C (1292)                 G G       G G     G G          G G
48  Q (81)        0x051A (1306)                 Q Ԛ       Q Ԛ     Q Ԛ          Q Ԛ
49  q (113)       0x051B (1307)                 q q       q q     q q          q q
50  W (87)        0x051C (1308)                 W Ԝ       W Ԝ     W Ԝ          W Ԝ
51  w (119)       0x051D (1309)                 w w       w w     w w          w w

[1] almost identical in all fonts
[2] identical in all fonts except "Verdana" (v5.02)
[3] identical in all fonts except "Courier New" (v5.11)
[4] identical only in the font "Arial" (v5.06)
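
To double-check which script a code point from the table belongs to, Python's standard unicodedata module can print its official Unicode name. This is just a small sketch; the describe helper is a name invented for this example.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX (decimal): OFFICIAL UNICODE NAME' for one code point."""
    name = unicodedata.name(chr(code_point))
    return f"U+{code_point:04X} ({code_point}): {name}"


# A few code points taken from the table above.
for cp in (0x0391, 0x0410, 0x0455):
    print(describe(cp))
```

The name starts with the script (for example GREEK or CYRILLIC), so it is an easy way to verify the "Comment" column.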
legendary
Activity: 2674
Merit: 2334
September 03, 2018, 07:10:28 AM
#1
Bounty hunters in BitcoinTalk signature campaigns are usually required to write a certain number of posts per week, and participants are credited with stakes for this activity. Sometimes unscrupulous users copy messages from other members or paragraphs from external articles on the Internet and post them here on the forum. Such posts can easily be compared and tracked by SEO services, so these bounty hunters began using homographs to complicate detection.

Simply put, homographs are different characters in the international Unicode table that look the same visually. The English alphabet uses only ASCII characters.

If homographs from different languages are mixed into a text, a human reader will not notice any difference, but analysis systems will not be able to detect plagiarism by simply comparing the UTF-8 encoded texts.

For example:
  • "SEO". This word contains ASCII characters only; no homographs are used. The word length in UTF-8 is 3 bytes.
  • "ЅΕО". Here the first symbol "S" is taken from the Macedonian alphabet, the second symbol "E" from the Greek alphabet, and the third symbol "O" from the Russian alphabet. These non-English letters look the same as ASCII characters, but each of them is encoded in two bytes, so the word length in UTF-8 is 6 bytes.
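
The byte counts from the two examples above can be checked with a few lines of Python. This is a minimal sketch: the helper name utf8_len is made up for illustration, and the code points U+0405, U+0395 and U+041E are the Macedonian, Greek and Russian letters from the list of homographs in this thread.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


plain = "SEO"                    # ASCII only
spoofed = "\u0405\u0395\u041e"   # look-alikes: Cyrillic, Greek, Cyrillic

print(utf8_len(plain))     # 3
print(utf8_len(spoofed))   # 6
print(plain == spoofed)    # False: a naive comparison sees different words
```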

This is how some members write posts on the forum: they simply copy other people's texts and disguise them with homographs. Therefore I decided to compile a full list of the homographs that can be used in English texts.



According to the HTML code, the forum uses the following CSS style:
Code:
style="font-family: Verdana, Arial, sans-serif;"
Thus, messages use three fonts: "Verdana", "Arial" and "Sans Serif". In addition, "Courier New" is used for monospaced text.

The table shows each ASCII character next to its homograph, rendered in all four of these fonts. See my next post.