For the last couple of months, Ramin Yazdani has been looking into phishing domains that use Unicode characters to imitate a target domain. In the process he developed a new ‘confusables’ table of Unicode characters that can easily be mistaken for their ASCII counterparts. The table is based on the ‘Unicode Confusables list’ and the ‘Unicode Similarity List’.
The proposed Unicode Confusables table can be found here. The dataset is supplied as a CSV file in which the first column holds the decimal codepoint of a Unicode character. The remaining columns together form the homoglyph for that character: when a character maps to a string of several characters, the homoglyph spans multiple columns; otherwise it occupies a single column.
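To illustrate how such a file could be read, here is a minimal Python sketch. It is not Ramin's code, and it rests on a few assumptions that the description above does not confirm: that the homoglyph columns also hold decimal codepoints, that any header row can be recognised by a non-numeric first cell, and that the three sample rows (made up for this example) resemble the real data. The helper names `load_confusables` and `to_ascii_skeleton` are hypothetical.

```python
import csv
import io


def load_confusables(csv_file):
    """Build {unicode_char: homoglyph_string} from an open CSV file."""
    table = {}
    for row in csv.reader(csv_file):
        cells = [c.strip() for c in row if c.strip()]
        # Skip blank rows, incomplete rows, and a possible header row.
        if len(cells) < 2 or not cells[0].isdigit():
            continue
        source = chr(int(cells[0]))  # first column: decimal codepoint
        # Assumption: the homoglyph parts are decimal codepoints as well.
        table[source] = "".join(chr(int(c)) for c in cells[1:])
    return table


def to_ascii_skeleton(text, table):
    """Replace every character found in the table by its homoglyph."""
    return "".join(table.get(ch, ch) for ch in text)


# Made-up sample rows: Cyrillic 'а' -> 'a', Cyrillic 'е' -> 'e', 'œ' -> 'oe'.
sample = io.StringIO("1072,97\n1077,101\n339,111,101\n")
table = load_confusables(sample)
print(to_ascii_skeleton("p\u0430ypal", table))  # prints "paypal"
```

With the real file, the `io.StringIO` sample would be replaced by something like `open(path, newline="", encoding="utf-8")`. If the homoglyph columns turn out to contain literal characters rather than codepoints, the `chr(int(c))` conversion for those columns would have to be dropped.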
Additionally, Ramin used the confusables table to find domains that have an ASCII counterpart. The research is aimed at finding malicious Unicode homoglyph domains. To this end, Ramin compared his findings with entries from the following blacklists: