TL;DR
This paper presents a deep learning approach that uses weakly labeled data to identify homoglyphs (visually similar characters). It reaches an average precision of 0.97 on pairwise identification and groups homoglyphs into equivalence classes for security applications.
Contribution
The authors develop a deep learning model that leverages weak labels to identify and cluster homoglyphs. It outperforms the Normalized Compression Distance baseline and predicts previously unknown homoglyphs.
Findings
Achieved an average precision of 0.97 on pairwise homoglyph identification.
Developed a clustering method that achieves 0.592 mBIOU.
Predicted over 8,000 previously unknown homoglyphs.
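For readers unfamiliar with the headline metric, average precision summarizes how well a ranked list of candidate pairs places true homoglyph pairs ahead of non-homoglyph pairs. The sketch below is the standard ranking definition, not the authors' evaluation code; the function name `average_precision` and the toy scores and labels are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision for a ranked list.

    scores: similarity score per candidate pair (higher = more similar).
    labels: 1 if the pair is a true homoglyph pair, else 0.
    Both inputs are illustrative; the paper's evaluation data is not shown here.
    """
    # Rank candidate pairs from most to least similar.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision measured at the rank of each true positive.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy example: true pairs are ranked 1st and 3rd out of 4 candidates.
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A score of 1.0 means every true pair is ranked above every non-pair, so 0.97 indicates a ranking that is close to that ideal.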
Abstract
Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs -- particularly ones that have not been previously spotted -- and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is…
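The Normalized Compression Distance baseline named in the abstract can be sketched from its general definition: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The version below is a minimal illustration that assumes zlib as the compressor and raw byte strings as inputs. The summary does not say which compressor or character representation the paper's baseline uses, so both are assumptions, as are the function names.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    # zlib is an assumed compressor; any real-world compressor could stand in.
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Lower values mean the inputs share more compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs standing in for two character images as raw bytes.
structured = bytes(range(256)) * 4
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(1024))

same_distance = ncd(structured, structured)  # near 0: input compared with itself
diff_distance = ncd(structured, noise)       # near 1: unrelated inputs
```

Because NCD only measures shared compressible structure, it has no learned notion of visual similarity, which is the gap the embedding model is meant to close.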
