Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with Weakly Labeled Data

Perry Deng; Cooper Linsky; Matthew Wright

arXiv:2010.04382·cs.CR·December 23, 2020

TL;DR

This paper presents a deep learning approach that uses weakly labeled data to identify homoglyphs (visually similar characters). It reaches an average precision of 0.97 on pairwise identification and clusters homoglyphs into equivalence classes for security applications.

Contribution

The authors develop a deep learning model that leverages weak labels to identify and cluster homoglyphs. It outperforms the Normalized Compression Distance approach on pairwise identification and predicts previously unknown homoglyphs.

Findings

01

Achieved an average precision of 0.97 on pairwise homoglyph identification.

02

Developed a clustering method that scores 0.592 mBIOU.

03

Predicted over 8,000 previously unknown homoglyphs.
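
For readers unfamiliar with the metric in the first finding, here is a minimal sketch of how average precision is commonly computed for a ranked list of character pairs. The function name, the toy scores, and the toy labels are hypothetical and are not taken from the paper, and the paper may compute the metric with a different variant or library.

```python
def average_precision(scores, labels):
    """Mean of precision@k taken at each rank k where a true pair appears.

    scores: similarity score per candidate pair (higher = more similar).
    labels: 1 if the pair is a true homoglyph pair, 0 otherwise.
    """
    # Rank candidate pairs from most to least similar.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


# Hypothetical toy data: four candidate pairs, two of them true homoglyphs.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
ap = average_precision(toy_scores, toy_labels)
```

A score of 1.0 means every true homoglyph pair is ranked above every non-homoglyph pair.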

Abstract

Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs -- particularly ones that have not been previously spotted -- and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is…
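
The general idea in the abstract (score visual similarity from learned embeddings, then group characters into equivalence classes) can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity scoring, the 0.95 threshold, the union-find grouping, and the three-dimensional toy vectors are all assumptions made for illustration. In practice the embeddings would come from a trained model applied to rendered glyph images.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_homoglyphs(embeddings, threshold=0.95):
    """Group characters whose pairwise similarity meets the threshold.

    Uses union-find, so the result is the transitive closure of the
    pairwise "looks alike" relation (an assumed clustering rule).
    """
    chars = list(embeddings)
    parent = {c: c for c in chars}

    def find(c):
        # Follow parent links to the cluster root, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(chars, 2):
        if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for c in chars:
        clusters.setdefault(find(c), set()).add(c)
    return list(clusters.values())


# Hypothetical embeddings; real ones would be model outputs.
toy_embeddings = {
    "a": [1.0, 0.0, 0.1],         # Latin small letter a
    "\u0430": [0.99, 0.02, 0.1],  # Cyrillic small letter a
    "o": [0.0, 1.0, 0.0],         # Latin small letter o
    "\u03bf": [0.01, 0.98, 0.0],  # Greek small letter omicron
    "x": [0.0, 0.1, 1.0],         # Latin small letter x
}
clusters = cluster_homoglyphs(toy_embeddings)
```

Storing one cluster label per character is cheaper to deploy than storing a score for every pair, at the cost of forcing the similarity relation to be transitive.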

Code & Models

Repositories

PerryXDeng/weaponizing_unicode
TensorFlow · Official