BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks

Hanyong Lee; Chaelyn Lee; Yongjae Lee; Jaesung Lee

arXiv:2502.05225·cs.CR·February 11, 2025

TL;DR

The paper introduces BitAbuse, a large dataset of real-world visually perturbed phishing texts, to improve language model robustness against adversarial attacks, demonstrating significant performance gains over previous synthetic datasets.

Contribution

It provides the first large-scale dataset of real-world visually perturbed phishing texts, enabling more effective training of models to defend against adversarial attacks.

Findings

01. Language models trained on BitAbuse achieved ~96% accuracy.

02. There is a significant performance gap between real-world and synthetic datasets.

03. The dataset enhances model robustness against visually perturbed phishing texts.

Abstract

Phishing often targets victims through visually perturbed texts to bypass security systems. The noise contained in these texts functions as an adversarial attack, designed to deceive language models and hinder their ability to accurately interpret the content. However, since it is difficult to obtain sufficient phishing cases, previous studies have used synthetic datasets that do not contain real-world cases. In this study, we propose the BitAbuse dataset, which includes real-world phishing cases, to address the limitations of previous research. Our dataset comprises a total of 325,580 visually perturbed texts. The dataset inputs are drawn from the raw corpus, consisting of visually perturbed sentences and sentences generated through an artificial perturbation process. Each input sentence is labeled with its corresponding ground truth, representing the restored, non-perturbed version.…
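To make the input/label structure described in the abstract concrete, here is a minimal Python sketch of how an artificially perturbed sentence could be paired with its non-perturbed ground truth. It is an illustration under stated assumptions, not the authors' perturbation process: the `HOMOGLYPHS` table, the `perturb`, `make_pair`, and `restore` functions, the `rate` parameter, and the `"input"`/`"label"` field names are all hypothetical and do not come from the paper or its repository.

```python
import random

# Hypothetical look-alike table (an assumption for illustration only);
# the characters actually used in BitAbuse are not listed in this summary.
HOMOGLYPHS = {
    "a": ["@", "\u0430"],  # "@" and Cyrillic small a
    "e": ["3", "\u0435"],  # "3" and Cyrillic small ie
    "i": ["1", "\u00ed"],  # "1" and Latin small i with acute
    "o": ["0", "\u043e"],  # "0" and Cyrillic small o
    "s": ["$", "\u0455"],  # "$" and Cyrillic small dze
}

# Reverse lookup used by the naive restoration baseline below.
REVERSE = {fake: real for real, fakes in HOMOGLYPHS.items() for fake in fakes}


def perturb(text, rate, rng):
    """Swap each eligible lowercase character for a look-alike with probability `rate`."""
    out = []
    for ch in text:
        options = HOMOGLYPHS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def make_pair(clean, rate=0.3, seed=0):
    """Build one example: perturbed input plus its non-perturbed ground truth."""
    rng = random.Random(seed)
    return {"input": perturb(clean, rate, rng), "label": clean}


def restore(text):
    """Naive baseline: map every known look-alike back to its plain character."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    pair = make_pair("please verify your account password", rate=0.5, seed=42)
    print(pair["input"])
    print(restore(pair["input"]))
```

The lookup-table `restore` only inverts substitutions it already knows about, and it would also corrupt legitimate digits and symbols such as "1" or "$". Real-world perturbations are not limited to a fixed table, which is one way to see why a dataset containing real phishing cases matters for training restoration models.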

Code & Models

Repositories

CAU-AutoML/Bitabuse (Official)

Videos

BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks · Underline

Taxonomy

Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection