RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection
He Zhu, Yanshu Li, Wen Liu, Haitian Yang

TL;DR
RTD-Guard is a black-box textual adversarial detection framework that uses a pre-trained discriminator to identify and mask suspicious tokens, effectively detecting adversarial examples with minimal queries and no model fine-tuning.
Contribution
This paper introduces RTD-Guard, a black-box detection method that uses a pre-trained Replaced Token Detection (RTD) discriminator as-is. It requires neither fine-tuning nor adversarial data, which makes adversarial detection practical and resource-efficient.
Findings
Outperforms existing detection methods on multiple benchmarks.
Requires only two black-box queries per detection.
Effectively detects diverse state-of-the-art adversarial attacks.
Abstract
Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects…
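The pipeline described in the abstract (score tokens with an RTD discriminator, mask the suspicious ones, query the victim model) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `rtd_guard_sketch`, `toy_rtd_scores`, `toy_victim`, the thresholds, and the `[MASK]` placeholder do not come from the paper. The abstract is cut off before it states the decision rule, so the rule used below (flag the input when the victim's confidence in its original prediction drops sharply after masking) is an assumption chosen only to match the stated budget of two black-box queries. The toy scorer and toy victim stand in for a real pre-trained discriminator and a real classifier.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rtd_guard_sketch(
    tokens: Sequence[str],
    rtd_scores: Callable[[Sequence[str]], List[float]],
    victim: Callable[[str], List[float]],
    mask_token: str = "[MASK]",      # assumed placeholder string
    score_threshold: float = 0.5,    # assumed cut-off for "suspicious"
    shift_threshold: float = 0.3,    # assumed cut-off for the confidence drop
) -> Tuple[bool, float, List[str]]:
    """Return (flagged, confidence_shift, masked_tokens).

    The decision rule is an assumption, not the paper's stated criterion.
    """
    # Step 1: per-token "was this token replaced?" scores (no victim query).
    scores = rtd_scores(tokens)
    # Step 2: mask every token whose score reaches the threshold.
    masked = [mask_token if s >= score_threshold else t
              for t, s in zip(tokens, scores)]
    # Step 3: exactly two black-box queries to the victim model.
    p_orig = victim(" ".join(tokens))
    p_mask = victim(" ".join(masked))
    # Step 4: compare confidence on the originally predicted label.
    label = max(range(len(p_orig)), key=lambda i: p_orig[i])
    shift = p_orig[label] - p_mask[label]
    return shift > shift_threshold, shift, masked


# --- Toy stand-ins so the sketch runs without any model download ---

COMMON = {"the", "film", "was", "good", "bad", "and"}
WEIGHTS = {"good": 1.0, "bad": -1.0, "stellar": 3.0}


def toy_rtd_scores(tokens: Sequence[str]) -> List[float]:
    # Pretend discriminator: words outside a tiny vocabulary look "replaced".
    return [0.1 if t in COMMON else 0.9 for t in tokens]


def toy_victim(text: str) -> List[float]:
    # Pretend sentiment classifier returning [p_negative, p_positive].
    score = sum(WEIGHTS.get(w, 0.0) for w in text.split())
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


if __name__ == "__main__":
    adversarial = ["the", "film", "was", "bad", "and", "stellar"]
    clean = ["the", "film", "was", "good"]
    print(rtd_guard_sketch(adversarial, toy_rtd_scores, toy_victim))
    print(rtd_guard_sketch(clean, toy_rtd_scores, toy_victim))
```

In a realistic setting, `rtd_scores` would wrap an off-the-shelf RTD discriminator and `victim` would wrap the deployed classifier's prediction API; both thresholds would need to be tuned on held-out data.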
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Graph Neural Networks
