RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

He Zhu; Yanshu Li; Wen Liu; Haitian Yang

arXiv:2603.12582·cs.CL·March 16, 2026

TL;DR

RTD-Guard is a black-box framework for detecting textual adversarial examples. It uses a pre-trained discriminator to identify and mask suspicious tokens, and it needs only two black-box queries per detection and no model fine-tuning.

Contribution

This paper introduces RTD-Guard, a black-box detection method that uses a pre-trained Replaced Token Detection (RTD) discriminator as-is, with no fine-tuning and no adversarial data, making adversarial detection practical and resource-efficient.

Findings

01 Outperforms existing detection methods on multiple benchmarks.

02 Requires only two black-box queries per detection.

03 Effectively detects diverse state-of-the-art adversarial attacks.

Abstract

Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects…
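The localize, mask, and query steps described above can be illustrated with a minimal sketch. The abstract is cut off before it states the detection rule, so the decision step used here (flag the input when the victim's confidence in its original label drops by more than a threshold after masking) is an assumption, not the authors' stated criterion. The names `mask_suspicious`, `detect`, `toy_rtd_scores`, `CountingVictim`, and both thresholds are hypothetical; the toy scorer and toy classifier stand in for a real RTD discriminator and a real victim model.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"  # assumed mask symbol; the source does not specify one


def mask_suspicious(tokens: Sequence[str],
                    replaced_probs: Sequence[float],
                    threshold: float = 0.5) -> List[str]:
    """Mask every token whose 'replaced' probability reaches the threshold."""
    return [MASK if p >= threshold else t
            for t, p in zip(tokens, replaced_probs)]


def detect(tokens: Sequence[str],
           rtd_scores: Callable[[Sequence[str]], List[float]],
           victim: Callable[[Sequence[str]], List[float]],
           threshold: float = 0.5,
           shift_threshold: float = 0.3) -> Tuple[bool, float]:
    """Return (is_adversarial, confidence_shift) using two victim queries.

    ASSUMPTION: the decision rule (confidence drop on the originally
    predicted label) is illustrative; the source text is truncated
    before it describes the actual rule.
    """
    masked = mask_suspicious(tokens, rtd_scores(tokens), threshold)
    p_orig = victim(tokens)    # black-box query 1: original input
    p_masked = victim(masked)  # black-box query 2: masked input
    label = max(range(len(p_orig)), key=p_orig.__getitem__)
    shift = p_orig[label] - p_masked[label]
    return shift > shift_threshold, shift


# --- Hypothetical toy stand-ins, for illustration only ---
# "tremendous" is deliberately mis-weighted to imitate a model weakness.
TOY_WEIGHTS = {"great": 2.0, "awful": -2.0, "tremendous": -3.0}


def toy_rtd_scores(tokens: Sequence[str]) -> List[float]:
    """Pretend discriminator: treats 'tremendous' as a likely replacement."""
    return [0.9 if t == "tremendous" else 0.1 for t in tokens]


class CountingVictim:
    """Toy sentiment classifier that counts how often it is queried."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, tokens: Sequence[str]) -> List[float]:
        self.calls += 1
        score = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
        p_pos = 1.0 / (1.0 + math.exp(-score))
        return [1.0 - p_pos, p_pos]  # [P(negative), P(positive)]


victim = CountingVictim()
clean = "a great film with great acting".split()
attacked = "a tremendous film with great acting".split()
print("clean:", detect(clean, toy_rtd_scores, victim))
print("attacked:", detect(attacked, toy_rtd_scores, victim))
print("victim queries used:", victim.calls)
```

In this toy setup each call to `detect` queries the victim exactly twice, matching the query budget listed under Findings. In a real pipeline, the per-token scores would presumably come from a pre-trained RTD discriminator such as an ELECTRA-style model, which is also an assumption, since the text shown here does not name the discriminator.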

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Graph Neural Networks