TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification
Lingfeng Shen, Ze Zhang, Haiyun Jiang, Ying Chen

TL;DR
TextShield introduces a saliency-based detection and correction framework for adversarial sentences in NLP, significantly improving detection accuracy and enabling correction to benign text, thus advancing defense methods beyond mere detection.
Contribution
It proposes a novel saliency-based detector and corrector that extend detection-only defenses to a detection-correction paradigm, filling a key gap in adversarial NLP defenses.
Findings
Outperforms state-of-the-art detection methods across various attacks.
Achieves higher or comparable detection performance on multiple benchmarks.
Effectively converts adversarial sentences into benign ones using saliency-based correction.
Abstract
Adversarial attacks are a major challenge for neural network models in NLP and preclude their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they cannot give correct predictions on adversarial sentences. To solve this issue, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and we propose a saliency-based detector that can effectively detect whether an input sentence is adversarial. (2) We design a saliency-based corrector, which converts the detected adversarial sentences to benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to…
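The detect-then-correct idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words weights, the spurious token "zxq", the leave-one-out saliency, the fixed threshold, and the delete-the-token correction are all assumptions chosen to keep the example small, and a real detector would more likely be a learned model over saliency features.

```python
# Hypothetical toy weights for a linear bag-of-words sentiment model.
# "zxq" stands in for a token an attacker exploits; none of this is from the paper.
WEIGHTS = {"great": 2.0, "good": 1.5, "boring": -1.5, "awful": -2.5, "zxq": -6.0}


def logit(tokens):
    """Sum of per-token weights; unknown tokens contribute 0."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def predict(tokens):
    return "positive" if logit(tokens) >= 0 else "negative"


def saliency(tokens):
    """Leave-one-out saliency: how much the logit moves when a token is removed."""
    base = logit(tokens)
    return [abs(base - logit(tokens[:i] + tokens[i + 1:])) for i in range(len(tokens))]


def looks_adversarial(tokens, threshold=3.0):
    """Assumed heuristic: flag the input if any single token is unusually salient."""
    return max(saliency(tokens), default=0.0) > threshold


def correct(tokens):
    """Assumed correction: drop the most salient token and keep the rest."""
    scores = saliency(tokens)
    top = max(range(len(tokens)), key=scores.__getitem__)
    return tokens[:top] + tokens[top + 1:]


benign = "the movie was great and the cast was good".split()
adversarial = benign + ["zxq"]

for sentence in (benign, adversarial):
    if looks_adversarial(sentence):
        sentence = correct(sentence)  # only flagged inputs are corrected
    print(" ".join(sentence), "->", predict(sentence))
```

In this toy setup the flagged sentence is repaired before classification, which is the step a detection-only defense lacks.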
Taxonomy
Topics: Adversarial Robustness in Machine Learning
