TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification

Lingfeng Shen; Ze Zhang; Haiyun Jiang; Ying Chen

arXiv:2302.02023 · cs.CL · February 7, 2023

TL;DR

TextShield introduces a saliency-based detection and correction framework for adversarial sentences in NLP, significantly improving detection accuracy and enabling correction to benign text, thus advancing defense methods beyond mere detection.

Contribution

It proposes a novel saliency-based detector and corrector that extend detection-only defenses to a detection-correction paradigm, filling a key gap in adversarial NLP defenses.

Findings

1. Outperforms state-of-the-art detection methods across various attacks.

2. Achieves higher or comparable detection performance on multiple benchmarks.

3. Effectively converts adversarial sentences into benign ones using saliency-based correction.

Abstract

Adversarial attacks pose a major challenge for neural network models in NLP and preclude their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they cannot give correct predictions on adversarial sentences. To solve this issue, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and we then propose a saliency-based detector that can effectively determine whether an input sentence is adversarial; (2) we design a saliency-based corrector that converts the detected adversarial sentences into benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to…
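
To make the idea of "saliency information" in the abstract concrete, here is a minimal sketch of per-token saliency. It is not the authors' method. It assumes a simple occlusion (leave-one-out) saliency and a hypothetical toy classifier (`TOY_WEIGHTS`, `toy_positive_prob`). The summary statistics in `saliency_features` are also an assumption about what a downstream detector might use.

```python
import math

# Hypothetical toy lexicon standing in for a trained text classifier (assumption).
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}


def toy_positive_prob(tokens):
    """Probability of the positive class under a toy bag-of-words model."""
    score = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_saliency(tokens, prob_fn=toy_positive_prob):
    """Leave-one-out saliency: drop in probability when each token is removed."""
    base = prob_fn(tokens)
    return [base - prob_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def saliency_features(saliency):
    """Summary statistics a detector could consume (illustrative assumption)."""
    abs_s = [abs(s) for s in saliency]
    total = sum(abs_s)
    if total == 0:
        return {"max_abs": 0.0, "concentration": 0.0}
    return {"max_abs": max(abs_s), "concentration": max(abs_s) / total}


tokens = "the movie was great".split()
saliency = occlusion_saliency(tokens)
features = saliency_features(saliency)
print(saliency)
print(features)
```

In this toy case, all of the saliency mass falls on the single sentiment-bearing word. A detector in this spirit would be trained to separate the saliency patterns of benign and adversarial inputs. How TextShield actually computes saliency and builds its detector is described in the paper, not in this summary.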

Videos

TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification (SlidesLive)

Taxonomy

Topics: Adversarial Robustness in Machine Learning