Text Adversarial Purification as Defense against Adversarial Attacks
Linyang Li, Demin Song, Xipeng Qiu

TL;DR
This paper introduces a novel adversarial purification technique for defending against textual adversarial attacks by masking and reconstructing texts using language models, effectively countering strong word-substitution attacks.
Contribution
It pioneers the application of adversarial purification in NLP by leveraging language models to defend against word-substitution adversarial attacks.
Findings
Successfully defends against Textfooler and BERT-Attack
Effective in removing adversarial perturbations from text
Improves robustness of textual models
Abstract
Adversarial purification is a successful defense mechanism against adversarial attacks that does not require knowledge of the form of the incoming attack. Generally, adversarial purification aims to remove adversarial perturbations so that the model can make correct predictions based on the recovered clean samples. Despite the success of adversarial purification in computer vision, where it incorporates generative models such as energy-based models and diffusion models, purification is rarely explored as a defense strategy against textual adversarial attacks. In this work, we introduce a novel adversarial purification method that focuses on defending against textual adversarial attacks. With the help of language models, we inject noise by masking input texts and then reconstruct the masked texts with masked language models. In this way, we construct an adversarial…
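The mask-and-reconstruct idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper names (mask_tokens, reconstruct, purify_and_predict), the whitespace tokenization, the mask rate, and the majority vote over several purified copies are all assumptions made for illustration. The fill_mask and classify arguments are hypothetical stand-ins for a masked language model and the classifier under attack.

```python
import random
from collections import Counter
from typing import Callable, List

MASK = "[MASK]"  # placeholder mask symbol (assumption)


def mask_tokens(tokens: List[str], mask_rate: float, rng: random.Random) -> List[str]:
    """Inject noise by replacing a random subset of tokens with MASK."""
    # Mask at least one token when possible, never more than the text holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def reconstruct(masked: List[str], fill_mask: Callable[[List[str], int], str]) -> List[str]:
    """Fill each masked position using a caller-supplied model (hypothetical)."""
    out = list(masked)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = fill_mask(out, i)
    return out


def purify_and_predict(
    text: str,
    fill_mask: Callable[[List[str], int], str],
    classify: Callable[[str], str],
    n_copies: int = 5,       # number of purified copies (assumption)
    mask_rate: float = 0.3,  # fraction of tokens masked (assumption)
    seed: int = 0,
) -> str:
    """Classify several purified copies of the text and return the majority label."""
    rng = random.Random(seed)
    tokens = text.split()  # whitespace tokenization, for illustration only
    votes: Counter = Counter()
    for _ in range(n_copies):
        purified = reconstruct(mask_tokens(tokens, mask_rate, rng), fill_mask)
        votes[classify(" ".join(purified))] += 1
    return votes.most_common(1)[0][0]
```

In practice, fill_mask would wrap a real masked language model and classify would wrap the model being defended. The intuition is that a substituted adversarial word has some chance of being masked and replaced by a more natural token in each copy, so a vote across copies can recover the clean prediction.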
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
