Rethinking Textual Adversarial Defense for Pre-trained Language Models
Jiayi Wang, Rongzhou Bao, Zhuosheng Zhang, Hai Zhao

TL;DR
This paper proposes an anomaly-based constraint that pushes adversarial attacks toward more natural text examples, and a defense framework that uses anomaly detection and randomization to improve the robustness of pre-trained language models.
Contribution
It proposes a new anomaly-based metric and a universal defense framework that do not rely on attack-specific knowledge, enhancing robustness against undetectable adversarial examples.
Findings
Existing adversarial examples are often unnatural and easily detected.
The proposed methods significantly reduce attack success rates.
The defense framework maintains high accuracy on original inputs.
Abstract
Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into making incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drastically decreases, which reveals that PrLMs are not as fragile as previously claimed.…
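The idea of using an anomaly score as a constraint on candidate adversarial examples can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not define Degree of Anomaly or name the detector, so `toy_anomaly_score`, `KNOWN_WORDS`, `filter_by_anomaly`, and the threshold value are all hypothetical stand-ins chosen only to show the filtering step.

```python
from typing import Callable, List

# Hypothetical stand-in vocabulary; a real anomaly detector would be a trained model.
KNOWN_WORDS = {"the", "movie", "film", "was", "great", "good", "terrible", "a"}


def toy_anomaly_score(text: str) -> float:
    """Toy score (an assumption, not the paper's metric):
    the fraction of tokens that fall outside KNOWN_WORDS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in KNOWN_WORDS)
    return unknown / len(tokens)


def filter_by_anomaly(
    candidates: List[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only candidates whose anomaly score does not exceed the threshold."""
    return [c for c in candidates if score_fn(c) <= threshold]


# Candidate perturbations of one input, as an attack might propose them.
candidates = [
    "the movie was great",
    "the m0vie was gr8",   # character-level perturbation, scored as anomalous
    "the film was good",   # word-level substitution, scored as natural
]
kept = filter_by_anomaly(candidates, toy_anomaly_score, threshold=0.25)
print(kept)
```

Under a constraint of this shape, an attack may only use the candidates that survive the filter, which is consistent with the abstract's observation that attack success rates drop once unnatural examples are excluded.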
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
