Learning to Ignore Adversarial Attacks
Yiming Zhang, Yangqiaoyu Zhou, Samuel Carton, Chenhao Tan

TL;DR
This paper introduces rationale models that explicitly learn to ignore adversarial attack tokens, improving the robustness of BERT and RoBERTa by about 10% over baselines on three datasets.
Contribution
It proposes a rationale-based approach that improves robustness by letting models ignore attack tokens, and that reliably outperforms data augmentation with adversarial examples alone.
Findings
Rationale models can ignore over 90% of attack tokens.
Improves robustness by about 10% over baseline models on three datasets, for both BERT and RoBERTa.
Reduces the performance gap between clean and attacked test sets.
Abstract
Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent sizable improvements (10%) over baseline models in robustness on three datasets for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set and hence reduce the effect of adversarial attacks.
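The idea of ignoring attack tokens, and the "over 90% of attack tokens ignored" figure, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture or how the metric is computed, so the binary rationale mask, the `[MASK]` placeholder, and the helper names `apply_rationale` and `attack_ignore_rate` are all assumptions made for illustration.

```python
from typing import List

# Assumption: ignored tokens are replaced by a placeholder before classification.
MASK_TOKEN = "[MASK]"


def apply_rationale(tokens: List[str], rationale: List[int]) -> List[str]:
    """Keep tokens whose rationale bit is 1; replace the rest with MASK_TOKEN."""
    if len(tokens) != len(rationale):
        raise ValueError("tokens and rationale must have the same length")
    return [tok if keep == 1 else MASK_TOKEN for tok, keep in zip(tokens, rationale)]


def attack_ignore_rate(rationale: List[int], is_attack: List[int]) -> float:
    """Fraction of attack tokens that the rationale mask drops (bit 0)."""
    if len(rationale) != len(is_attack):
        raise ValueError("rationale and is_attack must have the same length")
    attack_positions = [i for i, flag in enumerate(is_attack) if flag == 1]
    if not attack_positions:
        raise ValueError("input contains no attack tokens")
    ignored = sum(1 for i in attack_positions if rationale[i] == 0)
    return ignored / len(attack_positions)


# Hypothetical example: the last two tokens were inserted by an attacker.
tokens = ["the", "movie", "was", "great", "zx", "qq"]
is_attack = [0, 0, 0, 0, 1, 1]
# Hypothetical rationale predicted by a model: it drops one of the two attack tokens.
rationale = [1, 1, 1, 1, 0, 1]

masked = apply_rationale(tokens, rationale)
rate = attack_ignore_rate(rationale, is_attack)
print(masked)
print(rate)
```

In this toy case the mask drops one of two attack tokens, so the ignore rate is 0.5. A rate above 0.9, as reported in the abstract, would mean that nearly all inserted tokens never reach the classifier.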
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Dense Connections · Dropout · WordPiece · Adam
