TL;DR
This paper proposes an information bottleneck approach that improves the adversarial robustness of NLP models by retaining task-specific robust features and discarding non-robust ones, yielding higher robust accuracy than previously reported defenses with almost no drop in clean accuracy.
Contribution
It introduces a novel information bottleneck-based method to filter out non-robust features, improving adversarial robustness in NLP models.
Findings
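Achieves higher robust accuracy than all previously reported defense methods.
Suffers almost no drop in clean accuracy on the SST-2, AGNEWS and IMDB datasets.
Supports the feasibility of capturing task-specific robust features while eliminating non-robust ones.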
Abstract
Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features while eliminating the non-robust ones by using information bottleneck theory. Through extensive experiments, we show that models trained with our information bottleneck-based method achieve a significant improvement in robust accuracy, exceeding the performance of all previously reported defense methods while suffering almost no drop in clean accuracy on the SST-2, AGNEWS and IMDB datasets.
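The abstract does not state the training objective, so the following is only a minimal sketch of a generic variational information bottleneck loss, not the paper's method. It assumes a stochastic encoder that outputs a diagonal Gaussian over a latent code (mean `mu`, log-variance `log_var`) and a classifier that outputs `logits`. The function names and the weight `beta` are hypothetical. The loss adds a compression penalty (KL divergence to a standard normal prior) to the usual cross-entropy term, which is the general trade-off an information bottleneck imposes: keep what predicts the label, discard the rest of the input.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single example."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_norm - logits[label]


def vib_loss(logits, label, mu, log_var, beta=0.1):
    """Prediction term plus beta-weighted compression term.

    `beta` is a hypothetical trade-off weight: larger values compress the
    latent code more aggressively at the cost of predictive information.
    """
    return cross_entropy(logits, label) + beta * kl_to_standard_normal(mu, log_var)
```

With `beta=0` this reduces to ordinary cross-entropy training. Raising `beta` pushes the latent code toward the prior, which is the mechanism by which a bottleneck of this kind could suppress features that are predictive but not needed for the task.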
