Improving the Adversarial Robustness of NLP Models by Information Bottleneck

Cenyuan Zhang; Xiang Zhou; Yixin Wan; Xiaoqing Zheng; Kai-Wei Chang; Cho-Jui Hsieh

arXiv:2206.05511 · cs.CL · June 14, 2022

TL;DR

This paper proposes an information bottleneck approach that improves the adversarial robustness of NLP models by keeping task-specific robust features and discarding non-robust ones. It reports higher robust accuracy than previously reported defenses, with almost no drop in clean accuracy.

Contribution

It introduces a novel information bottleneck-based method to filter out non-robust features, improving adversarial robustness in NLP models.

Findings

01 Achieves higher robust accuracy than previously reported defense methods.

02 Suffers almost no drop in clean accuracy on the SST-2, AGNEWS and IMDB datasets.

03 Effective in eliminating non-robust features.

Abstract

Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive, but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory. Through extensive experiments, we show that the models trained with our information bottleneck-based method are able to achieve a significant improvement in robust accuracy, exceeding performances of all the previously reported defense methods while suffering almost no performance drop in clean accuracy on SST-2, AGNEWS and IMDB datasets.
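The abstract does not spell out the training objective, so the following is only a minimal sketch of a generic variational information bottleneck loss, not necessarily the authors' formulation. It assumes a stochastic encoder that outputs a diagonal Gaussian over a latent code (parameters `mu` and `log_var`), a standard normal prior, and a trade-off weight `beta`; all of these names and choices are illustrative assumptions. The cross-entropy term encourages the latent code to stay predictive of the label, while the KL term penalizes information retained about the input, which is the compression pressure that is meant to squeeze out non-robust features.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]

def vib_loss(logits, label, mu, log_var, beta=0.1):
    """Generic variational IB loss for one example (hypothetical helper).

    The prediction term keeps the latent code informative about the label;
    the beta-weighted KL term limits how much it encodes about the input.
    """
    prediction_term = cross_entropy(logits, label)
    compression_term = gaussian_kl_to_standard_normal(mu, log_var)
    return prediction_term + beta * compression_term

# Example: a two-class prediction with a two-dimensional latent code.
loss = vib_loss(logits=[2.0, -1.0], label=0, mu=[1.0, 0.0], log_var=[0.0, 0.0], beta=0.1)
print(f"VIB loss: {loss:.4f}")
```

A larger `beta` compresses the representation more aggressively, which in this kind of setup trades some clean accuracy for robustness; the value 0.1 here is arbitrary and not taken from the paper.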

Code & Models

Repositories

zhangcen456/ib (PyTorch, official)
