Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation
Hanjie Chen, Yangfeng Ji

TL;DR
This paper introduces FLAT, a novel adversarial training method that improves the robustness of neural language models by aligning both their predictions and their interpretations on original/adversarial example pairs, particularly adversarial examples created by synonym substitution.
Contribution
FLAT is a new feature-level adversarial training approach that regularizes global word importance scores to improve model robustness in predictions and interpretations.
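To make the idea of regularizing word importance scores concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the squared-gap penalty, the weight `beta`, the function names, and the toy importance values are all assumptions chosen for illustration. It only shows how a regularizer could push a replaced word and its substitute toward the same importance score while the prediction losses on both examples are still minimized.

```python
from typing import Dict, Iterable, Tuple


def importance_gap_penalty(
    importance: Dict[str, float],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Sum of squared gaps between the importance score of each
    original word and the score of its substitute (hypothetical form)."""
    return sum((importance[w] - importance[s]) ** 2 for w, s in pairs)


def regularized_loss(
    pred_loss_orig: float,
    pred_loss_adv: float,
    importance: Dict[str, float],
    pairs: Iterable[Tuple[str, str]],
    beta: float = 0.1,  # assumed regularization weight
) -> float:
    """Prediction losses on both examples plus the weighted penalty."""
    return pred_loss_orig + pred_loss_adv + beta * importance_gap_penalty(importance, pairs)


# Toy global importance scores (made-up numbers, not from the paper).
importance = {"fantastic": 0.9, "marvelous": 0.4, "movie": 0.2, "film": 0.2}
pairs = [("fantastic", "marvelous"), ("movie", "film")]

penalty = importance_gap_penalty(importance, pairs)
total = regularized_loss(1.0, 2.0, importance, pairs, beta=2.0)
```

In this toy setup only the "fantastic"/"marvelous" pair contributes to the penalty, because "movie" and "film" already share the same score.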
Findings
FLAT improves the robustness of LSTM, CNN, BERT, and DeBERTa models against adversarial attacks.
Models trained with FLAT show better generalization to unseen adversarial examples.
FLAT improves both prediction accuracy and the consistency of interpretations under adversarial conditions.
Abstract
Neural language models are vulnerable to adversarial examples that are semantically similar to their original counterparts, with a few words replaced by synonyms. A common way to improve model robustness is adversarial training, which follows two steps: collecting adversarial examples by attacking a target model, and fine-tuning the model on the dataset augmented with these adversarial examples. The objective of traditional adversarial training is to make a model produce the same correct predictions on an original/adversarial example pair. However, the consistency of the model's decision-making on two similar texts is ignored. We argue that a robust model should behave consistently on original/adversarial example pairs, that is, make the same predictions (what) based on the same reasons (how), which can be reflected by consistent interpretations. In this work, we propose a novel…
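The "what" and "how" criterion in the abstract can be illustrated with a small check. The sketch below is an assumption-laden example rather than the paper's evaluation protocol: it treats an interpretation as a dictionary of per-word importance scores, maps each substituted word back to the word it replaced, and calls a pair consistent when the predictions match and the top-k important words overlap sufficiently. The function names, the top-k overlap measure, and the thresholds are hypothetical.

```python
from typing import Dict, Set


def top_k_words(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k words with the highest importance scores."""
    ranked = sorted(scores, key=lambda w: scores[w], reverse=True)
    return set(ranked[:k])


def is_consistent(
    pred_orig: str,
    pred_adv: str,
    scores_orig: Dict[str, float],
    scores_adv: Dict[str, float],
    synonym_of: Dict[str, str],
    k: int = 2,
    min_overlap: float = 1.0,  # assumed threshold on top-k overlap
) -> bool:
    """Same prediction (what) and sufficiently similar top-k words (how)."""
    # Rename substituted words to the original words they replaced,
    # so the two interpretations can be compared word by word.
    mapped_adv = {synonym_of.get(w, w): s for w, s in scores_adv.items()}
    overlap = len(top_k_words(scores_orig, k) & top_k_words(mapped_adv, k)) / k
    return pred_orig == pred_adv and overlap >= min_overlap


# Toy example with made-up scores.
synonym_of = {"marvelous": "fantastic", "film": "movie"}
scores_orig = {"a": 0.1, "fantastic": 0.8, "movie": 0.3}
scores_adv_good = {"a": 0.1, "marvelous": 0.7, "film": 0.4}
scores_adv_bad = {"a": 0.6, "marvelous": 0.1, "film": 0.5}

same_reasons = is_consistent("pos", "pos", scores_orig, scores_adv_good, synonym_of)
different_reasons = is_consistent("pos", "pos", scores_orig, scores_adv_bad, synonym_of)
```

In the second case the prediction is unchanged, yet the model relies on different words, which is the kind of inconsistency that prediction-only adversarial training does not penalize.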
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Adam · Attention Dropout · Residual Connection · Dense Connections
