Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation

Hanjie Chen; Yangfeng Ji

arXiv:2203.12709·cs.CL·March 25, 2022

TL;DR

This paper introduces FLAT, an adversarial training method that makes neural language models more robust by keeping both their predictions and their interpretations consistent between original and adversarial examples, in particular examples produced by synonym substitution.

Contribution

FLAT is a new feature-level adversarial training approach that regularizes global word importance scores to improve model robustness in predictions and interpretations.
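
To make the idea of regularizing word importance scores concrete, here is a minimal sketch, not the authors' implementation. It assumes each vocabulary word has a scalar global importance score and that the regularizer penalizes the squared gap between a word and its synonym substitute. The function names, the squared-gap form, and the `weight` parameter are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def importance_gap_penalty(
    importance: Dict[str, float],
    synonym_pairs: List[Tuple[str, str]],
) -> float:
    """Mean squared gap between the importance scores of paired words.

    Assumption: a word and its substitute should receive similar
    importance, so a large gap is penalized.
    """
    if not synonym_pairs:
        return 0.0
    total = 0.0
    for original, substitute in synonym_pairs:
        gap = importance[original] - importance[substitute]
        total += gap * gap
    return total / len(synonym_pairs)


def total_loss(
    task_loss: float,
    importance: Dict[str, float],
    synonym_pairs: List[Tuple[str, str]],
    weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Task loss plus the weighted importance-consistency penalty."""
    return task_loss + weight * importance_gap_penalty(importance, synonym_pairs)


# Toy scores: "good"/"great" disagree, "film"/"movie" already agree.
scores = {"good": 0.9, "great": 0.5, "film": 0.2, "movie": 0.2}
pairs = [("good", "great"), ("film", "movie")]
penalty = importance_gap_penalty(scores, pairs)
```

In this toy setup only the "good"/"great" pair contributes to the penalty, so minimizing it would push those two scores together while leaving "film"/"movie" untouched.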

Findings

01

FLAT improves the robustness of LSTM, CNN, BERT, and DeBERTa models against adversarial attacks.

02

Models trained with FLAT show better generalization to unseen adversarial examples.

03

FLAT enhances both prediction accuracy and interpretability under adversarial conditions.

Abstract

Neural language models are vulnerable to adversarial examples that are semantically similar to their original counterparts, with a few words replaced by synonyms. A common way to improve model robustness is adversarial training, which follows two steps: collecting adversarial examples by attacking a target model, and fine-tuning the model on the dataset augmented with these adversarial examples. The objective of traditional adversarial training is to make a model produce the same correct prediction on an original/adversarial example pair. However, the consistency of the model's decision-making on the two similar texts is ignored. We argue that a robust model should behave consistently on original/adversarial example pairs, that is, make the same predictions (what) based on the same reasons (how), which can be reflected by consistent interpretations. In this work, we propose a novel…
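
The two-step recipe for traditional adversarial training described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the attack or training code used in the paper: the attack is a greedy single-word synonym swap, `predict` stands in for any classifier, and the names `attack`, `augment`, and `toy_predict` are hypothetical. Step 2 (fine-tuning on the augmented data) is left to whatever training loop the model already uses.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[List[str], int]  # (tokens, label)


def attack(
    tokens: List[str],
    label: int,
    predict: Callable[[List[str]], int],
    synonyms: Dict[str, List[str]],
) -> Optional[List[str]]:
    """Return the first single-word synonym swap that flips the prediction."""
    for i, word in enumerate(tokens):
        for substitute in synonyms.get(word, []):
            candidate = tokens[:i] + [substitute] + tokens[i + 1:]
            if predict(candidate) != label:
                return candidate
    return None  # no successful swap found


def augment(
    dataset: List[Example],
    predict: Callable[[List[str]], int],
    synonyms: Dict[str, List[str]],
) -> List[Example]:
    """Step 1: collect adversarial examples and append them to the dataset."""
    augmented = list(dataset)
    for tokens, label in dataset:
        if predict(tokens) != label:
            continue  # only attack examples the model currently gets right
        adversarial = attack(tokens, label, predict, synonyms)
        if adversarial is not None:
            augmented.append((adversarial, label))  # keep the gold label
    return augmented


def toy_predict(tokens: List[str]) -> int:
    """Brittle stand-in classifier: positive only if it sees 'good'."""
    return 1 if "good" in tokens else 0


data = [(["good", "film"], 1), (["bad", "film"], 0)]
augmented = augment(data, toy_predict, {"good": ["fine"]})
# Step 2 would fine-tune the model on `augmented`.
```

Here the toy classifier is fooled by swapping "good" for "fine", so the augmented set gains one adversarial example that carries the original label.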

Code & Models

Repositories

uva-nlp/flat (official; framework tag: jax)

Videos

Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Adam · Attention Dropout · Residual Connection · Dense Connections