How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?

Xinshuai Dong; Luu Anh Tuan; Min Lin; Shuicheng Yan; Hanwang Zhang

arXiv:2112.11668 · cs.CL · December 23, 2021 · 30 cites

TL;DR

This paper introduces RIFT, a novel adversarial fine-tuning method for pre-trained language models that better retains learned features and improves robustness against adversarial attacks in NLP tasks.

Contribution

The paper proposes RIFT, an information-theoretic adversarial fine-tuning approach that mitigates catastrophic forgetting and enhances adversarial robustness of pre-trained models.
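As a rough illustration of what "retaining pre-trained features" can mean in information-theoretic terms, the sketch below adds a contrastive (InfoNCE-style) term to a task loss. InfoNCE is a commonly used lower bound on mutual information. The use of InfoNCE here, along with the names `info_nce`, `combined_loss`, `alpha`, and `temperature`, are assumptions made for illustration and are not taken from this page or confirmed as the paper's exact objective.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(tuned, frozen, temperature=0.1):
    """InfoNCE loss over a batch of feature pairs.

    tuned[i] is the fine-tuned model's feature for example i and
    frozen[i] is the frozen pre-trained model's feature for the same
    example. The loss is low when each tuned[i] is most similar to its
    own frozen[i] compared with the other examples in the batch.
    """
    total = 0.0
    for i, s in enumerate(tuned):
        logits = [dot(s, t) / temperature for t in frozen]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(tuned)


def combined_loss(task_loss, tuned, frozen, alpha=0.5):
    """Task loss plus a weighted feature-retention term (alpha is hypothetical)."""
    return task_loss + alpha * info_nce(tuned, frozen)


frozen = [[1.0, 0.0], [0.0, 1.0]]    # features from the frozen pre-trained model
retained = [[1.0, 0.0], [0.0, 1.0]]  # fine-tuned features that stayed aligned
drifted = [[0.0, 1.0], [1.0, 0.0]]   # fine-tuned features that lost the alignment

print(info_nce(retained, frozen) < info_nce(drifted, frozen))
```

In this toy setting, features that drift away from their pre-trained counterparts incur a higher penalty, which is the intuition behind discouraging catastrophic forgetting during adversarial fine-tuning.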

Findings

01. RIFT outperforms state-of-the-art methods on sentiment analysis and natural language inference.

02. RIFT maintains more robust linguistic features during fine-tuning.

03. Experimental results show improved resistance to various adversarial attacks.

Abstract

The fine-tuning of pre-trained language models has had great success in many NLP fields. Yet, it is strikingly vulnerable to adversarial examples, e.g., word substitution attacks using only synonyms can easily fool a BERT-based sentiment analysis model. In this paper, we demonstrate that adversarial training, the prevalent defense technique, does not directly fit a conventional fine-tuning scenario, because it suffers severely from catastrophic forgetting: failing to retain the generic and robust linguistic features that have already been captured by the pre-trained model. In this light, we propose Robust Informative Fine-Tuning (RIFT), a novel adversarial fine-tuning method from an information-theoretical perspective. In particular, RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process, whereas a…
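To make the notion of a synonym-based word substitution attack concrete, here is a minimal sketch against a toy lexicon classifier. Everything in it is hypothetical: the `WEIGHTS` and `SYNONYMS` tables, the `predict` scorer, and the exhaustive search stand in for a real BERT-based model and the search heuristics that practical attacks use.

```python
from itertools import product

# Hypothetical word weights for a toy sentiment scorer (not a real model).
WEIGHTS = {"great": 2.0, "good": 0.8, "fine": 0.2, "bad": -1.0}

# Hypothetical synonym table defining the allowed substitutions.
SYNONYMS = {"great": ["good", "fine"]}


def predict(tokens):
    """Label a token sequence by summing per-word weights."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return "positive" if score > 0 else "negative"


def synonym_attack(tokens, label):
    """Return the first synonym-substituted variant whose prediction
    differs from `label`, or None if no such variant exists."""
    choices = [[t] + SYNONYMS.get(t, []) for t in tokens]
    for candidate in product(*choices):
        if predict(candidate) != label:
            return list(candidate)
    return None


tokens = ["great", "plot", "bad", "acting"]
print(predict(tokens))                     # positive
print(synonym_attack(tokens, "positive"))  # ['good', 'plot', 'bad', 'acting']
```

Swapping "great" for its synonym "good" keeps the sentence meaning roughly intact yet flips the toy prediction. A model is robust to this attack on an input only if every combination of allowed substitutions leaves the prediction unchanged.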

Code & Models

Repositories

dongxinshuai/rift-neurips2021
PyTorch · Official

Videos

How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection