On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Marius Mosbach; Maksym Andriushchenko; Dietrich Klakow

arXiv:2006.04884 · cs.LG · March 26, 2021 · 211 cites

TL;DR

This paper investigates why fine-tuning BERT models is unstable, shows that the two previously proposed explanations (catastrophic forgetting and small fine-tuning datasets) do not account for it, traces the instability to optimization difficulties and differences in generalization, and proposes a simple baseline that improves stability.

Contribution

The paper shows that fine-tuning instability stems from optimization difficulties rather than from catastrophic forgetting or small dataset size, and introduces a simple but strong baseline that makes fine-tuning more stable.

Findings

01 Fine-tuning instability is caused by optimization difficulties that lead to vanishing gradients.

02 Differences in generalization contribute to performance variance even when runs reach similar training loss.

03 A simple baseline significantly improves the stability of BERT fine-tuning.

Abstract

Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that…
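
The instability described in the abstract is a spread in task performance across random seeds, so one simple way to quantify it is to summarize the final scores of several runs. The sketch below is only an illustration of that idea: the function name `summarize_runs` and the scores in the usage example are made-up placeholders, not the paper's evaluation code or results.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Summarize task scores from runs that differ only in random seed.

    `scores` is a list of final evaluation scores (for example accuracy),
    one per seed. A large standard deviation or a wide min/max gap
    indicates unstable fine-tuning.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure variability")
    return {
        "mean": mean(scores),
        "std": stdev(scores),  # sample standard deviation across seeds
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only; NOT results from the paper.
    hypothetical_scores = [0.52, 0.71, 0.69, 0.53, 0.70]
    print(summarize_runs(hypothetical_scores))
```

Reporting the minimum and maximum alongside the standard deviation makes it easier to spot a few poorly performing runs that a mean alone would hide.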

Code & Models

Videos

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines · SlidesLive

Taxonomy

Topics: Nuclear reactor physics and engineering · Magnetic confinement fusion research · Model Reduction and Neural Networks

Methods: Linear Layer · Weight Decay · Softmax · Adam · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Layer Normalization
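
Among the listed methods, "Linear Warmup With Linear Decay" is a learning-rate schedule that ramps the rate up linearly to a peak and then decreases it linearly to zero. The following is a minimal sketch of that schedule in plain Python; the function name, the step counts, and the peak rate in the usage example are illustrative assumptions, not settings taken from this page.

```python
def linear_warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Return the learning rate at `step` (0-indexed).

    The rate rises linearly from 0 to `peak_lr` over `warmup_steps`,
    then falls linearly back to 0 at `total_steps`.
    """
    if total_steps <= 0 or not 0 <= warmup_steps <= total_steps:
        raise ValueError("require total_steps > 0 and 0 <= warmup_steps <= total_steps")
    if step < warmup_steps:
        # Warmup phase: fraction of the peak proportional to progress.
        return peak_lr * step / warmup_steps
    # Decay phase: fraction of the peak proportional to the steps remaining.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the shape of the schedule.
    for s in (0, 5, 10, 55, 100):
        print(s, linear_warmup_linear_decay(s, total_steps=100, warmup_steps=10, peak_lr=2e-5))
```

In practice such a function is usually evaluated once per optimizer step and the result is assigned as the optimizer's current learning rate.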