On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow

TL;DR
This paper investigates why fine-tuning BERT models is unstable, shows that two earlier hypotheses (catastrophic forgetting and small fine-tuning datasets) fail to explain it, and proposes a simple baseline that improves stability. It traces the instability to optimization difficulties and to differences in generalization.
Contribution
The paper reveals that fine-tuning instability is due to optimization issues rather than catastrophic forgetting or dataset size, and introduces a strong baseline for stability.
Findings
Fine-tuning instability is caused by optimization difficulties leading to vanishing gradients.
Differences in generalization contribute to performance variance even with similar training loss.
A simple baseline significantly improves the stability of BERT fine-tuning.
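The first finding suggests a simple diagnostic: watch per-layer gradient norms during fine-tuning and flag layers whose gradients have effectively vanished. The snippet below is a minimal sketch of that idea in plain Python, not the authors' code. The function names, the `1e-6` threshold, the layer names, and the gradient values are all assumptions made up for illustration.

```python
import math


def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def flag_vanishing(layer_grads, threshold=1e-6):
    """Return the names of layers whose gradient norm falls below `threshold`.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    return [
        name
        for name, grads in layer_grads.items()
        if grad_norm(grads) < threshold
    ]


# Hypothetical per-layer gradients from one training step (made-up numbers).
example_grads = {
    "embeddings": [1e-9, -2e-9, 5e-10],
    "encoder.layer.11": [3e-8, 1e-8],
    "classifier": [0.12, -0.05, 0.3],
}

vanishing = flag_vanishing(example_grads)
print(vanishing)
```

In a real training loop the per-layer values would come from the framework's gradient tensors. The point of the sketch is only that a run with near-zero gradient norms in the lower layers, while the classification head still receives gradient signal, matches the kind of optimization failure described above.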
Abstract
Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that…
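One common way to quantify the instability described in the abstract is to fine-tune the same model with several random seeds and report the spread of the resulting task scores. The sketch below assumes that setup. The helper name `summarize_runs` and the scores are hypothetical and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Summarize task performance across runs that differ only in random seed."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Made-up dev-set accuracies for five seeds; one run is a clear failure.
seed_scores = [0.70, 0.72, 0.53, 0.71, 0.69]

summary = summarize_runs(seed_scores)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

A large standard deviation, or a wide gap between the minimum and the maximum, indicates that the outcome depends heavily on the seed. This is the kind of variance the abstract refers to.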
Taxonomy
Topics: Model Reduction and Neural Networks
Methods: Linear Layer · Weight Decay · Softmax · Adam · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Layer Normalization
