What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning
Jaejun Lee, Raphael Tang, Jimmy Lin

TL;DR
This paper investigates how many of the final layers of transformer models such as BERT and RoBERTa need to be fine-tuned, and finds that fine-tuning only the final quarter of the layers retains about 90% of full fine-tuning performance across a range of NLP tasks.
Contribution
It provides a precise analysis of layer-wise fine-tuning, showing that only a subset of final layers is necessary, challenging the assumption that all layers must be fine-tuned.
Findings
Fine-tuning the last quarter of layers achieves 90% of full fine-tuning performance.
Fine-tuning all layers does not always improve results.
Only a few final layers are needed for effective transfer learning.
Abstract
Pretrained transformer-based language models have achieved state of the art across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of…
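The procedure described in the abstract, freezing the lower layers and fine-tuning only the last few, can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' code: `TinyEncoderClassifier` and `freeze_all_but_last` are hypothetical names, the model is a small randomly initialized stand-in for a 12-layer encoder such as BERT-base, and the choice to freeze the embeddings while keeping the classification head trainable is an assumption.

```python
import torch.nn as nn


class TinyEncoderClassifier(nn.Module):
    """Hypothetical stand-in for a 12-layer pretrained encoder with a task head."""

    def __init__(self, num_layers=12, d_model=16, vocab_size=100, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d_model, nhead=2, dim_feedforward=32, batch_first=True
                )
                for _ in range(num_layers)
            ]
        )
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for layer in self.layers:
            x = layer(x)
        # Classify from the first token's representation.
        return self.head(x[:, 0])


def freeze_all_but_last(model, num_trainable_layers):
    """Freeze embeddings and lower layers; leave the last N layers and head trainable."""
    total = len(model.layers)
    if not 0 <= num_trainable_layers <= total:
        raise ValueError("num_trainable_layers must be between 0 and the layer count")
    first_trainable = total - num_trainable_layers

    # Assumption: embeddings are frozen together with the lower layers.
    for param in model.embed.parameters():
        param.requires_grad = False
    for index, layer in enumerate(model.layers):
        for param in layer.parameters():
            param.requires_grad = index >= first_trainable
    # Assumption: the task-specific head is always trained.
    for param in model.head.parameters():
        param.requires_grad = True
    return model


# A quarter of 12 layers is 3, so only layers 9, 10, and 11 stay trainable.
model = freeze_all_but_last(TinyEncoderClassifier(), num_trainable_layers=3)
trainable_layers = [
    index
    for index, layer in enumerate(model.layers)
    if any(param.requires_grad for param in layer.parameters())
]
```

When building the optimizer, passing only the parameters with `requires_grad` set to `True` avoids tracking optimizer state for the frozen layers. Sweeping `num_trainable_layers` from 0 up to the full layer count reproduces the kind of comparison the abstract describes.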
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · RoBERTa · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece
