What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

Jaejun Lee; Raphael Tang; Jimmy Lin

arXiv:1911.03090·cs.CL·November 11, 2019·34 cites

TL;DR

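This paper investigates how many of the final layers of pretrained transformer models such as BERT and RoBERTa need to be fine-tuned. It finds that fine-tuning only the last quarter of the layers reaches 90% of full fine-tuning performance on standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability.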

Contribution

The paper varies the number of final layers that are fine-tuned and measures the resulting change in task effectiveness. The results show that only a subset of the final layers needs to be fine-tuned, which challenges the assumption that every layer must be updated.

Findings

01. Fine-tuning the last quarter of layers achieves 90% of full fine-tuning performance.

02. Fine-tuning all layers does not always improve results.

03. Only a few final layers are needed for effective transfer learning.

Abstract

Pretrained transformer-based language models have achieved state of the art across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of…
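The layer-freezing setup described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' code. It assumes parameter names follow the pattern `...encoder.layer.<index>....` (the example names below are illustrative), and it assumes that embeddings stay frozen while the task head stays trainable; the source does not state either choice. The helper names `is_trainable` and `freeze_plan` are hypothetical.

```python
import re

# Assumption: encoder blocks are identified by "encoder.layer.<index>." in the
# parameter name. Adjust the pattern if your model names its layers differently.
_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name: str, num_layers: int, num_finetuned: int) -> bool:
    """Return True if this parameter should be updated during fine-tuning.

    Only the last `num_finetuned` of `num_layers` encoder blocks are trainable.
    """
    if not 0 <= num_finetuned <= num_layers:
        raise ValueError("num_finetuned must be between 0 and num_layers")
    match = _LAYER_RE.search(param_name)
    if match is None:
        # Not inside an encoder block. Assumption for this sketch: embeddings
        # stay frozen, everything else (pooler, task head) stays trainable.
        return "embeddings" not in param_name
    layer_index = int(match.group(1))
    return layer_index >= num_layers - num_finetuned


def freeze_plan(param_names, num_layers, num_finetuned):
    """Map each parameter name to True (train) or False (freeze)."""
    return {
        name: is_trainable(name, num_layers, num_finetuned)
        for name in param_names
    }


# Illustrative parameter names for a 12-layer encoder with a classifier head.
example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.8.output.dense.weight",
    "bert.encoder.layer.9.attention.self.query.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "classifier.weight",
]

# A quarter of a 12-layer encoder is 3 layers (indices 9, 10, and 11).
plan = freeze_plan(example_names, num_layers=12, num_finetuned=12 // 4)
for name, trainable in plan.items():
    print(f"{'train ' if trainable else 'freeze'}  {name}")
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, num_layers, num_finetuned)`. Sweeping `num_finetuned` from 0 to `num_layers` then reproduces the kind of comparison the abstract describes. Note that this sketch keeps embeddings frozen even when `num_finetuned == num_layers`, which may differ from a conventional full fine-tuning baseline.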

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · RoBERTa · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece