Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

Armen Aghajanyan; Luke Zettlemoyer; Sonal Gupta

arXiv:2012.13255·cs.LG·December 25, 2020

TL;DR

This paper argues that pretrained language models have a very low intrinsic dimension, which helps explain why they can be fine-tuned effectively even with limited labeled data: a low-dimensional reparameterization is as effective for fine-tuning as the full parameter space.

Contribution

It analyzes fine-tuning through the lens of intrinsic dimension, showing empirically that common pretrained models have a very low intrinsic dimension and that pretraining reduces this dimension.

Findings

01

Pretrained models have low intrinsic dimension.

02

Optimizing a small number of parameters in a low-dimensional reparameterization achieves performance close to that of full fine-tuning.

03

Pretraining reduces intrinsic dimension, especially in larger models.

Abstract

Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200…
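
The "low dimension reparameterization" mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a dense random projection matrix `P` and a linear regression objective, both chosen here only for illustration, and all names (`make_projection`, `train_in_subspace`, `demo`) are hypothetical. The full parameter vector of size `D` is written as `theta0 + P @ z`, where `theta0` stands in for the pretrained weights and only the `d`-dimensional vector `z` is trained.

```python
import numpy as np


def make_projection(D, d, seed=0):
    """Fixed random matrix of shape (D, d) with unit-norm columns (an assumed construction)."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((D, d))
    return P / np.linalg.norm(P, axis=0, keepdims=True)


def full_params(theta0, P, z):
    """Map the d trainable values z back to the full D-dimensional parameter vector."""
    return theta0 + P @ z


def mse_loss(theta, X, y):
    """Mean squared error of a linear model with parameters theta."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2))


def train_in_subspace(theta0, P, X, y, lr=0.1, steps=500):
    """Gradient descent on z only; theta0 and P stay frozen."""
    z = np.zeros(P.shape[1])
    n = X.shape[0]
    for _ in range(steps):
        theta = full_params(theta0, P, z)
        grad_theta = (2.0 / n) * X.T @ (X @ theta - y)  # gradient w.r.t. full parameters
        grad_z = P.T @ grad_theta                       # chain rule through theta0 + P @ z
        z -= lr * grad_z
    return z


def demo(D=1000, d=10, n=200, seed=1):
    """Toy task whose solution lies in the d-dimensional subspace by construction."""
    rng = np.random.default_rng(seed)
    theta0 = rng.standard_normal(D)      # stand-in for pretrained weights
    P = make_projection(D, d)
    z_true = rng.standard_normal(d)      # hypothetical task-specific offset
    X = rng.standard_normal((n, D))
    y = X @ full_params(theta0, P, z_true)
    z = train_in_subspace(theta0, P, X, y)
    initial = mse_loss(theta0, X, y)
    final = mse_loss(full_params(theta0, P, z), X, y)
    return initial, final, z
```

Calling `demo()` trains 10 values instead of 1000. The toy task is built so that a solution exists inside the subspace, which is the situation the paper argues holds approximately for pretrained language models; it says nothing about how small `d` can be for a real model.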

Taxonomy

Methods: Linear Layer · Softmax · WordPiece · Linear Warmup With Linear Decay · Adam · Dense Connections · Attention Dropout · Layer Normalization · Attention Is All You Need