TL;DR
This paper argues that pretrained language models have a low intrinsic dimension, which helps explain why they fine-tune efficiently even with limited data: optimizing a small number of parameters in a low-dimensional reparameterization is enough to come close to full fine-tuning performance.
Contribution
It analyzes fine-tuning through the lens of intrinsic dimension, showing empirically that common pretrained models have a low intrinsic dimension and that pretraining reduces it further, which helps explain why fine-tuning is so effective.
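For readers who want the idea in symbols, intrinsic dimension is usually formalized as follows. This is a sketch in notation that does not appear in the summary above: the symbols \theta_0^{D}, \theta^{d}, P, and d_{90} are assumptions chosen for illustration, and "metric" stands for whatever task score is being tracked.

```latex
% Only the d-dimensional vector \theta^{d} is trained; the pretrained
% weights \theta_0^{D} and the projection P stay fixed.
\theta^{D} = \theta_0^{D} + P(\theta^{d}),
\qquad P : \mathbb{R}^{d} \to \mathbb{R}^{D},
\qquad d \ll D

% One practical estimate: the smallest d whose best solution reaches
% 90% of the score obtained by fine-tuning all D parameters.
d_{90} = \min \Bigl\{\, d \;:\;
  \max_{\theta^{d}} \operatorname{metric}\bigl(\theta_0^{D} + P(\theta^{d})\bigr)
  \;\ge\; 0.9 \cdot \operatorname{metric}_{\text{full}} \Bigr\}
```

Under this reading, a "low intrinsic dimension" means that d_{90} is small relative to the total parameter count D.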
Findings
Pretrained models have low intrinsic dimension.
Optimizing a small number of parameters in a low-dimensional reparameterization comes close to full-model fine-tuning performance.
Pretraining reduces intrinsic dimension, especially in larger models.
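The findings above can be illustrated with a toy experiment. The sketch below is not the paper's setup: it swaps the language model for an underdetermined linear regression, uses a dense Gaussian projection for simplicity, and solves each subspace problem with exact least squares instead of gradient descent. The names `subspace_loss` and `toy_d90` are hypothetical, and the 90% threshold is applied to loss reduction as a stand-in for a task metric.

```python
import numpy as np

def subspace_loss(X, y, theta0, P, d):
    """Best mean squared error when only the first d subspace coordinates are trained.

    The full weights are constrained to theta0 + P[:, :d] @ z, with theta0 and P fixed.
    """
    if d == 0:
        theta = theta0  # nothing is trained: evaluate the starting weights
    else:
        A = X @ P[:, :d]  # how each subspace coordinate moves the predictions
        z, *_ = np.linalg.lstsq(A, y - X @ theta0, rcond=None)
        theta = theta0 + P[:, :d] @ z
    return float(np.mean((X @ theta - y) ** 2))

def toy_d90(X, y, theta0, P):
    """Smallest d that recovers 90% of the loss reduction of full-space training."""
    D = theta0.shape[0]
    base = subspace_loss(X, y, theta0, P, 0)
    full = subspace_loss(X, y, theta0, P, D)
    target = base - 0.9 * (base - full)
    for d in range(D + 1):
        if subspace_loss(X, y, theta0, P, d) <= target:
            return d
    return D

rng = np.random.default_rng(0)
D, n = 200, 20                       # 200 parameters, only 20 training examples
X = rng.standard_normal((n, D))
theta0 = rng.standard_normal(D)      # stand-in for "pretrained" weights
y = X @ rng.standard_normal(D)       # labels generated by a different weight vector
P = rng.standard_normal((D, D)) / np.sqrt(D)  # fixed random projection, nested by column

losses = {d: subspace_loss(X, y, theta0, P, d) for d in (0, 5, 10, 20, 40, 200)}
d90 = toy_d90(X, y, theta0, P)
print(losses)
print("toy d90:", d90)
```

In this toy problem each training example imposes one linear constraint, so the loss should fall to numerical zero once d reaches the number of examples, far below the 200 available parameters. That is the sense in which a problem with many equally good solutions has a low intrinsic dimension.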
Abstract
Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200…
Taxonomy
Methods: Linear Layer · Softmax · WordPiece · Linear Warmup With Linear Decay · Adam · Dense Connections · Attention Dropout · Layer Normalization · Attention Is All You Need
