How fine can fine-tuning be? Learning efficient language models
Evani Radiya-Dixit, Xin Wang

TL;DR
This paper investigates how fine-tuning large language models like BERT can be made more efficient by identifying critical layers and sparsifying parameters, reducing storage and computation without sacrificing performance.
Contribution
The paper shows that fine-tuning can be restricted to a few key layers and that sparse models can match the results of full fine-tuning, which reduces the storage and computation needed for each task.
Findings
Fine-tuned models are close in parameter space to pre-trained models.
Only a subset of layers needs to be fine-tuned for effective performance.
Sparse models with many zeroed entries can achieve comparable results.
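The findings above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the layer names, the toy 4x4 weights, the mean absolute difference used as a distance, and the random binary mask are all assumptions chosen for illustration. The paper's own distance metrics and its procedure for finding sparse models may differ.

```python
import numpy as np


def layer_distances(pretrained, finetuned):
    """Mean absolute per-parameter difference for each named layer.

    Assumed metric for illustration; not necessarily the one used in the paper.
    """
    return {
        name: float(np.mean(np.abs(finetuned[name] - pretrained[name])))
        for name in pretrained
    }


def apply_mask(weights, mask):
    """Zero the entries where mask == 0.

    Returns the sparsified copy and the fraction of zeroed entries.
    """
    sparse = weights * mask
    sparsity = 1.0 - np.count_nonzero(mask) / mask.size
    return sparse, sparsity


rng = np.random.default_rng(0)

# Toy stand-ins for pre-trained weights (hypothetical layer names).
pretrained = {
    "layer_0": rng.normal(size=(4, 4)),
    "layer_1": rng.normal(size=(4, 4)),
}

# Hypothetical fine-tuned weights: layer_0 is left frozen,
# layer_1 receives a small perturbation.
finetuned = {
    "layer_0": pretrained["layer_0"].copy(),
    "layer_1": pretrained["layer_1"] + 0.01 * rng.normal(size=(4, 4)),
}

distances = layer_distances(pretrained, finetuned)

# Random binary mask as a placeholder for a task-specific mask.
mask = (rng.random((4, 4)) > 0.5).astype(np.float64)
sparse, sparsity = apply_mask(pretrained["layer_1"], mask)

for name, dist in distances.items():
    print(f"{name}: mean |delta| = {dist:.5f}")
print(f"fraction of zeroed entries in layer_1: {sparsity:.2f}")
```

In this toy setup a frozen layer has distance zero and a fine-tuned layer has a small nonzero distance, which mirrors the layer-by-layer variation the findings describe. A binary mask over fixed pre-trained weights needs only one bit per parameter per task, which is one way the storage saving could arise.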
Abstract
State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning only introduces small differences from the pre-trained model in the parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
