Efficient Fine-Tuning of Compressed Language Models with Learners
Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J. Clark, Brett H. Meyer, Warren J. Gross

TL;DR
This paper introduces Learner modules and priming techniques for efficient fine-tuning of compressed language models, reducing resource use and training time while maintaining or improving performance.
Contribution
It proposes novel Learner modules and priming methods that exploit model overparameterization to enhance fine-tuning efficiency and effectiveness.
Findings
On DistilBERT, Learners perform on par with or better than the baselines.
Learners fine-tune 20% faster on CoLA.
Learners train 7x fewer parameters than state-of-the-art methods on GLUE.
Abstract
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and…
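To make the idea of "fine-tuning a subset of parameters" concrete, the sketch below counts how many parameters remain trainable when one transformer block is frozen and only a small added module is trained. It is a minimal illustration, not the paper's Learner architecture: the `Linear` stand-in, the down/up projection pair, and `BOTTLENECK = 64` are assumptions made for this example. Only the hidden size (768) and feed-forward size (3072) follow DistilBERT's published configuration.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    """Shape-only stand-in for a dense layer (no weights are allocated)."""
    in_features: int
    out_features: int
    trainable: bool = True

    def num_params(self) -> int:
        # Weight matrix plus bias vector.
        return self.in_features * self.out_features + self.out_features


def count_params(layers):
    """Return (total, trainable) parameter counts for a list of layers."""
    total = sum(layer.num_params() for layer in layers)
    trainable = sum(layer.num_params() for layer in layers if layer.trainable)
    return total, trainable


HIDDEN = 768      # DistilBERT hidden size
FFN = 3072        # DistilBERT feed-forward size
BOTTLENECK = 64   # hypothetical width of the small trainable module

# One frozen transformer block: four attention projections (Q, K, V, output)
# followed by the two feed-forward layers.
backbone = [Linear(HIDDEN, HIDDEN, trainable=False) for _ in range(4)]
backbone += [
    Linear(HIDDEN, FFN, trainable=False),
    Linear(FFN, HIDDEN, trainable=False),
]

# Hypothetical small trainable module: a down projection and an up projection.
# This is an assumption for illustration, not the Learner design from the paper.
small_module = [Linear(HIDDEN, BOTTLENECK), Linear(BOTTLENECK, HIDDEN)]

total, trainable = count_params(backbone + small_module)
fraction = trainable / total
print(f"trainable: {trainable:,} of {total:,} ({fraction:.2%})")
```

Under these assumed sizes, under 2% of the block's parameters would receive gradient updates, which is the kind of saving the abstract refers to; the actual ratio for Learners depends on the design described in the paper.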
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Adam · Residual Connection · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Multi-Head Attention
