Efficient Fine-Tuning of Compressed Language Models with Learners

Danilo Vucetic; Mohammadreza Tayaranian; Maryam Ziaeefard; James J. Clark; Brett H. Meyer; Warren J. Gross

arXiv:2208.02070·cs.CL·August 4, 2022

TL;DR

This paper introduces Learner modules and priming techniques for efficient fine-tuning of compressed language models, reducing resource use and training time while maintaining or improving performance.

Contribution

It proposes novel Learner modules and priming methods that exploit model overparameterization to enhance fine-tuning efficiency and effectiveness.

Findings

1. Learners perform on par with or better than the baselines.

2. Learners fine-tune 20% faster on CoLA.

3. Learners train 7x fewer parameters than state-of-the-art methods on GLUE.

Abstract

Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and…
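
The abstract describes training efficiently by fine-tuning only a subset of parameters. The snippet below is a minimal, generic sketch of that idea alone: a large block of weights stays frozen while a small block is updated. It does not reproduce the Learner architecture or priming, which the abstract does not detail. The names `Param`, `count_params`, `sgd_step`, `backbone.weight`, and `small_module.weight`, as well as the sizes and gradient values, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Param:
    """A named block of weights (hypothetical container for this sketch)."""
    values: List[float]
    trainable: bool  # frozen blocks are skipped by the update step


def count_params(params: Dict[str, Param]) -> Dict[str, int]:
    """Count all weights and the subset that will actually be updated."""
    total = sum(len(p.values) for p in params.values())
    trainable = sum(len(p.values) for p in params.values() if p.trainable)
    return {"total": total, "trainable": trainable}


def sgd_step(params: Dict[str, Param],
             grads: Dict[str, List[float]],
             lr: float = 0.1) -> None:
    """Plain gradient-descent update applied only to trainable blocks."""
    for name, p in params.items():
        if not p.trainable:
            continue  # frozen weights keep their pre-trained values
        p.values = [v - lr * g for v, g in zip(p.values, grads[name])]


# Toy stand-in for a pre-trained model plus a small trainable module.
params = {
    "backbone.weight": Param([0.5] * 12, trainable=False),
    "small_module.weight": Param([0.0] * 3, trainable=True),
}

# Made-up gradients; a real run would obtain these from backpropagation.
grads = {name: [1.0] * len(p.values) for name, p in params.items()}

sgd_step(params, grads, lr=0.1)
counts = count_params(params)
print(counts)
```

In this toy setup only 3 of the 15 weights are ever updated, so the optimizer has less state to store and fewer values to write per step, which is the general kind of saving that parameter-subset fine-tuning targets.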

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Adam · Residual Connection · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Multi-Head Attention