Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
Baohao Liao, Shaomu Tan, Christof Monz

TL;DR
This paper introduces MEFT, a memory-efficient fine-tuning method that makes pre-trained language models reversible by inserting adapters, significantly reducing memory usage while maintaining performance.
Contribution
The paper proposes a novel approach to make PLMs reversible during fine-tuning without additional pre-training, greatly reducing activation memory.
Findings
Reduces activation memory by up to 84% compared to full fine-tuning
Maintains comparable performance on GLUE and QA tasks
Applicable to various backbones like BERT, RoBERTa, BART, and OPT
Abstract
Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, with training only a small number of parameters without sacrificing performance and becoming the de-facto learning paradigm with the increasing size of PLMs. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so the intermediate activations are not necessary to be cached and can be recomputed. Nevertheless, modifying a PLM to its reversible variant is not straightforward, since the reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor for the success of existing PEFT methods, and…
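The abstract's point that a reversible model lets intermediate activations be recomputed rather than cached can be illustrated with a small sketch. The code below is not the MEFT architecture. It is a generic additive-coupling reversible block of the kind used in reversible residual networks, and the sub-functions f and g are hypothetical stand-ins for any sub-layers. It only shows that the block's inputs can be recovered exactly from its outputs, which is why the inputs would not need to be stored for the backward pass.

```python
import math

# Hypothetical sub-functions; in a real network these would be learned layers.
def f(x):
    return [math.tanh(v) for v in x]

def g(x):
    return [0.5 * v for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def sub(a, b):
    return [u - v for u, v in zip(a, b)]

def forward(x1, x2):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def inverse(y1, y2):
    """Recover the inputs from the outputs alone, with no cached activations."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# The reconstruction matches the original inputs up to floating-point error.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_err < 1e-9)
```

In this sketch, f and g never need to be invertible themselves. Invertibility comes from the coupling structure, so during backpropagation each block's inputs could be rebuilt from its outputs, trading extra computation for lower activation memory.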
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Dropout · Adam · Byte Pair Encoding
