Fine-Tuning Language Models with Just Forward Passes
Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

TL;DR
This paper introduces MeZO, a memory-efficient zeroth-order optimizer that fine-tunes large language models using only forward passes, cutting memory usage by up to 12x and GPU hours by up to 2x while achieving performance comparable to backpropagation fine-tuning.
Contribution
We propose MeZO, a novel in-place zeroth-order optimizer that allows large language models to be fine-tuned with minimal memory, overcoming previous theoretical limitations.
Findings
MeZO outperforms in-context learning and linear probing.
MeZO achieves comparable performance to backpropagation fine-tuning.
MeZO reduces memory usage by up to 12x and GPU hours by up to 2x.
Abstract
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and…
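To make the "two forward passes, in-place" idea concrete, here is a minimal NumPy sketch of a zeroth-order SGD step on a toy problem. It is an illustration under stated assumptions, not the authors' implementation: the function names (`zo_sgd_step`, `quadratic_loss`), the hyperparameter values, and the toy quadratic loss are all hypothetical. The sketch assumes the perturbation is regenerated from a random seed instead of being stored, so the only persistent memory is the parameters themselves.

```python
import numpy as np


def zo_sgd_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One zeroth-order SGD step using two forward passes (illustrative sketch).

    params  : list of NumPy arrays, modified in place
    loss_fn : callable taking the list of arrays and returning a scalar loss
    seed    : seed used to regenerate the same random direction z on demand
    """

    def perturb(scale):
        # Re-seeding reproduces the same z each time, so z is never stored
        # for the whole model (only one per-array temporary exists at a time).
        rng = np.random.default_rng(seed)
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1.0)                   # params are now theta + eps * z
    loss_plus = loss_fn(params)     # forward pass 1
    perturb(-2.0)                   # params are now theta - eps * z
    loss_minus = loss_fn(params)    # forward pass 2
    perturb(+1.0)                   # back to theta (up to rounding)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD update along the regenerated direction z.
    rng = np.random.default_rng(seed)
    for p in params:
        p -= lr * projected_grad * rng.standard_normal(p.shape)
    return projected_grad


def quadratic_loss(params):
    """Hypothetical toy objective: sum of squared entries."""
    return float(sum(np.sum(p ** 2) for p in params))


if __name__ == "__main__":
    params = [np.ones(5), np.ones(3)]
    print("initial loss:", quadratic_loss(params))
    for step in range(500):
        zo_sgd_step(params, quadratic_loss, lr=0.02, eps=1e-3, seed=step)
    print("final loss:", quadratic_loss(params))
```

Using a fresh seed per step gives a new random direction each iteration, while re-seeding within a step keeps the perturbation and the update aligned without holding a full copy of the direction in memory.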
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
