Fine-Tuning Language Models with Just Forward Passes

Sadhika Malladi; Tianyu Gao; Eshaan Nichani; Alex Damian; Jason D. Lee; Danqi Chen; Sanjeev Arora

arXiv:2305.17333 · cs.LG · January 12, 2024 · 36 cites

TL;DR

This paper introduces MeZO, a memory-efficient zeroth-order optimizer that fine-tunes large language models using only forward passes. It cuts memory use by up to 12x and GPU hours by up to 2x while achieving performance comparable to fine-tuning with backpropagation.

Contribution

We propose MeZO, an in-place zeroth-order optimizer that fine-tunes large language models with the same memory footprint as inference, even though zeroth-order methods are theorized to be catastrophically slow for optimizing large models.
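
For orientation, the classical two-point zeroth-order estimator that ZO-SGD builds on can be written as below. This is a sketch in notation assumed here rather than taken from this page: $\theta$ denotes the model parameters, $\mathcal{L}$ the loss on a minibatch, $\epsilon$ the perturbation scale, $\eta$ the learning rate, and $z$ a random Gaussian direction.

```latex
\[
\hat{\nabla}\mathcal{L}(\theta)
  = \frac{\mathcal{L}(\theta + \epsilon z) - \mathcal{L}(\theta - \epsilon z)}{2\epsilon}\, z,
\qquad z \sim \mathcal{N}(0, I_d)
\]
\[
\theta \leftarrow \theta - \eta\, \hat{\nabla}\mathcal{L}(\theta)
\]
```

Each estimate needs only two evaluations of $\mathcal{L}$, which is why no backward pass or stored activations are required.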

Findings

01 MeZO outperforms in-context learning and linear probing.

02 MeZO achieves comparable performance to backpropagation fine-tuning.

03 MeZO reduces memory usage by up to 12x and GPU hours by up to 2x.

Abstract

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and…
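
The in-place, two-forward-pass idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `mezo_step`, its arguments, the toy `quadratic_loss`, and all hyperparameter values are assumptions made for illustration. The sketch assumes the random direction is regenerated from a seed each time it is needed rather than stored, so the only persistent state beyond the parameters is a seed and a scalar.

```python
import numpy as np

def mezo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One in-place zeroth-order SGD step (illustrative sketch, hypothetical API)."""
    def perturb(scale):
        # Regenerate the same Gaussian noise from the seed instead of storing it.
        rng = np.random.default_rng(seed)
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1.0)                   # parameters are now theta + eps * z
    loss_plus = loss_fn(params)     # first forward pass
    perturb(-2.0)                   # parameters are now theta - eps * z
    loss_minus = loss_fn(params)    # second forward pass
    perturb(+1.0)                   # restore theta (up to floating-point rounding)

    # Scalar finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # Replay z from the same seed and apply the update in place.
    rng = np.random.default_rng(seed)
    for p in params:
        z = rng.standard_normal(p.shape)
        p -= lr * projected_grad * z
    return projected_grad

# Toy usage: minimize a quadratic; a real use would evaluate an LM loss on a batch.
def quadratic_loss(params):
    return sum(float(np.sum(p ** 2)) for p in params)

params = [np.ones(4), np.ones((2, 3))]
initial_loss = quadratic_loss(params)
for step in range(500):
    mezo_step(params, quadratic_loss, lr=0.02, eps=1e-3, seed=step)
final_loss = quadratic_loss(params)
```

Using a fresh seed per step gives a new random direction each time, while reusing the seed within a step guarantees that the perturbation, its reversal, and the update all see the same direction.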

Code & Models

Datasets

Sherirto/BD4UI
dataset · 35 dl

Videos

Fine-Tuning Language Models with Just Forward Passes · SlidesLive

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques