Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

Zeman Li; Xinwei Zhang; Peilin Zhong; Yuan Deng; Meisam Razaviyayn; Vahab Mirrokni

arXiv:2410.06441·cs.LG·October 10, 2024

TL;DR

Addax is an optimization method that combines zeroth- and first-order gradients for fine-tuning large language models, using less memory than IP-SGD while converging faster and reaching better final performance than MeZO.

Contribution

Addax introduces a hybrid gradient computation approach that improves convergence and reduces memory consumption during language model fine-tuning, with theoretical convergence guarantees.

Findings

01 Addax outperforms MeZO in accuracy and convergence speed.

02 Addax runs 15-30x faster than MeZO on large models.

03 Addax uses similar memory to MeZO but achieves better results.

Abstract

Fine-tuning language models (LMs) with the Adam optimizer often demands excessive memory, limiting accessibility. The "in-place" version of Stochastic Gradient Descent (IP-SGD) and the Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed to address this. However, IP-SGD still requires substantial memory, and MeZO suffers from slow convergence and degraded final performance due to its zeroth-order nature. This paper introduces Addax, a novel method that improves both the memory efficiency and the performance of IP-SGD by integrating it with MeZO. Specifically, Addax computes either a zeroth-order or a first-order gradient for each data point in the minibatch, depending on its memory consumption, and combines these gradient estimates into a single update direction. By computing zeroth-order gradients for data points that require more memory and first-order gradients for others, Addax overcomes the slow convergence of MeZO…
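The splitting idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear least-squares model standing in for a language model, the per-example `cost` array standing in for memory consumption, the `threshold` that separates the two groups, the mixing weight `alpha`, and the two-point random-direction estimator are all assumptions chosen for illustration.

```python
import numpy as np

def loss(w, X, y):
    """Half mean squared error of a linear model (toy stand-in for the LM loss)."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def first_order_grad(w, X, y):
    """Exact gradient of the loss above (the part that needs backpropagation)."""
    return X.T @ (X @ w - y) / len(y)

def zeroth_order_grad(w, X, y, rng, eps=1e-3):
    """Two-point estimate along one random direction z, using only loss values."""
    z = rng.standard_normal(w.shape)
    slope = (loss(w + eps * z, X, y) - loss(w - eps * z, X, y)) / (2 * eps)
    return slope * z

def hybrid_step(w, X, y, cost, threshold, lr, alpha, rng):
    """One update: zeroth-order on costly examples, first-order on the rest.

    `cost`, `threshold`, and `alpha` are hypothetical names for this sketch.
    """
    expensive = cost > threshold
    g = np.zeros_like(w)
    if expensive.any():
        g += alpha * zeroth_order_grad(w, X[expensive], y[expensive], rng)
    if (~expensive).any():
        g += (1 - alpha) * first_order_grad(w, X[~expensive], y[~expensive])
    return w - lr * g

# Toy usage: synthetic regression data with a made-up per-example cost.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
w_true = rng.standard_normal(5)
y = X @ w_true
cost = rng.integers(1, 100, size=64)  # e.g. a proxy such as sequence length

w = np.zeros(5)
initial_loss = loss(w, X, y)
for _ in range(200):
    w = hybrid_step(w, X, y, cost, threshold=70, lr=0.1, alpha=0.05, rng=rng)
final_loss = loss(w, X, y)
```

In this sketch the zeroth-order branch needs only two forward evaluations of the loss, so no gradients or activations are stored for the costly examples; that is the intuition behind sending memory-heavy data points down that path.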

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings