FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed

Sizhe Dang; Yangyang Guo; Yanjun Zhao; Haishan Ye; Xiaodong Zheng; Guang Dai; Ivor Tsang

arXiv:2506.09034·cs.LG·July 1, 2025

TL;DR

FZOO is a zeroth-order optimizer for fine-tuning large language models that avoids the memory cost of backpropagation, needs fewer forward passes than MeZO to converge, and reaches convergence speeds comparable to Adam, making single-GPU fine-tuning practical.

Contribution

Introduces FZOO, a fast zeroth-order optimizer that improves convergence speed and memory efficiency for large language model fine-tuning, with theoretical guarantees and practical integration.

Findings

01. FZOO outperforms MeZO by 3% in accuracy on average.

02. FZOO requires 3 times fewer forward passes than MeZO.

03. FZOO achieves convergence speeds comparable to Adam.

Abstract

Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually require many more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer toward Adam-Scale Speed. FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step sizes based on the standard deviation of batch losses. It also accelerates per-batch computation through the use of…
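
The update the abstract describes (a batch of one-sided loss differences, with the step scaled by the standard deviation of the batch losses) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The names `zo_step` and `quadratic_loss`, the Gaussian perturbations, the hyperparameter values, and the exact way the standard deviation enters the step are all assumptions made for illustration on a toy objective.

```python
import numpy as np


def quadratic_loss(theta):
    """Toy objective standing in for a model's forward-pass loss (hypothetical)."""
    return 0.5 * float(np.dot(theta, theta))


def zo_step(loss_fn, theta, rng, n_perturb=8, eps=1e-3, lr=0.05):
    """One zeroth-order update built from batched one-sided estimates.

    Uses n_perturb + 1 forward passes and no backward pass.
    """
    base = loss_fn(theta)  # single forward pass at the current parameters

    # Assumption: Gaussian perturbation directions, one row per direction.
    u = rng.standard_normal((n_perturb, theta.size))

    # One forward pass per perturbed copy of the parameters.
    losses = np.array([loss_fn(theta + eps * ui) for ui in u])

    # One-sided estimate: weight each direction by its loss change, then average.
    direction = ((losses - base)[:, None] * u).mean(axis=0)

    # Assumed form of the adaptation: divide by the spread of the batch losses,
    # which makes the step behave like a normalized-gradient step.
    sigma = losses.std() + 1e-12
    return theta - lr * direction / sigma
```

A short usage loop, again only as an illustration: start from `theta = np.ones(10)` with `rng = np.random.default_rng(0)` and call `theta = zo_step(quadratic_loss, theta, rng)` repeatedly. Because the loss differences and their standard deviation both scale with `eps`, the resulting step size is largely insensitive to the perturbation scale in this sketch.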

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: OPT · Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings