On Adaptivity in Zeroth-Order Optimization

Hassan Dbouk; Nidham Gazagnadou; Matthias Reisser; Christos Louizos

arXiv:2605.03869·cs.LG·May 6, 2026

TL;DR

This paper examines adaptive zeroth-order (ZO) optimization methods for fine-tuning large language models, shows that they add memory overhead without a convergence advantage over well-tuned ZO-SGD, and proposes MEAZO, a memory-efficient adaptive optimizer with theoretical convergence guarantees and strong empirical performance.

Contribution

The paper shows that existing adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, and introduces MEAZO, a memory-efficient adaptive optimizer with convergence guarantees under standard assumptions.

Findings

01. Adaptive ZO methods like ZO-Adam do not outperform well-tuned ZO-SGD in convergence.

02. MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD.

03. MEAZO shows improved robustness to step size choices in experiments.
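To make the memory argument concrete, here is a minimal sketch of a ZO-SGD variant whose only optimizer state is one scalar, in the spirit of the abstract's description of MEAZO as tracking "only a single scalar for global step size adaptation". It is not the MEAZO update rule, which this page does not give. The function names, the two-point estimator, the exponential moving average, the bias correction, and all hyperparameters (`lr`, `beta`, `eps`, `delta`) are assumptions chosen for illustration.

```python
import numpy as np

def zo_directional_derivative(f, x, u, eps=1e-3):
    """Two-point finite-difference estimate of the derivative of f along u."""
    return (f(x + eps * u) - f(x - eps * u)) / (2.0 * eps)

def scalar_adaptive_zo_sgd(f, x0, steps=2000, lr=0.05, beta=0.99,
                           delta=1e-8, seed=0):
    """Hypothetical ZO-SGD with a single-scalar adaptive step size.

    Illustrative assumption only, NOT the paper's MEAZO algorithm.
    Optimizer state: one scalar `v`, a running average of the squared
    directional derivative.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)   # copy so the caller's array is untouched
    v = 0.0
    for t in range(1, steps + 1):
        u = rng.standard_normal(x.shape)          # random search direction
        g = zo_directional_derivative(f, x, u)    # scalar, not a d-vector
        v = beta * v + (1.0 - beta) * g * g       # update the scalar state
        v_hat = v / (1.0 - beta ** t)             # Adam-style bias correction
        x -= lr * g / (np.sqrt(v_hat) + delta) * u  # globally rescaled step
    return x, v

# Usage on a toy quadratic (an assumed test problem).
f = lambda x: 0.5 * float(x @ x)
x0 = np.full(20, 5.0)
x_final, v_final = scalar_adaptive_zo_sgd(f, x0)
print("initial loss:", f(x0), "final loss:", f(x_final))
```

For comparison, Adam-style optimizers keep two buffers the size of the parameter vector (first and second moment estimates), while the loop above keeps one float. The sketch materializes the direction `u` for readability; memory-conscious ZO implementations commonly regenerate it from a stored random seed instead.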

Abstract

We investigate the effectiveness of adaptive zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models (LLMs). Contrary to prior claims, we show that adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead. Our analysis reveals that in high dimensions, ZO gradients lack coordinate-wise heterogeneity, rendering adaptive mechanisms memory inefficient. Leveraging this insight, we propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. We support our method with theoretical convergence guarantees under standard assumptions. Experiments across multiple LLM families and tasks demonstrate that MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD. Additional experiments on synthetic quadratic problems…
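One way to build intuition for the claim that ZO gradients lack coordinate-wise heterogeneity is the following numerical check. It is a sketch under stated assumptions, not necessarily the argument used in the paper. It assumes the idealized estimator g = (u · ∇f) u with u ~ N(0, I), which is the small-smoothing limit of a two-point finite-difference estimate. For that estimator, Gaussian moment identities give E[g_i²] = ‖∇f‖² + 2(∇f_i)², so the shared term ‖∇f‖² dominates every coordinate when the dimension is large. The gradient vector, the dimension, and the helper names below are made up for the demonstration.

```python
import numpy as np

def zo_second_moments(grad, num_samples=4000, seed=0):
    """Monte Carlo estimate of E[g_i^2] for g = (u . grad) u, u ~ N(0, I).

    Assumes the idealized (smoothing -> 0) ZO estimator; illustrative only.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((num_samples, grad.shape[0]))  # one direction per row
    proj = U @ grad                                        # u . grad per sample
    G = proj[:, None] * U                                  # ZO estimates, row-wise
    return (G ** 2).mean(axis=0)

def spread(values):
    """Ratio of largest to smallest entry: a crude heterogeneity measure."""
    return values.max() / values.min()

d = 500
grad = np.logspace(-2, 0, d)                  # assumed heterogeneous true gradient
true_sq = grad ** 2                           # what first-order Adam would track
closed_form = grad @ grad + 2.0 * grad ** 2   # E[g_i^2] for the ZO estimator
zo_sq = zo_second_moments(grad)

print("spread of true squared gradient:", spread(true_sq))
print("spread of closed-form ZO second moments:", spread(closed_form))
print("spread of Monte Carlo ZO second moments:", spread(zo_sq))
```

The squared entries of the true gradient span several orders of magnitude, while the per-coordinate second moments of the ZO estimate are nearly uniform. A per-coordinate second-moment scaling, as in Adam, would therefore act almost like a single global rescaling in this setting.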
