On Adaptivity in Zeroth-Order Optimization
Hassan Dbouk, Nidham Gazagnadou, Matthias Reisser, Christos Louizos

TL;DR
This paper critically examines adaptive zeroth-order (ZO) optimization methods for fine-tuning large language models, shows that they offer no convergence advantage over well-tuned ZO-SGD despite their memory overhead, and proposes MEAZO, a memory-efficient adaptive optimizer with theoretical convergence guarantees and strong empirical performance.
Contribution
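The paper shows that existing adaptive ZO methods such as ZO-Adam offer no convergence benefit over well-tuned ZO-SGD, and introduces MEAZO, a memory-efficient adaptive optimizer that tracks only a single scalar for global step size adaptation and comes with convergence guarantees under standard assumptions.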
Findings
Adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD.
MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD.
MEAZO shows improved robustness to step size choices in experiments.
Abstract
We investigate the effectiveness of adaptive zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models (LLMs). Contrary to prior claims, we show that adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead. Our analysis reveals that in high dimensions, ZO gradients lack coordinate-wise heterogeneity, rendering adaptive mechanisms memory inefficient. Leveraging this insight, we propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. We support our method with theoretical convergence guarantees under standard assumptions. Experiments across multiple LLM families and tasks demonstrate that MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD. Additional experiments on synthetic quadratic problems…
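To make the memory argument in the abstract concrete, here is a minimal sketch contrasting a plain ZO-SGD step with a step whose only optimizer state is a single scalar. It is an illustration under stated assumptions, not the paper's algorithm: the summary above does not give MEAZO's update rule, so `scalar_adaptive_step`, its running average `v`, the two-point estimator, the toy quadratic, and all hyperparameter values are hypothetical choices made for this example.

```python
import numpy as np


def zo_directional_derivative(loss, x, z, eps=1e-3):
    """Two-point estimate of the directional derivative of `loss` at x along z."""
    return (loss(x + eps * z) - loss(x - eps * z)) / (2.0 * eps)


def zo_sgd_step(loss, x, lr, rng):
    """Plain ZO-SGD: move along the random direction z, scaled by the estimate."""
    z = rng.standard_normal(x.shape)
    c = zo_directional_derivative(loss, x, z)
    return x - lr * c * z


def scalar_adaptive_step(loss, x, v, lr, rng, beta=0.99, delta=1e-8):
    """Hypothetical scalar-adaptive step (an assumption, not the paper's rule).

    The only optimizer state is the float `v`, a running average of c**2,
    in contrast to the per-coordinate moment buffers an Adam-style method keeps.
    """
    z = rng.standard_normal(x.shape)
    c = zo_directional_derivative(loss, x, z)
    # Initialize v from the first estimate, then update it as a moving average.
    v = c * c if v is None else beta * v + (1.0 - beta) * c * c
    return x - lr * (c / (np.sqrt(v) + delta)) * z, v


def run_demo(steps=1000, d=20, lr=0.01, seed=0):
    """Run both optimizers on a toy diagonal quadratic and return the losses."""
    a = np.linspace(0.5, 2.0, d)  # made-up curvature values for the toy problem
    loss = lambda x: 0.5 * float(np.sum(a * x * x))
    rng = np.random.default_rng(seed)
    x_sgd = np.ones(d)
    x_ada = np.ones(d)
    v = None
    for _ in range(steps):
        x_sgd = zo_sgd_step(loss, x_sgd, lr, rng)
        x_ada, v = scalar_adaptive_step(loss, x_ada, v, lr, rng)
    # (initial loss, final ZO-SGD loss, final scalar-adaptive loss)
    return loss(np.ones(d)), loss(x_sgd), loss(x_ada)
```

Calling `run_demo()` returns the initial loss followed by the final loss of each optimizer on the toy problem. The point of the sketch is the state size: `zo_sgd_step` carries nothing between iterations, and `scalar_adaptive_step` carries one float regardless of the dimension `d`.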
