Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization
Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, Sijia Liu

TL;DR
This paper introduces ZO-Muon, a zeroth-order optimization method that combines subspace projection with spectral gradient orthogonalization to improve the efficiency and accuracy of large-scale model fine-tuning.
Contribution
The paper proposes a unified framework of subspace gradient orthogonalization and introduces ZO-Muon, a new method that enhances zeroth-order optimization for large models.
Findings
ZO-Muon accelerates convergence on LLMs and ViTs.
ZO-Muon reduces query complexity by over 75% compared to MeZO.
ZO-Muon significantly improves accuracy on fine-tuning tasks.
Abstract
Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new…
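The two principles named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' ZO-Muon algorithm: the function names (`zo_subspace_grad`, `orthogonalize`), the QR-based random projection, the two-point estimator, and every hyperparameter here are assumptions chosen for illustration. Muon-style optimizers typically approximate the orthogonalization with a Newton-Schulz iteration; the sketch swaps in an exact SVD for clarity.

```python
import numpy as np

def zo_subspace_grad(loss, W, rank=4, n_queries=4, mu=1e-3, rng=None):
    """Two-point ZO gradient estimate restricted to a random low-rank subspace.

    Hypothetical helper: returns the basis P (m x rank) and the estimated
    low-dimensional gradient G_low (rank x n), so that P @ G_low approximates
    the projection of the true gradient onto the subspace.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = W.shape
    # Assumption: the subspace is spanned by a random orthonormal basis from QR.
    P, _ = np.linalg.qr(rng.standard_normal((m, rank)))
    G_low = np.zeros((rank, n))
    for _ in range(n_queries):
        U = rng.standard_normal((rank, n))   # perturbation in subspace coordinates
        delta = P @ U                        # lifted to the full parameter shape
        # Central finite difference: two function evaluations per perturbation.
        d = (loss(W + mu * delta) - loss(W - mu * delta)) / (2.0 * mu)
        G_low += d * U
    return P, G_low / n_queries

def orthogonalize(G):
    """Map G to U @ V^T from its thin SVD, i.e. set all singular values to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def zo_step(loss, W, lr=1e-2, **kwargs):
    """One illustrative update: estimate in the subspace, orthogonalize, lift back."""
    P, G_low = zo_subspace_grad(loss, W, **kwargs)
    return W - lr * (P @ orthogonalize(G_low))
```

Estimating the gradient in a `rank x n` coordinate system rather than the full `m x n` space is what reduces the number of random directions the estimator has to cover. The orthogonalization step then discards the noisy singular values and keeps only the singular directions of the estimate.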
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Machine Learning and Data Classification
