Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization

Yicheng Lang; Changsheng Wang; Yihua Zhang; Mingyi Hong; Zheng Zhang; Wotao Yin; Sijia Liu

arXiv:2602.17155·cs.LG·February 24, 2026

Open Access

TL;DR

This paper introduces ZO-Muon, a zeroth-order optimization method that combines subspace projection with spectral gradient orthogonalization to improve the query efficiency and accuracy of large-scale model fine-tuning.

Contribution

The paper proposes a unified framework of subspace gradient orthogonalization and introduces ZO-Muon, a new method that enhances zeroth-order optimization for large models.

Findings

01. ZO-Muon accelerates convergence on LLMs and ViTs.

02. ZO-Muon reduces query complexity by over 75% compared to MeZO.

03. ZO-Muon improves accuracy significantly in fine-tuning tasks.

Abstract

Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new…
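To make the two principles in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a single weight matrix `W`, a fixed projection `P` with orthonormal columns, and a two-point finite-difference estimator averaged over a few queries. The function names (`zo_subspace_grad`, `orthogonalize`, `step`) and all hyperparameters are hypothetical, and an exact SVD stands in for the Newton-Schulz iteration that Muon typically uses.

```python
import numpy as np

def zo_subspace_grad(loss, W, P, mu=1e-3, n_queries=4, rng=None):
    """Two-point ZO estimate of P^T (dloss/dW), using perturbations in span(P).

    W: (m, n) weights; P: (m, r) with orthonormal columns (assumed given).
    Each query costs two loss evaluations and no backpropagation.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    r, n = P.shape[1], W.shape[1]
    G = np.zeros((r, n))
    for _ in range(n_queries):
        Z = rng.standard_normal((r, n))   # random direction in the subspace
        U = P @ Z                         # lifted to the full (m, n) space
        d = (loss(W + mu * U) - loss(W - mu * U)) / (2.0 * mu)
        G += d * Z                        # directional derivative times direction
    return G / n_queries

def orthogonalize(G):
    """Replace all singular values of G by 1 (exact SVD, an illustrative
    substitute for an iterative Newton-Schulz orthogonalization)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def step(loss, W, P, lr=1e-2, **kw):
    """One hypothetical update: estimate in the subspace, orthogonalize, lift back."""
    G = zo_subspace_grad(loss, W, P, **kw)
    return W - lr * (P @ orthogonalize(G))
```

A usage example on a toy quadratic, with `P` drawn from a QR factorization purely for illustration:

```python
rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # (8, 3), orthonormal columns
T = rng.standard_normal((8, 6))
loss = lambda W: 0.5 * np.sum((W - T) ** 2)
W = step(loss, np.zeros((8, 6)), P, rng=rng)
```

Because the perturbation has only `r * n` free entries instead of `m * n`, the estimator's variance depends on the subspace dimension rather than the full parameter count, which is the intuition behind the subspace view described above.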

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Machine Learning and Data Classification