TL;DR
This paper introduces a partial orthogonalization technique using power iteration to accelerate zeroth-order spectral optimization, significantly improving convergence speed in large language model fine-tuning.
Contribution
It proposes replacing the Newton-Schulz orthogonalization with a streaming power-iteration method for better efficiency and robustness in noisy zeroth-order optimization.
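To make the idea of partial orthogonalization concrete, here is a minimal NumPy sketch. It is not the paper's algorithm. It assumes that "partial" means keeping only the top-k singular directions of a gradient estimate G and setting their singular values to 1, and that "streaming" can be approximated by warm-starting the iteration from the previous step's basis. The function name `partial_orthogonalize` and the parameters `k`, `iters`, and `V0` are hypothetical.

```python
import numpy as np

def partial_orthogonalize(G, k=2, iters=10, V0=None, seed=0):
    """Approximate U_k @ V_k.T for the top-k singular directions of G.

    Uses block power (subspace) iteration instead of a full SVD.
    Assumption: this is an illustrative stand-in, not the paper's method.
    """
    n = G.shape[1]
    if V0 is None:
        # Random orthonormal start; a warm start can be passed via V0.
        rng = np.random.default_rng(seed)
        V0 = rng.standard_normal((n, k))
    V, _ = np.linalg.qr(V0)
    for _ in range(iters):
        U, _ = np.linalg.qr(G @ V)    # refine the left subspace
        V, _ = np.linalg.qr(G.T @ U)  # refine the right subspace
    # U and V span the top-k subspaces but are not yet paired as
    # singular vectors; a small k x k SVD aligns them.
    A, _, Ct = np.linalg.svd(U.T @ G @ V)
    U_k, V_k = U @ A, V @ Ct.T
    return U_k @ V_k.T, V_k
```

The returned basis `V_k` can be fed back as `V0` on the next optimizer step, so that a few iterations per step may suffice when consecutive gradient estimates share their dominant directions. Directions outside the top-k subspace are dropped rather than amplified, which is one plausible way to avoid boosting noise.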
Findings
Achieves 1.5x to 4x faster convergence than ZO-Muon.
Reaches competitive final accuracies with less training time.
Demonstrates effectiveness across multiple large language models.
Abstract
Zeroth-order (ZO) optimization has become increasingly popular for fine-tuning large language models (LLMs), especially on edge devices, because it can adapt a model to local data without memory-intensive back-propagation. Recent works try to reduce ZO variance through low-dimensional subspace search, but subspace restriction alone leaves key optimization geometry under-exploited, motivating additional acceleration. In this work, we focus on the hidden-layer training problem, in which spectral optimizers like Muon outperform AdamW because they exploit weak spectral directions through orthogonalization. However, we find that, unlike in the first-order setting, full orthogonalization works poorly in the ZO setting because the gradient estimates are highly noisy and unreliable. To address this issue, we propose applying partial spectral…
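For readers unfamiliar with the setting, back-propagation-free training of this kind is commonly built on a two-point finite-difference gradient estimate. The sketch below shows that standard estimator in NumPy. It assumes a Gaussian perturbation and a central difference, which may differ from the estimator used in the paper; the function name `zo_gradient` and the parameter `eps` are hypothetical.

```python
import numpy as np

def zo_gradient(loss, theta, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate along a random direction.

    Needs only two loss evaluations (forward passes), no back-propagation.
    Assumption: Gaussian direction z and central difference.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(theta.shape)
    # Directional derivative of the loss along z, by central difference.
    slope = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    # The estimate is always a multiple of z, so a single sample is noisy.
    return slope * z, z
```

Because each sample is a scaled copy of one random direction, the estimate carries high variance, which is the noise that the abstract says makes full orthogonalization unreliable.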
