Subspace Optimization for Large Language Models with Convergence Guarantees
Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan

TL;DR
This paper analyzes the convergence properties of subspace optimization algorithms like GaLore for large language models, identifies their limitations, and introduces a new variant, GoLore, with proven convergence guarantees in stochastic settings.
Contribution
We reveal that GaLore does not always converge and propose GoLore, a new algorithm with proven convergence guarantees for stochastic large language model training.
Findings
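GaLore does not always converge to the optimal solution; the paper gives an explicit counterexample.
GaLore does converge when either the mini-batch size is sufficiently large or the gradient noise is isotropic.
GoLore provably converges in typical stochastic settings, even with standard batch sizes.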
Abstract
Subspace optimization algorithms, such as GaLore (Zhao et al., 2024), have gained attention for pre-training and fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we reveal that GaLore does not always converge to the optimal solution and provide an explicit counterexample to support this finding. We further explore the conditions under which GaLore achieves convergence, showing that it does so when either (i) a sufficiently large mini-batch size is used or (ii) the gradient noise is isotropic. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in typical stochastic settings, even with standard batch sizes. Our convergence analysis extends naturally to other subspace optimization…
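The contrast between the two kinds of projection can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: it assumes the GaLore-style projector is built from the top-r left singular vectors of the current stochastic gradient, and that a random projector can be drawn as the orthonormal factor of a Gaussian matrix. The helper names (`galore_projector`, `random_projector`, `project`, `project_back`) are hypothetical, and the paper's exact sampling scheme and update rule are not stated in this abstract.

```python
import numpy as np


def galore_projector(grad, rank):
    """Assumed GaLore-style projector: top-`rank` left singular vectors of grad."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def random_projector(m, rank, rng):
    """Assumed random projector: m x rank matrix with orthonormal columns.

    Taken as the Q factor of a Gaussian matrix; the sign fix makes the
    draw independent of the QR routine's sign convention.
    """
    q, r = np.linalg.qr(rng.standard_normal((m, rank)))
    return q * np.sign(np.diag(r))


def project(grad, p):
    """Compress an m x n gradient to rank x n (the part kept in memory)."""
    return p.T @ grad


def project_back(low_rank, p):
    """Map a rank x n update back to the full m x n parameter space."""
    return p @ low_rank


rng = np.random.default_rng(0)
m, n, rank = 8, 6, 2
grad = rng.standard_normal((m, n))  # stand-in for a noisy stochastic gradient

p_svd = galore_projector(grad, rank)
p_rand = random_projector(m, rank, rng)

low_svd = project(grad, p_svd)
low_rand = project(grad, p_rand)
update = project_back(low_rand, p_rand)  # full-size update from the random subspace
```

The SVD-based projector follows whatever directions dominate the sampled gradient, so when the gradient is mostly noise the chosen subspace tracks the noise. The random projector is drawn independently of the sampled gradient, which is the property the name "Gradient random Low-rank projection" points to.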
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
