Unbiased Gradient Low-Rank Projection
Rui Pan, Yang Luo, Yuxing Liu, Yang You, Tong Zhang

TL;DR
This paper introduces GUM, an unbiased low-rank optimization method for training large language models. GUM retains the convergence guarantees of its base optimizer, Muon, and is reported to outperform both existing low-rank methods and full-parameter training.
Contribution
It uses layerwise sampling to debias low-rank projections, combining GaLore's projection mechanism with the Muon algorithm so that the resulting method is unbiased and provably convergent.
Findings
GUM matches Muon's convergence guarantees
GUM outperforms GaLore in LLM fine-tuning
GUM achieves better performance than full-parameter training
Abstract
Memory-efficient optimization is critical for training increasingly large language models (LLMs). A popular strategy involves gradient low-rank projection, storing only the projected optimizer states, with GaLore being a representative example. However, a significant drawback of many such methods is their lack of convergence guarantees, as various low-rank projection approaches introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. Aiming to tackle this problem, this paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. In particular, an instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore's mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). We theoretically prove our…
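The toy sketch below illustrates how random sampling can remove the bias of a low-rank projection. It is an illustration under stated assumptions, not the paper's algorithm. The helper names (`project`, `full_rank_branch`, `low_rank_branch`, `sample_update`) are hypothetical, the `1/gamma` and `1/(1 - gamma)` scalings are an assumed choice that makes the estimator unbiased, and the Muon step is omitted. The name `gamma` follows the sampling probability mentioned in the reviews. With probability `gamma` a layer receives the rescaled residual that the projection discards. Otherwise it receives the rescaled projected gradient. The two branches therefore average back to the full gradient.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scale(a, c):
    return [[c * x for x in row] for row in a]


def combine(a, b, ca, cb):
    """Return ca * a + cb * b elementwise."""
    return [[ca * x + cb * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def project(grad, proj):
    """Biased low-rank gradient P P^T G (proj has orthonormal columns)."""
    return matmul(proj, matmul(transpose(proj), grad))


def full_rank_branch(grad, proj, gamma):
    # Assumed scaling: the residual G - P P^T G, divided by gamma.
    return combine(grad, project(grad, proj), 1.0 / gamma, -1.0 / gamma)


def low_rank_branch(grad, proj, gamma):
    # Assumed scaling: the projected gradient, divided by (1 - gamma).
    return scale(project(grad, proj), 1.0 / (1.0 - gamma))


def sample_update(grad, proj, gamma, rng):
    """Pick the full-rank branch with probability gamma, else the low-rank one."""
    if rng.random() < gamma:
        return full_rank_branch(grad, proj, gamma)
    return low_rank_branch(grad, proj, gamma)


# Toy 3x2 gradient and a rank-1 projector onto the first coordinate.
GRAD = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
PROJ = [[1.0], [0.0], [0.0]]
GAMMA = 0.25

# Exact expectation over the two branches: gamma * full + (1 - gamma) * low.
EXPECTED = combine(full_rank_branch(GRAD, PROJ, GAMMA),
                   low_rank_branch(GRAD, PROJ, GAMMA),
                   GAMMA, 1.0 - GAMMA)

# One random draw, seeded for reproducibility.
ONE_DRAW = sample_update(GRAD, PROJ, GAMMA, random.Random(0))
```

Here `EXPECTED` reproduces `GRAD` up to floating-point rounding, whereas `project(GRAD, PROJ)` on its own zeroes the last two rows. The price of unbiasedness in this sketch is variance: a small `gamma` inflates the full-rank branch by `1/gamma`.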
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper highlights an important limitation of GaLore-style low-rank PEFT methods: most low-rank optimization methods introduce biased gradient estimates during training. It also rightly highlights the challenges in analyzing the convergence properties of such methods.
2. The theoretical contribution, in particular the convergence analysis of GUM, is important. The motivating example showing why GaLore might fail in an extremely noisy setting highlights an imp
1. Some ablations on the probability of full-rank updates, the sampling period, or the rank of low-rank updates could be insightful.
2. Inherent issues with Muon as an optimizer have not been fixed or addressed:
i) for large hidden layers, the computational overhead of Newton-Schulz updates would be significant;
ii) Muon has only been studied for dense hidden linear layers, and stability and efficiency will degrade in a sparse training regime;
iii) although Muon improves conditioning without second m
1. Theoretical Guarantees: The paper proves that GUM is an unbiased estimator and, as a result, matches the convergence guarantees of its base optimizer (Muon). This directly addresses a major theoretical weakness in GaLore. The synthetic experiment in Figure 1, where GaLore fails to converge but GUM succeeds, provides a stark practical example of this theoretical advantage.
2. Analysis: The paper provides a plausible explanation for GUM's success, linking its high-rank updates to a higher stable rank
1. Limited pre-training scale. The pre-training results, while promising, are on relatively small models (up to 350M). Demonstrating the performance win over full-rank AdamW on larger 8B-scale models would make the claims more conclusive.
2. The method introduces new hyperparameters, namely the sampling probability $\gamma$ (or number of full-rank layers) and the sampling period $K$. The authors rightly note in their limitations that this sampling can introduce high variance, which lea
The strengths of this paper can be summarized as follows:
1. Memory efficiency for LLMs is a very important and popular area of research. Biased gradient updates in low-rank optimizers are a well-known issue, and addressing them is of interest to the machine learning community.
2. The authors give a rigorous proof of unbiasedness and convergence. As far as I can see, all the proofs look correct.
3. The authors have conducted experiments on both fine-tuning and pre-training on mu
The weaknesses of this paper are summarized as follows:
1. This paper does not have an ablation study. It would be better to test the effect of changing the sampling probability and the projection rank.
2. The model pre-training mainly focuses on small models, such as LLaMA-60M, LLaMA-130M, and LLaMA-350M.
3. The improvements range from 0.3% to 1.1%, which is relatively small. The authors may consider showing whether or not these margins are consistent across longer training or larger
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Sparse and Compressive Sensing Techniques
