Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

TL;DR
This paper introduces LoRA-Pre, a low-rank optimizer that reduces optimizer-state memory in large language model training by recasting the EMA momentum update as the online training of a linear regressor and decomposing the momentum matrix into a compact low-rank form, while maintaining optimization performance.
Contribution
The paper proposes LoRA-Pre, a novel low-rank optimizer that efficiently approximates optimizer states, enabling scalable training of large models with reduced memory usage.
Findings
LoRA-Pre achieves state-of-the-art performance across model sizes, evaluated by pre-training Llama-family models from 60M to 1B parameters.
It maintains high optimization quality with only 1/8 of the rank used by baseline methods.
LoRA-Pre outperforms standard fine-tuning baselines in multiple scenarios.
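To give a sense of why rank matters for memory, the following sketch counts the entries stored for one momentum buffer of a single weight matrix. It is an illustration only: the layer shape, the baseline rank of 256, and the two-factor layout (a rows x rank factor and a rank x cols factor) are assumptions made here, not figures or an accounting taken from the paper.

```python
# Hypothetical entry counts for one momentum buffer of one weight matrix.
# All shapes and ranks below are illustrative assumptions, not paper values.

def full_state_entries(rows, cols):
    """Entries in a dense momentum matrix with the same shape as the weight."""
    return rows * cols


def low_rank_state_entries(rows, cols, rank):
    """Entries in two factors, one (rows x rank) and one (rank x cols)."""
    return rank * (rows + cols)


rows, cols = 4096, 4096          # hypothetical layer shape
baseline_rank = 256              # hypothetical rank for a baseline method
reduced_rank = baseline_rank // 8  # the "1/8 of the rank" setting

full = full_state_entries(rows, cols)
low_baseline = low_rank_state_entries(rows, cols, baseline_rank)
low_reduced = low_rank_state_entries(rows, cols, reduced_rank)

print(f"dense momentum:     {full:>12,d} entries")
print(f"rank {baseline_rank:>3d} factors:   {low_baseline:>12,d} entries")
print(f"rank {reduced_rank:>3d} factors:   {low_reduced:>12,d} entries")
```

Under these assumptions the factor storage scales linearly with the rank, so using 1/8 of the rank stores 1/8 of the factor entries.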
Abstract
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre…
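The EMA-as-regression view in the abstract can be checked numerically in its simplest discrete form. The sketch below is a simplified reading, not the paper's derivation: it assumes the regressor is just the momentum vector itself, the loss is 0.5 * ||m - g||^2, and the gradient flow is replaced by a single gradient step with step size 1 - beta. The function names and the toy gradients are hypothetical. The low-rank factor updates that LoRA-Pre builds on this view are not reproduced here.

```python
# Minimal check: an EMA update equals one gradient step on a squared loss.
# Assumption (not from the paper): loss(m) = 0.5 * ||m - g||^2, step size 1 - beta.

def ema_step(m, g, beta):
    """Standard EMA momentum update: m <- beta * m + (1 - beta) * g."""
    return [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]


def regression_step(m, g, lr):
    """One gradient step on 0.5 * ||m - g||^2, whose gradient in m is (m - g)."""
    return [mi - lr * (mi - gi) for mi, gi in zip(m, g)]


beta = 0.9
# Toy stream of "gradients"; the values are arbitrary.
grads = [[1.0, -2.0, 0.5], [0.3, 0.1, -0.7], [-1.2, 0.8, 0.0]]

m_ema = [0.0, 0.0, 0.0]
m_reg = [0.0, 0.0, 0.0]
for g in grads:
    m_ema = ema_step(m_ema, g, beta)
    m_reg = regression_step(m_reg, g, lr=1.0 - beta)

# The two trajectories agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(m_ema, m_reg))
print("largest difference between the two updates:", max_gap)
```

The identity behind the check is m - (1 - beta)(m - g) = beta * m + (1 - beta) * g, so the two updates are algebraically the same.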
Peer Reviews
Decision: ICLR 2026 Oral
__1) Theoretical Insight:__ The paper establishes a non-trivial mathematical equivalence between EMA momentum updates and online linear regression (Section 3.2), which is both elegant and enables a new way to think about optimizer state compression beyond ad-hoc engineering. __2) Methodological Contribution:__ Closed-form Newton-based update rules are derived for the low-rank factors, with careful exposition (Section 3.3, Theorem 3.1, Appendix A). This enables stable and efficient momentum upda
__1) Incomplete related work coverage.__ The literature review appears incomplete. The paper omits several relevant lines of work where optimizer momenta or gradient statistics are decomposed or constrained in low-rank form. For example, MLorc [1] and MoFaSGD [2] address momentum/gradient factorization. Within PEFT, papers like LoRA-Pro [3] and LoFT [4] are highly relevant (the latter yielding updates that resemble those proposed here). Another closely related direction is Riemannian low-rank pr
1. The relation between EMA and online linear regression is novel and conceptually clean, providing a unified foundation for memory-efficient optimizer design. 2. The method applies seamlessly to multiple modern optimizers (Adam, Muon), offering a general solution for optimizer-state compression in large-scale model training. 3. LoRA-Pre consistently outperforms both projection-based and fine-tuning low-rank optimizers across different scales and tasks. Rank-efficiency studies reinforce the robu
1. Missing Key Baseline in main table: Although Fira [1] is mentioned in the related work and ablation (Table 3), it is missing from the main results table (Table 1). This omission is non-trivial, since Fira operates under **exactly the same setting** for pre-training and it also deals with the optimizer state itself under a low-rank constraint, making it a direct and essential baseline. The paper should include a comparison with Fira-Adam in the main results. Even if LoRA-Pre does not necessari
The paper introduces a clear and technically sound rethinking of momentum optimization by framing EMA updates as an online regression problem and deriving a low-rank closed-form solution. This perspective is both conceptually original and practically valuable, bridging optimization theory with efficient model training. The proposed LoRA-Pre method is well-integrated with existing optimizers like Adam and Muon, achieving strong empirical results with lower memory costs. The experiments are compre
The empirical evaluation primarily compares with projection-based low-rank optimizers; incorporating a wider range of modern baselines (e.g., Sophia, Shampoo) would provide a more complete assessment of practical advantages. The ablation studies emphasize rank efficiency but do not clearly disentangle the effect of the low-rank momentum representation from other implementation factors such as learning-rate scaling or normalization. The scalability and computational trade-offs of the proposed u
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Tensor decomposition and applications
