Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Zhengbo Wang; Jian Liang; Ran He; Zilei Wang; Tieniu Tan

arXiv:2602.24283 · cs.LG · March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces LoRA-Pre, a low-rank optimizer that reduces optimizer-state memory when training large language models. It recasts the momentum EMA as an online linear regressor and factorizes the momentum matrix into low-rank factors, while maintaining optimization performance.

Contribution

The paper proposes LoRA-Pre, a novel low-rank optimizer that efficiently approximates optimizer states, enabling scalable training of large models with reduced memory usage.
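
As a rough illustration of where the memory saving comes from, the sketch below counts the values stored for one momentum matrix in dense form versus as two rank-r factors. The layer shape and rank are hypothetical values chosen for the arithmetic, not figures from the paper, and the function names are made up for this example.

```python
def full_state_size(rows: int, cols: int) -> int:
    """Entries in one dense momentum matrix of shape (rows, cols)."""
    return rows * cols


def low_rank_state_size(rows: int, cols: int, rank: int) -> int:
    """Entries in two factors of shape (rows, rank) and (rank, cols)."""
    return rank * (rows + cols)


# Hypothetical layer shape and rank, chosen only for illustration.
rows, cols, rank = 4096, 4096, 128
full = full_state_size(rows, cols)
low = low_rank_state_size(rows, cols, rank)
ratio = full / low
print(full, low, ratio)
```

Adam keeps two such states per weight matrix, so the dense cost doubles. How LoRA-Pre stores the second moment is not specified in this summary.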

Findings

01 · LoRA-Pre achieves state-of-the-art performance across the tested model sizes (60M to 1B parameters).

02 · It maintains high optimization quality with only 1/8 of the rank used by baseline methods.

03 · LoRA-Pre outperforms standard fine-tuning baselines in multiple scenarios.

Abstract

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre…
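
The equivalence the abstract refers to can be checked numerically in its simplest form. The sketch below assumes a squared-error objective 0.5 * (m - g)^2 on the current gradient g and a step size of 1 - beta. The paper's exact objective is not given in this summary, so treat this as the standard identity rather than the authors' formulation. The function names are illustrative.

```python
def ema_update(m: float, g: float, beta: float) -> float:
    """Standard exponential moving average of gradients."""
    return beta * m + (1.0 - beta) * g


def regression_step(m: float, g: float, lr: float) -> float:
    """One gradient step on 0.5 * (m - g)**2, whose gradient in m is (m - g)."""
    return m - lr * (m - g)


beta = 0.9
grads = [0.5, -1.0, 2.0, 0.25]  # made-up scalar gradients

m_ema = 0.0
m_reg = 0.0
for g in grads:
    m_ema = ema_update(m_ema, g, beta)
    # Step size 1 - beta makes the regression step coincide with the EMA.
    m_reg = regression_step(m_reg, g, lr=1.0 - beta)

print(abs(m_ema - m_reg) < 1e-12)
```

Expanding the gradient step gives m - (1 - beta) * (m - g) = beta * m + (1 - beta) * g, which is why the two trajectories agree up to floating-point rounding.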

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 3

Strengths

__1) Theoretical Insight:__ The paper establishes a non-trivial mathematical equivalence between EMA momentum updates and online linear regression (Section 3.2), which is both elegant and enables a new way to think about optimizer state compression beyond ad-hoc engineering.

__2) Methodological Contribution:__ Closed-form Newton-based update rules are derived for the low-rank factors, with careful exposition (Section 3.3, Theorem 3.1, Appendix A). This enables stable and efficient momentum upda
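
To make the idea of Newton-based updates on low-rank factors concrete, here is a minimal NumPy sketch. It assumes the momentum is stored as B @ A, that each factor takes one damped Newton step on 0.5 * ||B @ A - G||_F^2 with the other factor held fixed, and that the damping equals 1 - beta. This is a generic construction and is not claimed to match Theorem 3.1 of the paper; `fit_loss` and `lowrank_momentum_step` are hypothetical names.

```python
import numpy as np


def fit_loss(B, A, G):
    """Squared Frobenius error between the low-rank product and G."""
    return 0.5 * float(np.linalg.norm(B @ A - G) ** 2)


def lowrank_momentum_step(B, A, G, beta=0.9):
    """One damped Newton step per factor (an assumed scheme, see lead-in)."""
    gamma = 1.0 - beta
    # With A fixed, the Hessian in B is A @ A.T, so the Newton target is
    # the least-squares solution G @ A.T @ inv(A @ A.T).
    B_target = np.linalg.solve(A @ A.T, A @ G.T).T
    B = (1.0 - gamma) * B + gamma * B_target
    # With the updated B fixed, the Hessian in A is B.T @ B.
    A_target = np.linalg.solve(B.T @ B, B.T @ G)
    A = (1.0 - gamma) * A + gamma * A_target
    return B, A


# Small random problem with made-up sizes, for illustration only.
rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
B0 = rng.standard_normal((m, r))
A0 = rng.standard_normal((r, n))
G = rng.standard_normal((m, n))

loss_before = fit_loss(B0, A0, G)
B1, A1 = lowrank_momentum_step(B0, A0, G, beta=0.9)
loss_after = fit_loss(B1, A1, G)
print(loss_before, loss_after)
```

Because the objective is quadratic in each factor separately, a damped Newton step with damping in (0, 1] cannot increase the loss, so `loss_after` is no larger than `loss_before`. Only the two factors are stored, never the full m-by-n momentum.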

Weaknesses

__1) Incomplete related work coverage.__ The literature review appears incomplete. The paper omits several relevant lines of work where optimizer momenta or gradient statistics are decomposed or constrained in low-rank form. For example, MLorc [1] and MoFaSGD [2] address momentum/gradient factorization. Within PEFT, papers like LoRA-Pro [3] and LoFT [4] are highly relevant (the latter yielding updates that resemble those proposed here). Another closely related direction is Riemannian low-rank pr

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The relation between EMA and online linear regression is novel and conceptually clean, providing a unified foundation for memory-efficient optimizer design.
2. The method applies seamlessly to multiple modern optimizers (Adam, Muon), offering a general solution for optimizer-state compression in large-scale model training.
3. LoRA-Pre consistently outperforms both projection-based and fine-tuning low-rank optimizers across different scales and tasks. Rank-efficiency studies reinforce the robu

Weaknesses

1. Missing Key Baseline in main table: Although Fira [1] is mentioned in the related work and ablation (Table 3), it is missing from the main results table (Table 1). This omission is non-trivial, since Fira operates under **exactly the same setting** for pre-training and it also deals with the optimizer state itself under a low-rank constraint, making it a direct and essential baseline. The paper should include a comparison with Fira-Adam in the main results. Even if LoRA-Pre does not necessari

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper introduces a clear and technically sound rethinking of momentum optimization by framing EMA updates as an online regression problem and deriving a low-rank closed-form solution. This perspective is both conceptually original and practically valuable, bridging optimization theory with efficient model training. The proposed LoRA-Pre method is well-integrated with existing optimizers like Adam and Muon, achieving strong empirical results with lower memory costs. The experiments are compre

Weaknesses

The empirical evaluation primarily compares with projection-based low-rank optimizers; incorporating a wider range of modern baselines (e.g., Sophia, Shampoo) would provide a more complete assessment of practical advantages. The ablation studies emphasize rank efficiency but do not clearly disentangle the effect of the low-rank momentum representation from other implementation factors such as learning-rate scaling or normalization. The scalability and computational trade-offs of the proposed u

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Tensor decomposition and applications