Low-rank Momentum Factorization for Memory Efficient Training

Pouria Mahdavinia; Mehrdad Mahdavi

arXiv:2507.08091·cs.LG·July 14, 2025

TL;DR

This paper introduces MoFaSGD, a low-rank momentum optimization method that reduces memory usage during large model fine-tuning while maintaining competitive performance, supported by theoretical guarantees and empirical results.

Contribution

MoFaSGD maintains a dynamically updated low-rank factorization of the first-order momentum, so the optimization subspace adapts at every iteration. This enables memory-efficient training with theoretical convergence guarantees.

Findings

01

Reduces memory usage substantially, with savings comparable to those of LoRA.

02

Maintains competitive performance on large language model benchmarks.

03

Provides theoretical convergence guarantees for non-convex stochastic optimization.
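
To put the first finding in concrete terms, the following back-of-the-envelope sketch counts optimizer-state floats for a single weight matrix. It is an illustration, not a figure from the paper: the layer size (4096 x 4096), the rank (8), and the function names are hypothetical, and the factored count assumes only the momentum's SVD factors are stored, ignoring any other state the method may keep.

```python
def adamw_state_floats(m: int, n: int) -> int:
    """AdamW keeps two full-size moment buffers per m x n weight matrix."""
    return 2 * m * n


def factored_momentum_floats(m: int, n: int, rank: int) -> int:
    """Assumed layout: rank-r SVD factors U (m x r), singular values (r), V (n x r)."""
    return rank * (m + n + 1)


def lora_adamw_state_floats(m: int, n: int, rank: int) -> int:
    """LoRA trains adapters B (m x r) and A (r x n); AdamW keeps two moments for each."""
    return 2 * rank * (m + n)


# Hypothetical layer size and rank, chosen only for illustration.
m, n, r = 4096, 4096, 8
print(adamw_state_floats(m, n))           # 33554432
print(factored_momentum_floats(m, n, r))  # 65544
print(lora_adamw_state_floats(m, n, r))   # 131072
```

Under these assumptions, both the factored momentum and the LoRA optimizer state scale with r(m + n) rather than with mn, which is why the two can end up in the same memory range.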

Abstract

Fine-tuning large foundation models presents significant memory challenges due to stateful optimizers like AdamW, often requiring several times more GPU memory than inference. While memory-efficient methods like parameter-efficient fine-tuning (e.g., LoRA) and optimizer state compression exist, recent approaches like GaLore bridge these by using low-rank gradient projections and subspace moment accumulation. However, such methods may struggle with fixed subspaces or computationally costly offline resampling (e.g., requiring full-matrix SVDs). We propose Momentum Factorized SGD (MoFaSGD), which maintains a dynamically updated low-rank SVD representation of the first-order momentum, closely approximating its full-rank counterpart throughout training. This factorization enables a memory-efficient fine-tuning method that adaptively updates the optimization subspace at each iteration.…
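
The abstract does not spell out the update rule, so the sketch below should be read as one plausible way to keep a rank-r SVD of an exponential-moving-average momentum up to date without a full-matrix SVD, not as the authors' algorithm. The function name `update_momentum_factors`, the decay `beta`, and the choice of augmenting the bases with `G @ V` and `G.T @ U` are all assumptions made for illustration.

```python
import numpy as np


def update_momentum_factors(U, s, V, G, beta=0.9):
    """Return rank-r factors approximating beta * U diag(s) V^T + (1 - beta) * G.

    U: (m, r) and V: (n, r) have orthonormal columns, s: (r,) singular values,
    G: (m, n) is the new gradient. Hypothetical sketch, not the paper's method.
    """
    r = s.shape[0]
    # Augment each basis with the directions the new gradient contributes.
    Qu, _ = np.linalg.qr(np.hstack([U, G @ V]))    # (m, 2r)
    Qv, _ = np.linalg.qr(np.hstack([V, G.T @ U]))  # (n, 2r)
    # Project the updated momentum onto the augmented bases: a 2r x 2r core.
    core = beta * (Qu.T @ U) @ np.diag(s) @ (V.T @ Qv)
    core += (1.0 - beta) * (Qu.T @ G @ Qv)
    # The only SVD is on the small core; truncate back to rank r.
    Uc, sc, Vct = np.linalg.svd(core)
    return Qu @ Uc[:, :r], sc[:r], Qv @ Vct[:r].T


# Usage with random data (sizes are arbitrary).
rng = np.random.default_rng(0)
m, n, r = 64, 48, 4
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
s = np.linspace(1.0, 0.1, r)
for _ in range(5):
    G = rng.standard_normal((m, n))  # stand-in for a stochastic gradient
    U, s, V = update_momentum_factors(U, s, V, G)
print(U.shape, s.shape, V.shape)
```

The design choice here is that every decomposition operates on matrices with at most 2r columns, so the per-step cost is dominated by the products with G rather than by an SVD of an m x n matrix. The result is in general only an approximation of the best rank-r fit to the updated momentum.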
