Low-rank Momentum Factorization for Memory Efficient Training
Pouria Mahdavinia, Mehrdad Mahdavi

TL;DR
This paper introduces MoFaSGD, a low-rank momentum optimization method that reduces memory usage during large model fine-tuning while maintaining competitive performance, supported by theoretical guarantees and empirical results.
Contribution
MoFaSGD offers a novel dynamic low-rank momentum factorization approach that adaptively updates the optimization subspace, enabling memory-efficient training with theoretical convergence guarantees.
Findings
Reduces memory usage to a level comparable to LoRA.
Maintains competitive performance on large language model benchmarks.
Provides theoretical convergence guarantees for non-convex stochastic optimization.
Abstract
Fine-tuning large foundation models presents significant memory challenges due to stateful optimizers like AdamW, often requiring several times more GPU memory than inference. While memory-efficient methods like parameter-efficient fine-tuning (e.g., LoRA) and optimizer state compression exist, recent approaches like GaLore bridge these by using low-rank gradient projections and subspace moment accumulation. However, such methods may struggle with fixed subspaces or computationally costly offline resampling (e.g., requiring full-matrix SVDs). We propose Momentum Factorized SGD (MoFaSGD), which maintains a dynamically updated low-rank SVD representation of the first-order momentum, closely approximating its full-rank counterpart throughout training. This factorization enables a memory-efficient fine-tuning method that adaptively updates the optimization subspace at each iteration.…
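The idea of keeping the momentum in factored form can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the function names `update_momentum_factors` and `factor_floats`, the decay parameter `beta`, and the specific update (project onto augmented bases, then take an SVD of a small 2r x 2r core) are assumptions chosen for illustration. The sketch only shows how a rank-r factorization U diag(s) V^T of the momentum might be refreshed with a new gradient G without running an SVD on the full m x n matrix.

```python
import numpy as np


def update_momentum_factors(U, s, V, G, beta=0.9):
    """Illustrative rank-r refresh of a factored momentum (assumed scheme).

    Approximates beta * U @ diag(s) @ V.T + (1 - beta) * G with a new
    rank-r factorization, using only thin QRs and a (2r x 2r) SVD.
    U: (m, r) and V: (n, r) have orthonormal columns; s: (r,); G: (m, n).
    Assumes m >= 2r and n >= 2r.
    """
    r = s.shape[0]
    # Augment the old bases with the directions the gradient adds.
    Qu, _ = np.linalg.qr(np.hstack([U, G @ V]))    # (m, 2r)
    Qv, _ = np.linalg.qr(np.hstack([V, G.T @ U]))  # (n, 2r)
    # Project the updated momentum onto the augmented bases (small core).
    core = beta * (Qu.T @ U) @ np.diag(s) @ (V.T @ Qv)
    core += (1.0 - beta) * (Qu.T @ G @ Qv)
    # The SVD is taken on the small core only, then truncated to rank r.
    Uc, sc, Vct = np.linalg.svd(core)
    return Qu @ Uc[:, :r], sc[:r], Qv @ Vct[:r].T


def factor_floats(m, n, r):
    """Number of floats stored by the factors U, s, V."""
    return r * (m + n + 1)
```

Under these assumptions the optimizer state for an m x n weight matrix takes r(m + n + 1) floats instead of mn. For m = n = 4096 and r = 8 that is 65,544 floats against 16,777,216 for a dense momentum buffer. The gradient G is still formed in full at each step in this sketch; only the persistent state is low-rank.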
