FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

Philip Zmushko; Aleksandr Beznosikov; Martin Takáč; Samuel Horváth

arXiv:2411.07837·cs.LG·August 15, 2025

TL;DR

FRUGAL is an optimization framework that reduces optimizer-state memory when training large models. It splits the gradient, combining low-rank updates with state-free methods, so that memory use drops without sacrificing performance.

Contribution

The paper introduces gradient splitting, which combines low-rank updates with state-free methods, provides theoretical guarantees, and reports outperforming existing approaches in memory-constrained training.
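
As a rough illustration of the gradient-splitting idea, the sketch below applies an Adam-style update to a chosen subset of coordinates and a state-free signSGD step to all remaining ones. Every coordinate is updated, but moment buffers exist only for the subset. Treating a coordinate subset as the low-dimensional part, using signSGD as the state-free method, and the name `split_step` with its parameters are all assumptions made for illustration; none of this is taken from the paper's algorithm listing or from the official repository.

```python
import math


def split_step(params, grads, stateful_idx, state,
               lr=0.1, lr_free=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update on a flat list of parameters.

    Coordinates in `stateful_idx` get an Adam-style step (assumed stateful
    optimizer); all others get a signSGD step, which stores no state.
    """
    state["t"] += 1
    t = state["t"]
    stateful = set(stateful_idx)
    new_params = list(params)
    for i, (p, g) in enumerate(zip(params, grads)):
        if i in stateful:
            # Moment buffers are kept only for the stateful coordinates.
            m = beta1 * state["m"].get(i, 0.0) + (1 - beta1) * g
            v = beta2 * state["v"].get(i, 0.0) + (1 - beta2) * g * g
            state["m"][i], state["v"][i] = m, v
            m_hat = m / (1 - beta1 ** t)  # bias correction
            v_hat = v / (1 - beta2 ** t)
            new_params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        else:
            # State-free branch: only the sign of the gradient is used.
            sign = (g > 0) - (g < 0)
            new_params[i] = p - lr_free * sign
    return new_params


# Hypothetical toy values, chosen only to exercise both branches.
state = {"t": 0, "m": {}, "v": {}}
params = [1.0, -2.0, 3.0, 0.5]
grads = [0.2, -0.4, 0.0, 1.5]
updated = split_step(params, grads, stateful_idx=[0, 1], state=state)
```

In this toy run only two of the four coordinates carry moment buffers, yet all coordinates with a non-zero gradient move, which is the sense in which the update is not confined to the stateful subset.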

Findings

01 Outperforms concurrent methods across fixed memory budgets

02 Achieves state-of-the-art results in pre-training and fine-tuning

03 Balances memory efficiency with high performance

Abstract

With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a…
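
To make the optimizer-state cost concrete, the following back-of-the-envelope sketch counts the bytes Adam needs for its two moment buffers, and how that shrinks if only a fraction of the parameters keeps state. The function name `adam_state_bytes`, the 4-byte (fp32) state assumption, the one-billion-parameter model, and the 25% stateful fraction are hypothetical values chosen for illustration, not figures from the paper.

```python
def adam_state_bytes(num_params, stateful_fraction=1.0, bytes_per_value=4):
    """Bytes used by Adam's first- and second-moment buffers.

    Adam keeps two values per stateful parameter; parameters handled by a
    state-free method contribute nothing. Assumes fp32 state by default.
    """
    return int(2 * num_params * stateful_fraction * bytes_per_value)


# Hypothetical 1B-parameter model.
full = adam_state_bytes(1_000_000_000)                             # all parameters stateful
partial = adam_state_bytes(1_000_000_000, stateful_fraction=0.25)  # a quarter stateful
print(full / 1e9, "GB vs", partial / 1e9, "GB")
```

This ignores any extra storage a particular method might need, such as projection matrices, so it should be read as a lower bound on the stateful part rather than a full memory model.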

Code & Models

Repositories

fzmushko/frugal (PyTorch, official)

Videos

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Parallel Computing and Optimization Techniques

Methods: Stochastic Gradient Descent