FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training
Philip Zmushko, Aleksandr Beznosikov, Martin Takáč, Samuel Horváth

TL;DR
FRUGAL is a novel optimization framework that reduces memory overhead in training large models by splitting gradients and combining low-rank updates with state-free methods, improving efficiency without sacrificing performance.
Contribution
It introduces gradient splitting with low-rank updates and state-free methods, providing theoretical guarantees and outperforming existing approaches in memory-constrained training.
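The gradient-splitting idea can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the low-rank subspace comes from an SVD of the first gradient, that the stateful part uses Adam-style moments, and that the state-free part uses signSGD. The names `make_projection`, `split_gradient`, `init_state`, and `split_step`, and all hyperparameter values, are hypothetical. The point it shows is that optimizer state is stored only for the `(rank, n)` projected gradient, while the residual still contributes to the update through a rule that needs no state.

```python
import numpy as np


def make_projection(grad, rank):
    """Orthonormal basis (m, rank) for the top-`rank` left singular subspace of grad."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def split_gradient(grad, proj):
    """Split grad into its coordinates in span(proj) and the remaining residual."""
    low = proj.T @ grad            # (rank, n): compact coordinates in the subspace
    residual = grad - proj @ low   # (m, n): the part a purely low-rank method would drop
    return low, residual


def init_state(grad, rank):
    """Allocate Adam-style moments only for the projected (rank, n) gradient."""
    proj = make_projection(grad, rank)  # assumption: basis fixed from the first gradient
    n = grad.shape[1]
    return {"proj": proj, "m": np.zeros((rank, n)), "v": np.zeros((rank, n)), "t": 0}


def split_step(w, grad, state, lr=1e-2, lr_free=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One update: Adam inside the subspace, signSGD (no state) on the residual."""
    low, residual = split_gradient(grad, state["proj"])

    # Stateful part: standard Adam moments with bias correction, in projected space.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * low
    state["v"] = beta2 * state["v"] + (1 - beta2) * low ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    stateful = state["proj"] @ (m_hat / (np.sqrt(v_hat) + eps))  # back to (m, n)

    # State-free part: the sign of the residual requires no stored moments.
    state_free = np.sign(residual)

    return w - lr * stateful - lr_free * state_free


rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))
grad = rng.standard_normal((6, 4))
state = init_state(grad, rank=2)
low, residual = split_gradient(grad, state["proj"])
w_new = split_step(w, grad, state)
```

In this sketch the stored moments have shape `(2, 4)` rather than `(6, 4)`, yet the combined update is not confined to the rank-2 subspace, because the residual term spans the remaining directions.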
Findings
Outperforms concurrent methods across fixed memory budgets
Achieves state-of-the-art results in pre-training and fine-tuning
Balances memory efficiency with high performance
Abstract
With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a…
Taxonomy
Topics: Neural Networks and Applications · Parallel Computing and Optimization Techniques
Methods: Stochastic Gradient Descent
