TL;DR
FOAM is a memory-efficient optimizer for training large language models. It eliminates up to 90% of optimizer memory overhead while keeping convergence guarantees equivalent to those of vanilla Adam.
Contribution
The paper introduces FOAM (Folded Optimizer with Approximate Moment), an optimizer that compresses optimizer states into block-wise gradient means and applies a residual correction to recover the information lost in compression, saving memory without degrading performance.
Findings
Eliminates up to 90% of optimizer memory overhead.
Accelerates convergence compared to standard Adam.
Compatible with other memory-efficient optimizers, matching or surpassing their performance.
Abstract
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings.…
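To make the idea of block-wise means plus a residual concrete, here is a minimal sketch in plain Python. It is not the paper's update rule: the helper names `fold`, `unfold`, and `residual`, the flat-list gradient layout, and the block size of 4 are all assumptions chosen for illustration. How FOAM uses the residual inside its moment updates is not described in this summary, so the sketch only shows what the compression keeps and what it drops.

```python
from typing import List


def fold(grad: List[float], block_size: int) -> List[float]:
    """Compress a flat gradient into one mean per block of `block_size` entries."""
    if len(grad) % block_size != 0:
        raise ValueError("gradient length must be a multiple of block_size")
    return [
        sum(grad[i:i + block_size]) / block_size
        for i in range(0, len(grad), block_size)
    ]


def unfold(means: List[float], block_size: int) -> List[float]:
    """Expand block means back to full length by repeating each mean."""
    return [m for m in means for _ in range(block_size)]


def residual(grad: List[float], block_size: int) -> List[float]:
    """The part of the gradient that the block means cannot represent."""
    approx = unfold(fold(grad, block_size), block_size)
    return [g - a for g, a in zip(grad, approx)]


# Hypothetical 8-entry gradient, folded with an assumed block size of 4.
grad = [1.0, 3.0, 2.0, 6.0, -1.0, 1.0, 0.0, 4.0]
block_size = 4

means = fold(grad, block_size)      # 2 stored values instead of 8
res = residual(grad, block_size)    # information lost by folding

# Adding the residual back to the unfolded means recovers the gradient.
reconstructed = [a + r for a, r in zip(unfold(means, block_size), res)]
```

Under these assumptions, a state kept at block resolution needs one entry per block rather than one per parameter, so a block size of b shrinks that state by a factor of b. The residual is exactly the detail the folding discards, which is the kind of information the abstract says the correction is meant to recover.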
