Gradient Multi-Normalization for Stateless and Scalable LLM Training
Meyer Scetbon, Chao Ma, Wenbo Gong, Edward Meeds

TL;DR
This paper introduces a scalable stateless optimizer based on multi-normalization of gradients. It removes the memory overhead of optimizer states in large language model training and reports a 3X training speedup over Adam on LLaMA models.
Contribution
It presents a framework for stateless optimization that normalizes gradients with respect to multiple norms, generalizes SWAN, and yields a more efficient, scalable optimizer for large-scale LLM training.
Findings
Achieves a 3X training speedup over Adam on LLaMA models.
Reduces memory requirements relative to stateful optimizers such as Adam, since no optimizer state is stored.
Outperforms existing memory-efficient optimization baselines.
Abstract
Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015), which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalizes stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce the normalization of gradients w.r.t. these norms. We show that our procedure can produce, up to an arbitrary precision, a fixed point of the problem, and that SWAN is a particular instance of our approach with carefully…
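The alternating scheme mentioned in the abstract can be illustrated with a minimal sketch. The truncated abstract does not say which norms are used, so this example assumes two of them: the Euclidean norm of each row and the Euclidean norm of each column of a gradient matrix. The function names, the target values (sqrt of the number of columns for rows, sqrt of the number of rows for columns), and the iteration count are all hypothetical choices for illustration, not the authors' implementation.

```python
import math


def normalize_rows(g):
    """Rescale every row of g to Euclidean norm sqrt(number of columns)."""
    n_cols = len(g[0])
    out = []
    for row in g:
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows untouched instead of dividing by zero.
        scale = math.sqrt(n_cols) / norm if norm > 0 else 0.0
        out.append([x * scale for x in row])
    return out


def transpose(g):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*g)]


def normalize_cols(g):
    """Rescale every column of g to Euclidean norm sqrt(number of rows)."""
    return transpose(normalize_rows(transpose(g)))


def multi_normalize(g, n_iters=20):
    """Alternate row and column normalization (assumed pair of norms).

    Each pass enforces one constraint exactly and perturbs the other;
    repeating the passes drives the matrix toward a point where both
    constraints hold approximately at once.
    """
    for _ in range(n_iters):
        g = normalize_rows(g)
        g = normalize_cols(g)
    return g


if __name__ == "__main__":
    grad = [[1.0, 2.0, 3.0], [4.0, -5.0, 6.0]]
    for row in multi_normalize(grad):
        print(row)
```

Under these assumptions, a stateless update would use the normalized matrix directly in place of the raw gradient, for example `w = w - lr * multi_normalize(grad)` applied entrywise with a hypothetical learning rate `lr`, so nothing needs to be carried over between steps.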
Taxonomy
Topics: Natural Language Processing Techniques
