SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds

TL;DR
SWAN is a novel stateless optimizer that uses gradient normalization and whitening to match Adam's performance in large language model training while significantly reducing memory usage.
Contribution
The paper introduces SWAN, a stateless SGD-based optimizer with normalization and whitening, enabling efficient LLM training without optimizer states.
Findings
SWAN achieves similar or better performance than Adam in LLM training.
SWAN reduces the total memory footprint by approximately 50% relative to Adam.
SWAN reaches the target perplexity in roughly half the tokens Adam needs, an effective 2x training speedup.
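A rough way to see where a figure near 50% could come from is sketched below. It assumes that weights, gradients, and Adam's two moment estimates are all stored at the same precision, and it ignores activations; neither assumption is stated in this summary. The function name and the parameter count are hypothetical.

```python
def training_memory_bytes(num_params, bytes_per_value=2, optimizer_states=0):
    """Bytes for weights + gradients + per-parameter optimizer states.

    Assumption: every tensor uses the same precision; activations are ignored.
    """
    tensors_per_param = 2 + optimizer_states  # weights, gradients, then states
    return num_params * bytes_per_value * tensors_per_param


num_params = 1_000_000_000  # hypothetical model size

adam_bytes = training_memory_bytes(num_params, optimizer_states=2)  # first and second moments
stateless_bytes = training_memory_bytes(num_params, optimizer_states=0)  # SGD-style, no states

reduction = 1 - stateless_bytes / adam_bytes
print(f"Adam: {adam_bytes / 1e9:.1f} GB, stateless: {stateless_bytes / 1e9:.1f} GB")
print(f"Reduction: {reduction:.0%}")
```

Under these assumptions, dropping both Adam states removes two of the four per-parameter tensors, which is consistent with the reported reduction.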
Abstract
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and…
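The idea of pre-processing the instantaneous gradient with normalization and whitening before a plain SGD step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the row-wise standardization, the eigendecomposition-based whitening, and the `eps` and `lr` values are all assumptions made for illustration.

```python
import numpy as np


def grad_norm(g, eps=1e-8):
    """Standardize each row of a 2-D gradient to zero mean and unit variance.

    Assumption: normalization is applied row-wise.
    """
    mean = g.mean(axis=1, keepdims=True)
    std = g.std(axis=1, keepdims=True)
    return (g - mean) / (std + eps)


def grad_whiten(g, eps=1e-8):
    """Whiten a 2-D gradient by left-multiplying with (G G^T)^(-1/2).

    Assumption: the inverse square root is taken from an eigendecomposition,
    chosen here for clarity rather than speed.
    """
    cov = g @ g.T
    evals, evecs = np.linalg.eigh(cov)
    # V diag(1 / sqrt(lambda)) V^T; division scales the columns of evecs.
    inv_sqrt = (evecs / np.sqrt(np.maximum(evals, 0.0) + eps)) @ evecs.T
    return inv_sqrt @ g


def stateless_step(w, g, lr=1e-3):
    """One SGD step on the pre-processed gradient; nothing is stored between steps."""
    return w - lr * grad_whiten(grad_norm(g))


rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 8))
grad = rng.standard_normal((4, 8))
new_weights = stateless_step(weights, grad, lr=0.1)
```

Each call uses only the current gradient, so, unlike Adam's moment estimates, no per-parameter state has to persist across iterations.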
Taxonomy
Topics: Topic Modeling
Methods: LLaMA · Adam · Stochastic Gradient Descent
