SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

Chao Ma; Wenbo Gong; Meyer Scetbon; Edward Meeds

arXiv:2412.13148·cs.LG·February 24, 2025

TL;DR

SWAN is a novel stateless optimizer that uses gradient normalization and whitening to match Adam's performance in large language model training while significantly reducing memory usage.

Contribution

The paper introduces SWAN, a stateless SGD-based optimizer with normalization and whitening, enabling efficient LLM training without optimizer states.

Findings

01

SWAN achieves performance similar to or better than Adam's in LLM training.

02

SWAN reduces the total memory footprint by approximately 50% relative to Adam.

03

SWAN roughly doubles training speed, measured as the number of tokens needed to reach a target perplexity.
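
To make the memory finding concrete, here is a back-of-the-envelope sketch in Python. It is not the paper's measurement: it assumes that only weights, gradients, and optimizer states are counted, that all buffers use the same precision, and that activations are ignored. The helper `training_state_units` is a hypothetical name introduced for illustration. The one fact it relies on is that Adam keeps two parameter-sized buffers (first and second moment estimates), while a stateless optimizer keeps none.

```python
def training_state_units(optimizer_state_copies):
    """Count parameter-sized buffers held during training.

    Toy accounting (an assumption, not the paper's method):
    one copy of the weights, one copy of the gradients, plus
    however many parameter-sized optimizer state buffers exist.
    """
    weights = 1
    gradients = 1
    return weights + gradients + optimizer_state_copies


adam_units = training_state_units(2)       # first and second moment estimates
stateless_units = training_state_units(0)  # no optimizer state at all

# Fraction of these buffers saved by dropping the optimizer state.
saving = 1 - stateless_units / adam_units
print(f"Adam: {adam_units} units, stateless: {stateless_units} units, saving: {saving:.0%}")
```

Under these simplified assumptions the saving comes out at one half. Real end-to-end numbers also depend on activations, precision choices, and implementation details that this sketch leaves out.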

Abstract

Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and…
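
The abstract is cut off before it defines the two pre-processing operators, so the following NumPy sketch is only one plausible reading of "normalization and whitening" applied to a single gradient matrix. Everything specific here is an assumption: the function names `grad_norm`, `grad_whiten`, and `stateless_step`, the choice to standardize each row, the whitening form (G Gᵀ)^(-1/2) G, and the use of an eigendecomposition to compute it. The point it illustrates is that each step uses only the current gradient and carries nothing over between steps.

```python
import numpy as np


def grad_norm(g, eps=1e-8):
    """Standardize each row of the gradient to zero mean and unit variance.

    Assumption: normalization acts row-wise; the source does not say which axis.
    """
    mean = g.mean(axis=1, keepdims=True)
    std = g.std(axis=1, keepdims=True)
    return (g - mean) / (std + eps)


def grad_whiten(g, eps=1e-8):
    """Return (G G^T)^(-1/2) G, so that the rows of the result are orthonormal.

    Assumption: computed with a symmetric eigendecomposition for clarity;
    a cheaper iterative scheme could be substituted.
    """
    cov = g @ g.T                                  # (rows x rows) Gram matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # cov is symmetric PSD
    inv_sqrt_vals = 1.0 / np.sqrt(np.maximum(eigvals, eps))
    inv_sqrt = (eigvecs * inv_sqrt_vals) @ eigvecs.T
    return inv_sqrt @ g


def stateless_step(w, g, lr=1e-3):
    """One SGD-style update on the pre-processed gradient; no state is stored."""
    return w - lr * grad_whiten(grad_norm(g))


# Usage on a random stand-in for a weight matrix and its gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16))
g = rng.normal(size=(4, 16))
w_next = stateless_step(w, g)
```

Because `stateless_step` takes only the weights and the current gradient, there are no moment buffers to allocate, which is the property the abstract contrasts with Adam.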

Videos

SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training · SlidesLive

Taxonomy

Topics: Topic Modeling

Methods: LLaMA · Adam · Stochastic Gradient Descent