Gradient Multi-Normalization for Stateless and Scalable LLM Training

Meyer Scetbon; Chao Ma; Wenbo Gong; Edward Meeds

arXiv:2502.06742·cs.LG·February 11, 2025

Open Access

TL;DR

This paper introduces a scalable, stateless optimizer based on multi-normalization of gradients. It makes large language model training faster and less memory-intensive, and demonstrates significant speedups on LLaMA models.

Contribution

It presents a new framework for stateless optimization that normalizes gradients with respect to multiple norms, generalizes SWAN, and yields a more efficient, scalable optimizer for large-scale LLM training.

Findings

01  Achieves 3X training speedup over Adam on LLaMA models.

02  Reduces memory requirements significantly compared to traditional optimizers.

03  Outperforms existing memory-efficient optimization baselines.

Abstract

Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015), which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalizes stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce the normalization of gradients w.r.t. these norms. We show that our procedure can produce, up to arbitrary precision, a fixed point of the problem, and that SWAN is a particular instance of our approach with carefully…
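The alternating scheme mentioned in the abstract can be illustrated with a minimal sketch. The excerpt above does not say which norms are used, so the choice below (row-wise and column-wise Euclidean norms of a weight-gradient matrix) is an assumption, and the function names `row_normalize`, `col_normalize`, `multi_normalize`, and `stateless_step` are hypothetical. It shows the general idea of alternating normalizations inside a stateless update and is not the authors' implementation.

```python
import numpy as np


def row_normalize(g, eps=1e-12):
    """Rescale every row of g to Euclidean norm sqrt(n_cols)."""
    n_cols = g.shape[1]
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return np.sqrt(n_cols) * g / (norms + eps)


def col_normalize(g, eps=1e-12):
    """Rescale every column of g to Euclidean norm sqrt(n_rows)."""
    n_rows = g.shape[0]
    norms = np.linalg.norm(g, axis=0, keepdims=True)
    return np.sqrt(n_rows) * g / (norms + eps)


def multi_normalize(grad, num_iters=5):
    """Alternate the two normalizations (assumed norms, see lead-in).

    Each pass enforces one constraint exactly and may slightly break the
    other; repeating the passes drives the matrix toward a point where
    both constraints hold approximately at the same time.
    """
    g = np.asarray(grad, dtype=np.float64)
    for _ in range(num_iters):
        g = row_normalize(g)
        g = col_normalize(g)
    return g


def stateless_step(weight, grad, lr=1e-2, num_iters=5):
    """One update that uses only the current gradient (no stored moments)."""
    return weight - lr * multi_normalize(grad, num_iters)
```

Both target norms are scaled so that they imply the same squared Frobenius norm (`n_rows * n_cols`), which makes the two constraints compatible. In contrast to Adam, `stateless_step` keeps no buffers between calls, so its extra memory is limited to temporaries for the current gradient.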

Taxonomy

Topics: Natural Language Processing Techniques