Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Athanasios Glentis; Jiaxiang Li; Andi Han; Mingyi Hong

arXiv:2506.16659 · cs.LG · May 22, 2026

TL;DR

The paper introduces SCALE, a simple, memory-efficient optimizer for large language model pretraining that matches or surpasses Adam's performance while using significantly less memory.

Contribution

It proposes two minimal modifications to SGD—column-wise gradient normalization and first-order momentum on the last layer—that together form SCALE, a highly memory-efficient optimizer.

Findings

01. SCALE matches or exceeds Adam's performance across multiple models.

02. SCALE uses only 35-45% of the memory required by Adam.

03. SCALE outperforms existing memory-efficient optimizers like GaLore, Fira, and APOLLO.
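To give a feel for where a figure in the 35-45% range could come from, here is a rough back-of-the-envelope sketch. It assumes (this page does not confirm it) that "memory" counts the weights plus optimizer states at the same precision, that Adam keeps two extra values per parameter, and that SCALE keeps one extra value only for last-layer parameters. The function name and the parameter counts are hypothetical, chosen for illustration only.

```python
def state_memory_ratio(total_params: int, last_layer_params: int) -> float:
    """Ratio of stored values (weights + optimizer state), SCALE-style vs. Adam.

    Assumption: weights and states share one precision, so memory is
    proportional to the number of stored values.
    """
    # Adam: weights + first moment + second moment for every parameter.
    adam_values = 3 * total_params
    # SCALE-style: weights + a momentum buffer for the last layer only.
    scale_values = total_params + last_layer_params
    return scale_values / adam_values


# Hypothetical sizes (not taken from the paper).
ratio = state_memory_ratio(total_params=1_000_000_000, last_layer_params=65_000_000)
print(f"{ratio:.1%}")
```

Under this accounting the ratio approaches one third as the last layer becomes a smaller share of the model, and rises for small models where the output layer holds a larger fraction of the parameters.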

Abstract

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE…
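The two techniques named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the linked repository for that). Several details are assumptions: the gradient is a list of rows, each column is divided by its own L2 norm (which matrix axis corresponds to the "output dimension" depends on the weight layout), momentum is an exponential moving average with a hypothetical `beta`, and the momentum is applied before normalization. The names `normalize_columns` and `scale_like_step` are made up for this example.

```python
import math


def normalize_columns(grad, eps=1e-8):
    """Divide each column of a 2-D gradient (list of rows) by its L2 norm."""
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [[grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
            for i in range(n_rows)]


def scale_like_step(weight, grad, lr, momentum_buf=None, beta=0.9):
    """One SGD-style update with column-normalized gradients.

    Pass momentum_buf only for the last (output) layer; other layers
    keep no optimizer state. Returns (new_weight, new_momentum_buf).
    """
    if momentum_buf is not None:
        # Assumed form: exponential moving average of past gradients.
        grad = [[beta * m + (1 - beta) * g for m, g in zip(m_row, g_row)]
                for m_row, g_row in zip(momentum_buf, grad)]
        momentum_buf = grad
    update = normalize_columns(grad)
    new_weight = [[w - lr * u for w, u in zip(w_row, u_row)]
                  for w_row, u_row in zip(weight, update)]
    return new_weight, momentum_buf


# Toy data: a stateless hidden layer and an output layer with momentum.
w = [[1.0, 2.0], [3.0, 4.0]]
g = [[3.0, 0.0], [4.0, 2.0]]
hidden_w, hidden_buf = scale_like_step(w, g, lr=0.1)
zeros = [[0.0, 0.0], [0.0, 0.0]]
out_w, out_buf = scale_like_step(w, g, lr=0.1, momentum_buf=zeros)
```

In this sketch only the output layer carries a buffer between steps, which is where the memory saving relative to Adam's two per-parameter states would come from.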

Code & Models

Repositories

optimai-lab/minimalist_llm_pretraining
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Stochastic Gradient Descent · LLaMA · Adaptive Parameter-wise Diagonal Quasi-Newton Method