TL;DR
The paper introduces SCALE, a simple, memory-efficient optimizer for large language model pretraining that matches or surpasses Adam's performance while using significantly less memory.
Contribution
It proposes two minimal modifications to SGD—column-wise gradient normalization and first-order momentum on the last layer—that together form SCALE, a highly memory-efficient optimizer.
Findings
SCALE matches or exceeds Adam's performance across multiple models.
SCALE uses only 35-45% of the memory required by Adam.
SCALE outperforms existing memory-efficient optimizers like GaLore, Fira, and APOLLO.
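As a rough sanity check on the memory figure, here is a back-of-the-envelope sketch. It assumes that memory counts the weights plus the optimizer states, all in the same precision, which this summary does not state. The function name and the parameter counts are hypothetical and are not taken from the paper.

```python
def state_memory_ratio(total_params, last_layer_params):
    """Memory of (weights + last-layer momentum) relative to (weights + two Adam moments)."""
    # Adam keeps a first and a second moment for every parameter.
    adam = 3 * total_params
    # Assumed accounting: weights plus one momentum buffer for the last layer only.
    sgd_plus_momentum = total_params + last_layer_params
    return sgd_plus_momentum / adam

# Hypothetical sizes: 1B parameters, 100M of them in the output layer.
ratio = state_memory_ratio(1_000_000_000, 100_000_000)
print(f"{ratio:.1%}")
```

Under this accounting the ratio cannot drop below one third, and it grows with the share of parameters in the output layer, which is at least consistent with the 35-45% range reported above.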
Abstract
Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE…
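The two techniques named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. It makes three assumptions: each gradient has shape (d_out, d_in), so normalizing "along the output dimension" means taking the L2 norm over axis 0; momentum is applied before normalization; and the layer names (`hidden`, `lm_head`), the function names, and the hyperparameters are all made up for the example.

```python
import numpy as np

def colnorm(grad, eps=1e-8):
    """Scale each column of `grad` to unit L2 norm (norm taken over axis 0)."""
    norms = np.linalg.norm(grad, axis=0, keepdims=True)
    return grad / (norms + eps)

def scale_like_step(params, grads, momentum, last_layer, lr=1e-3, beta=0.9):
    """One update: normalized SGD on every layer, momentum only on `last_layer`."""
    for name, g in grads.items():
        if name == last_layer:
            # The only optimizer state: a running average of the last layer's gradient.
            momentum[name] = beta * momentum[name] + (1.0 - beta) * g
            g = momentum[name]
        # Assumed ordering: normalize after the (optional) momentum step.
        params[name] = params[name] - lr * colnorm(g)
    return params, momentum

# Toy two-layer setup with random weights and gradients.
rng = np.random.default_rng(0)
params = {"hidden": rng.normal(size=(4, 3)), "lm_head": rng.normal(size=(5, 4))}
grads = {k: rng.normal(size=v.shape) for k, v in params.items()}
momentum = {"lm_head": np.zeros_like(params["lm_head"])}
params, momentum = scale_like_step(params, grads, momentum, last_layer="lm_head")
```

The `momentum` dictionary holds a buffer for `lm_head` only, so no other layer carries optimizer state between steps. That is where the memory saving over Adam would come from.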
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Stochastic Gradient Descent · LLaMA · Adaptive Parameter-wise Diagonal Quasi-Newton Method
