Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates
Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Mingyi Hong

TL;DR
This paper investigates why SGD underperforms Adam in LLM pre-training, attributes much of the gap to SGD's difficulty sustaining large effective learning rates, and shows that simple clipping largely closes it.
Contribution
The study provides empirical and theoretical insights into the SGD-Adam performance gap in LLM training and proposes clipping techniques to enable SGD to match Adam's effectiveness.
Findings
Large effective learning rates are crucial for LLM pre-training.
Clipping mechanisms enable SGD to perform nearly as well as Adam.
With clipping, the validation loss gap between SGD and Adam shrinks from over 50% to 3.5%.
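To make the clipping finding concrete, here is a minimal sketch of one SGD step with global-norm gradient clipping. This is the generic, widely used form of clipping and is an assumption on our part: the summary does not say which clipping variant the authors use, and the names `clip_by_global_norm`, `clipped_sgd_step`, and `max_norm` are illustrative only.

```python
import math


def global_norm(grads):
    """Euclidean norm of all gradient entries, treated as one flat vector."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global norm is at most max_norm.

    Gradients already within the threshold are returned unchanged in value.
    """
    norm = global_norm(grads)
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [[g * scale for g in layer] for layer in grads]


def clipped_sgd_step(params, grads, lr, max_norm):
    """One plain SGD update applied to the clipped gradients (hypothetical helper)."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [
        [p - lr * g for p, g in zip(p_layer, g_layer)]
        for p_layer, g_layer in zip(params, clipped)
    ]


# Toy example: two "layers" of parameters, one spiky and one small gradient.
params = [[1.0, -2.0], [0.5]]
spike = [[30.0, 40.0], [0.0]]   # global norm 50, will be scaled down
small = [[0.03, 0.04], [0.0]]   # global norm about 0.05, left as-is

after_spike = clipped_sgd_step(params, spike, lr=0.5, max_norm=1.0)
after_small = clipped_sgd_step(params, small, lr=0.5, max_norm=1.0)
```

The point of the sketch is that the threshold caps the size of any single update, so a large learning rate can be used on typical small gradients without a gradient spike producing an oversized step.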
Abstract
It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects…
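As a rough illustration of why small gradient norms imply large effective learning rates, the sketch below computes Adam's first bias-corrected update for a scalar gradient and the SGD learning rate that would produce a step of the same size. Defining the "effective learning rate" as step size divided by gradient magnitude is our simplifying assumption, not necessarily the paper's definition, and the function names are hypothetical.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's first update for a scalar gradient, with bias correction.

    On the first step the corrected moments equal grad and grad**2, so the
    update is close to lr * sign(grad) regardless of the gradient's size.
    """
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def equivalent_sgd_lr(grad, **adam_kwargs):
    """Learning rate plain SGD would need to take the same step on this gradient."""
    return adam_first_step(grad, **adam_kwargs) / grad


# With a gradient of 1e-3, SGD needs a learning rate of roughly 1 to match
# Adam at lr=1e-3; with a gradient of 1.0 it needs only roughly 1e-3.
lr_for_small_grad = equivalent_sgd_lr(1e-3)
lr_for_unit_grad = equivalent_sgd_lr(1.0)
```

Under this simplified view, the smaller the gradient, the larger the learning rate SGD must sustain to keep pace with Adam, which is consistent with the abstract's emphasis on small gradient norms.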
