Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, Mark Schmidt

TL;DR
This paper investigates why Adam outperforms SGD on transformers, finding that noise isn't the main factor and that Adam's advantage resembles sign descent with momentum, especially at large batch sizes.
Contribution
The paper challenges the heavy-tailed noise hypothesis and links Adam's performance to sign descent with momentum.
Findings
Noise and stochasticity are not primary factors in the SGD-Adam gap.
Adam's performance improves with larger batch sizes, unlike SGD.
Adam's behavior at large batches resembles sign descent with momentum.
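The link between Adam and sign descent can be illustrated with a short sketch. The code below is not taken from the paper: the function names, the NumPy implementation, and the particular form of sign descent with momentum (the sign of an exponential moving average of gradients) are assumptions made for illustration. It shows only the well-known limiting case in which Adam with beta1 = 0, beta2 = 0, and eps = 0 reduces to a plain sign step, since each coordinate of the update becomes g / |g|.

```python
import numpy as np


def sign_descent_momentum_step(x, m, grad, lr=0.1, beta=0.9):
    """One step of sign descent with momentum.

    Assumed form (hypothetical, for illustration): take the sign of an
    exponential moving average of past gradients.
    """
    m = beta * m + (1.0 - beta) * grad
    x = x - lr * np.sign(m)
    return x, m


def adam_step(x, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step with bias correction (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v


if __name__ == "__main__":
    # Nonzero gradient entries, so g / |g| is well defined when eps = 0.
    g = np.array([3.0, -0.5, 2.0])
    x0 = np.zeros(3)

    # Adam with no averaging and no epsilon: the update is g / sqrt(g^2).
    x_adam, _, _ = adam_step(
        x0, np.zeros(3), np.zeros(3), g, t=1,
        lr=0.1, beta1=0.0, beta2=0.0, eps=0.0,
    )
    # Sign descent without momentum: the update is sign(g).
    x_sign, _ = sign_descent_momentum_step(x0, np.zeros(3), g, lr=0.1, beta=0.0)

    print(np.allclose(x_adam, x_sign))
```

With nonzero beta1 and beta2 the two updates no longer coincide exactly, so the sketch should be read as motivation for the comparison rather than as a reproduction of the paper's experiments.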
Abstract
The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by sampling has heavy tails. This suggests that Adam outperforms SGD because it uses a more robust gradient estimate. We evaluate this hypothesis by varying the batch size, up to the entire dataset, to control for stochasticity. We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, Adam performs better as the batch size increases, while…
Taxonomy
Topics: Neural Networks and Applications · Metaheuristic Optimization Algorithms Research · Advanced Neural Network Applications
Methods: Gradient Clipping · Adam · Stochastic Gradient Descent
