Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

Frederik Kunstner; Jacques Chen; Jonathan Wilder Lavington; Mark Schmidt

arXiv:2304.13960·cs.LG·April 28, 2023·5 cites

TL;DR

This paper investigates why Adam outperforms SGD on transformers, finding that noise isn't the main factor and that Adam's advantage resembles sign descent with momentum, especially at large batch sizes.

Contribution

It challenges the heavy-tailed noise hypothesis and links Adam's performance to sign descent with momentum, providing new insights into optimizer behavior.

Findings

01

Noise and stochasticity are not primary factors in the SGD-Adam gap.

02

Adam's performance improves with larger batch sizes, unlike SGD.

03

Adam's behavior at large batches resembles sign descent with momentum.
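
To make the comparison in finding 03 concrete, here is a minimal sketch in plain Python. It shows one common form of sign descent with momentum next to a textbook Adam step. The function names, hyperparameter defaults, and the choice to apply the sign to a gradient average are assumptions made for illustration; they are not taken from the paper or its repository. The usage lines check a standard identity: with beta1 = beta2 = 0 and eps = 0, a single Adam step on nonzero gradients moves each coordinate by exactly lr in the direction opposite the gradient's sign.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_momentum_step(w, grad, buf, lr=0.01, beta=0.9):
    """One step of sign descent with momentum (illustrative variant).

    The buffer is an exponential average of raw gradients; the update
    uses only the sign of that average, so every coordinate moves by lr.
    """
    buf = [beta * b + (1.0 - beta) * g for b, g in zip(buf, grad)]
    w = [wi - lr * sign(b) for wi, b in zip(w, buf)]
    return w, buf


def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step with bias correction; t starts at 1."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    w = [
        wi - lr * mh / (math.sqrt(vh) + eps)
        for wi, mh, vh in zip(w, m_hat, v_hat)
    ]
    return w, m, v


# Hypothetical gradient values, chosen only to exercise both functions.
grad = [0.5, -2.0, 3.0]
zeros = [0.0, 0.0, 0.0]

w_sign, _ = sign_momentum_step(zeros, grad, zeros, lr=0.1, beta=0.0)
w_adam, _, _ = adam_step(
    zeros, grad, zeros, zeros, t=1, lr=0.1, beta1=0.0, beta2=0.0, eps=0.0
)
# With all averaging switched off, the two updates coincide.
print(w_sign, w_adam)
```

With nonzero beta values the two updates no longer coincide exactly, so this sketch only illustrates the limiting case, not the paper's empirical claim about large batches.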

Abstract

The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by sampling has heavy tails. This suggests that Adam outperforms SGD because it uses a more robust gradient estimate. We evaluate this hypothesis by varying the batch size, up to the entire dataset, to control for stochasticity. We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, Adam performs better as the batch size increases, while…

Code & Models

Repositories

fkunstner/noise-sgd-adam-sign · PyTorch · Official

Videos

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Metaheuristic Optimization Algorithms Research · Advanced Neural Network Applications

Methods: Gradient Clipping · Adam · Stochastic Gradient Descent