On Empirical Comparisons of Optimizers for Deep Learning
Dami Choi, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, George E. Dahl

TL;DR
This paper shows that the hyperparameter tuning protocol critically influences optimizer comparisons in deep learning, and that adaptive methods such as Adam never underperform momentum or gradient descent when their hyperparameters are tuned carefully enough.
Contribution
It demonstrates the importance of hyperparameter search spaces in optimizer comparisons and highlights that inclusion relationships between optimizers are practically significant.
Findings
Adaptive gradient methods never underperform momentum or gradient descent when hyperparameters are properly tuned.
Changing hyperparameter search spaces can invert optimizer rankings.
Proper tuning and search space design are crucial for fair optimizer benchmarking.
Abstract
Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between…
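The inclusion relationship mentioned in the abstract (Adam can approximate momentum) can be illustrated with a small sketch. It is not the authors' code. It assumes a toy quadratic loss, a heavy-ball form of momentum, and an Adam variant with bias correction omitted for simplicity; the function names and constants are hypothetical. When Adam's epsilon is very large, the denominator sqrt(v) + eps is dominated by eps, so the update reduces to a rescaled momentum step.

```python
import math

# Toy quadratic loss 0.5 * sum(c_i * theta_i**2); chosen only for illustration.
CURVATURE = [1.0, 10.0]


def grad(theta):
    """Gradient of the toy quadratic loss."""
    return [c * t for c, t in zip(CURVATURE, theta)]


def run_momentum(theta0, lr, mu, steps):
    """Heavy-ball momentum: buf <- mu * buf + g; theta <- theta - lr * buf."""
    theta = list(theta0)
    buf = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        buf = [mu * b + gi for b, gi in zip(buf, g)]
        theta = [t - lr * b for t, b in zip(theta, buf)]
    return theta


def run_adam_no_bias_correction(theta0, alpha, beta1, beta2, eps, steps):
    """Adam with bias correction omitted (a simplifying assumption)."""
    theta = list(theta0)
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        theta = [
            t - alpha * mi / (math.sqrt(vi) + eps)
            for t, mi, vi in zip(theta, m, v)
        ]
    return theta


theta0 = [1.0, -1.0]
lr, mu, steps = 0.01, 0.9, 50

# With a huge eps, Adam's step is about (alpha / eps) * m, and m = (1 - mu) * buf,
# so alpha = lr * eps / (1 - mu) matches the momentum step lr * buf.
eps = 1e8
alpha = lr * eps / (1 - mu)

theta_momentum = run_momentum(theta0, lr, mu, steps)
theta_adam = run_adam_no_bias_correction(theta0, alpha, mu, 0.999, eps, steps)
max_gap = max(abs(a - b) for a, b in zip(theta_momentum, theta_adam))
print("max parameter gap:", max_gap)
```

The sketch also shows why restricting the search space can break the inclusion: if epsilon is fixed at a small default and never tuned, the momentum-like regime above is unreachable.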
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms
Methods: Adam
