On Empirical Comparisons of Optimizers for Deep Learning

Dami Choi; Christopher J. Shallue; Zachary Nado; Jaehoon Lee; Chris J. Maddison; George E. Dahl

arXiv:1910.05446 · cs.LG · June 17, 2020 · 185 cites

TL;DR

This paper shows that the hyperparameter tuning protocol, and the search space in particular, strongly influences the outcome of optimizer comparisons in deep learning. When properly tuned, more general optimizers such as Adam never underperform the optimizers they can approximate, such as momentum.

Contribution

The paper demonstrates that the hyperparameter search space strongly shapes optimizer comparisons, and that inclusion relationships between optimizers (a more general optimizer can approximate a more specific one) matter in practice.

Findings

01

Adaptive gradient methods never underperform momentum or gradient descent when hyperparameters are properly tuned.

02

Changing hyperparameter search spaces can invert optimizer rankings.

03

Proper tuning and search space design are crucial for fair optimizer benchmarking.

Abstract

Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between…
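
The inclusion relationship mentioned in the abstract (Adam being able to approximate momentum) can be illustrated with a minimal sketch. It assumes a toy one-dimensional objective f(x) = 0.5 * x**2, a very large Adam epsilon, and a momentum learning rate schedule chosen to absorb Adam's bias correction. The function names, the objective, and all numeric values are illustrative assumptions, not code or settings from the paper.

```python
import math


def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * x**2 (assumed for illustration).
    return x


def momentum(x0, lr_schedule, gamma, steps):
    # Heavy-ball momentum: m_t = gamma * m_{t-1} + g_t, x_t = x_{t-1} - lr_t * m_t.
    x, m = x0, 0.0
    for t in range(1, steps + 1):
        m = gamma * m + grad(x)
        x -= lr_schedule(t) * m
    return x


def adam(x0, alpha, beta1, beta2, eps, steps):
    # Standard Adam update with bias correction.
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical settings: eps is so large that sqrt(v_hat) is negligible,
# so Adam's step is roughly (alpha / eps) * m_hat.
beta1, beta2 = 0.9, 0.999
eps = 1e8
alpha = 1e7  # alpha / eps = 0.1


def matched_lr(t):
    # Momentum learning rate that mirrors Adam's (1 - beta1) factor
    # and its first-moment bias correction at step t.
    return (alpha / eps) * (1 - beta1) / (1 - beta1 ** t)


x_mom = momentum(1.0, matched_lr, beta1, steps=50)
x_adam = adam(1.0, alpha, beta1, beta2, eps, steps=50)
print(abs(x_adam - x_mom))  # a very small difference is expected
```

In this sketch the two trajectories agree closely only because epsilon is treated as a tunable hyperparameter; fixing epsilon at a small default would remove the momentum-like regime from Adam's search space, which is the kind of restriction the abstract describes as breaking the inclusion.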

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms

Methods: Adam