Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
Robin M. Schmidt, Frank Schneider, Philipp Hennig

TL;DR
This paper benchmarks fifteen popular deep learning optimizers across a range of tasks. It finds that optimizer performance varies greatly from task to task, and that trying several optimizers with default parameters works approximately as well as tuning a single one.
Contribution
It offers an extensive, standardized benchmark of popular optimizers, analyzing over 50,000 runs to provide evidence-backed heuristics and identify generally effective optimization strategies.
Findings
Optimizer performance varies greatly across tasks.
Evaluating multiple optimizers with default parameters works approximately as well as tuning a single optimizer.
Adam remains a consistently strong optimizer, with newer methods not outperforming it significantly.
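The second finding compares two ways of spending a tuning budget. The following is a minimal sketch of that comparison, not the paper's benchmark: the toy quadratic objective, the learning-rate grid, the step count, and the "default" learning rates are all assumptions chosen for illustration, and the three update rules are written out by hand rather than taken from any library.

```python
import math

# Hypothetical toy objective (not from the paper): an ill-conditioned quadratic
# f(w) = 0.5 * (1 * w0^2 + 25 * w1^2), minimized at w = (0, 0).
CURVATURES = (1.0, 25.0)
START = (1.0, 1.0)


def loss_and_grad(w):
    """Return the loss and its gradient at point w."""
    loss = 0.5 * sum(c * x * x for c, x in zip(CURVATURES, w))
    grad = [c * x for c, x in zip(CURVATURES, w)]
    return loss, grad


def run_sgd(lr, steps=100):
    """Plain gradient descent; returns the final loss."""
    w = list(START)
    for _ in range(steps):
        _, g = loss_and_grad(w)
        w = [x - lr * gi for x, gi in zip(w, g)]
    return loss_and_grad(w)[0]


def run_momentum(lr, beta=0.9, steps=100):
    """Gradient descent with heavy-ball momentum; returns the final loss."""
    w = list(START)
    v = [0.0] * len(w)
    for _ in range(steps):
        _, g = loss_and_grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [x - lr * vi for x, vi in zip(w, v)]
    return loss_and_grad(w)[0]


def run_adam(lr, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam with bias correction; returns the final loss."""
    w = list(START)
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        w = [x - lr * mh / (math.sqrt(vh) + eps)
             for x, mh, vh in zip(w, m_hat, v_hat)]
    return loss_and_grad(w)[0]


# Strategy A: several optimizers, each at an assumed "default" setting.
defaults = {
    "sgd": run_sgd(lr=0.01),
    "momentum": run_momentum(lr=0.01),
    "adam": run_adam(lr=0.001),
}

# Strategy B: one optimizer (plain SGD) tuned over a small learning-rate grid.
lr_grid = [0.001, 0.003, 0.01, 0.03, 0.1]
tuned_sgd = min(run_sgd(lr) for lr in lr_grid)

print("best of defaults:", min(defaults.values()))
print("tuned SGD:       ", tuned_sgd)
```

Strategy A spends three runs and strategy B spends five, so the budgets are not matched here; a fairer comparison would equalize them. Results on a two-dimensional quadratic say nothing about real networks. The sketch only shows the shape of the two strategies being compared.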
Abstract
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques
Methods: Adam
