Automatic Gradient Descent: Deep Learning without Hyperparameters
Jeremy Bernstein, Chris Mingard, Kevin Huang, Navid Azizan, and Yisong Yue

TL;DR
This paper introduces automatic gradient descent, a first-order optimizer with no hyperparameters that is derived from the neural architecture itself. The derivation extends mirror descent theory to non-convex composite objectives, and the resulting optimizer trains deep networks out of the box at ImageNet scale.
Contribution
The paper develops a new optimization framework that explicitly uses the neural architecture (depth, layer widths, and topology), and from it derives an automatic, hyperparameter-free gradient descent method for deep learning.
Findings
Successfully trains deep networks without hyperparameters
Applies to fully-connected and convolutional networks
Operates efficiently at ImageNet scale
Abstract
The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology. Existing optimisation frameworks neglect this information in favour of implicit architectural information (e.g. second-order methods) or architecture-agnostic distance functions (e.g. mirror descent). Meanwhile, the most popular optimiser in practice, Adam, is based on heuristics. This paper builds a new framework for deriving optimisation algorithms that explicitly leverage neural architecture. The theory extends mirror descent to non-convex composite objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture. Working through the details for deep fully-connected networks yields automatic gradient descent: a first-order optimiser without any hyperparameters.…
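As background for the mirror descent and Bregman divergence terms used in the abstract, the following is a minimal sketch of the classical, textbook mirror descent step on the probability simplex. It is not the paper's automatic gradient descent. The function names, the toy linear objective, and the step size of 0.5 are assumptions chosen for illustration. With the negative-entropy mirror map, the Bregman divergence reduces to the KL divergence and the update becomes the multiplicative "exponentiated gradient" rule.

```python
import math


def bregman_divergence(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad_psi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_psi(y), x, y))
    return psi(x) - psi(y) - inner


def neg_entropy(x):
    """Negative entropy mirror map; requires strictly positive entries."""
    return sum(xi * math.log(xi) for xi in x)


def grad_neg_entropy(x):
    """Gradient of the negative entropy: log(x_i) + 1 per coordinate."""
    return [math.log(xi) + 1.0 for xi in x]


def mirror_descent_step(x, grad, step_size):
    """One mirror descent step on the simplex with the negative-entropy map.

    The closed form is a multiplicative update followed by renormalisation.
    """
    unnormalised = [xi * math.exp(-step_size * gi) for xi, gi in zip(x, grad)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Toy objective (assumption): f(x) = <c, x>, minimised on the simplex
# by putting all mass on the smallest entry of c.
c = [3.0, 1.0, 2.0]
x = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
for _ in range(200):
    x = mirror_descent_step(x, c, step_size=0.5)  # gradient of f is c

# For probability vectors, the negative-entropy Bregman divergence is KL.
p, q = [0.5, 0.5], [0.25, 0.75]
kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
bregman_pq = bregman_divergence(neg_entropy, grad_neg_entropy, p, q)
```

In this classical setting the step size is still chosen by hand, and the distance function knows nothing about network architecture. These are the two gaps the abstract describes the paper as addressing.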
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
Methods: Adam
