Learning-Rate-Free Learning by D-Adaptation
Aaron Defazio, Konstantin Mishchenko

TL;DR
D-Adaptation is a hyper-parameter-free method for setting the learning rate. It asymptotically achieves the optimal convergence rate for convex Lipschitz functions without extra function or gradient evaluations, and the authors demonstrate it across diverse machine learning tasks.
Contribution
This paper presents the first hyper-parameter-free learning rate method for minimizing convex Lipschitz functions that asymptotically matches the optimal convergence rate with no additional function or gradient evaluations per step and no multiplicative log factors in the rate.
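For context, the "optimal rate" here can be read against the textbook bound for the subgradient method with a fixed step size. The symbols below are not defined in this summary and are introduced as assumptions: $D = \|x_0 - x_*\|$ is the initial distance to a minimizer, $G$ is the Lipschitz constant, $n$ is the number of steps, $\gamma$ is the step size, and $\bar{x}_n$ is the average of the first $n$ iterates.

```latex
f(\bar{x}_n) - f(x_*) \le \frac{D^2}{2\gamma n} + \frac{\gamma G^2}{2},
\qquad
\gamma = \frac{D}{G\sqrt{n}}
\;\Longrightarrow\;
f(\bar{x}_n) - f(x_*) \le \frac{DG}{\sqrt{n}}.
```

The step size that attains this bound depends on $D$, which is normally unknown. As the name suggests, D-Adaptation targets this quantity by estimating it during optimization instead of requiring it as a tuned input.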
Findings
Automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems
Asymptotically achieves the optimal convergence rate without back-tracking or line searches
SGD and Adam variants are effective on large-scale vision and language problems
Abstract
D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.
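The following is a minimal, simplified sketch of how a D-Adaptation-style update could look in a dual-averaging form, written from our reading of the method and not taken from the authors' code. It maintains a running lower-bound estimate `d` of the distance from the starting point to the minimizer and uses it to scale the steps, with no line search and one subgradient evaluation per step. The function names, the toy objective (an L1 distance to a hypothetical `target`), the initial value `d0`, and the step count are all assumptions made for illustration; the paper and the official open-source implementation remain the reference for the exact algorithm.

```python
import math


def l1_subgradient(x, target):
    """A subgradient of f(x) = sum_i |x_i - target_i| (convex and Lipschitz)."""
    return [(xi > ti) - (xi < ti) for xi, ti in zip(x, target)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def d_adapt_dual_averaging(x0, target, steps=2000, d0=1e-6):
    """Simplified D-Adaptation-style dual averaging on the toy L1 objective.

    Returns the last iterate, the final estimate d, and the history of d.
    """
    x = list(x0)
    s = [0.0] * len(x0)   # weighted sum of subgradients
    d = d0                # current lower-bound estimate of ||x0 - x*||
    grad_sq_sum = 0.0     # running sum of ||g_i||^2
    gamma = None          # AdaGrad-norm style scaling, set from the first gradient
    correction = 0.0      # running sum of gamma_i * d_i^2 * ||g_i||^2
    d_history = [d]

    for _ in range(steps):
        g = l1_subgradient(x, target)
        g_sq = float(sum(c * c for c in g))
        if g_sq == 0.0:
            break  # zero subgradient: x is exactly at the minimizer
        if gamma is None:
            gamma = 1.0 / math.sqrt(g_sq)  # initial scaling from the first gradient

        # Accumulate with the scaling that produced the current iterate.
        correction += gamma * d * d * g_sq
        s = [si + d * gi for si, gi in zip(s, g)]
        grad_sq_sum += g_sq
        gamma = 1.0 / math.sqrt(grad_sq_sum)

        # Candidate lower bound on the distance; d is never allowed to shrink.
        s_norm = norm(s)
        if s_norm > 0.0:
            d_hat = (gamma * s_norm * s_norm - correction) / (2.0 * s_norm)
            d = max(d, d_hat)

        # Dual-averaging step taken from the starting point x0.
        x = [x0i - gamma * si for x0i, si in zip(x0, s)]
        d_history.append(d)

    return x, d, d_history


if __name__ == "__main__":
    start, goal = [5.0, -3.0], [1.0, 1.0]
    x_last, d_final, _ = d_adapt_dual_averaging(start, goal)
    true_distance = norm([a - b for a, b in zip(start, goal)])
    print("estimated d:", d_final, "true distance:", true_distance)
```

In this sketch the estimate starts from a tiny value and only ever increases, so the effective step size grows automatically as evidence accumulates. For a convex objective with exact subgradients, the candidate `d_hat` is a lower bound on the true distance, which is why taking the running maximum is safe.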
Taxonomy
Topics: Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent · Adam
