Learning-Rate-Free Learning by D-Adaptation

Aaron Defazio; Konstantin Mishchenko

arXiv:2301.07733 · cs.LG · July 11, 2023 · 5 cites

TL;DR

D-Adaptation is a hyper-parameter-free method for setting the learning rate. It asymptotically achieves the optimal convergence rate for convex Lipschitz functions without extra function or gradient evaluations per step, and it matches hand-tuned learning rates across diverse machine learning tasks.

Contribution

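This paper presents the first hyper-parameter-free learning-rate method for minimizing convex Lipschitz functions whose convergence rate carries no additional multiplicative log factors. It needs no back-tracking, no line searches, and no extra function or gradient evaluations per step.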

Findings

01. Automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems

02. Asymptotically achieves the optimal convergence rate without back-tracking or line searches

03. Effective for SGD and Adam variants on large-scale vision and language problems

Abstract

D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.
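
The idea in the abstract can be illustrated with a small sketch: dual averaging in which the step-size scale `d` is a running lower bound on the unknown distance D = ||x0 - x*||, computed only from subgradients the method already has. This page does not reproduce the paper's algorithm, so everything below is an assumption to be checked against the paper and the official repository: the exact update formulas, the default `d0=1e-6`, the choice of the first subgradient norm as the scale `G`, and all function and variable names. It is not the reference implementation.

```python
import math

def dadapt_dual_averaging(subgrad, x0, n_steps, d0=1e-6):
    """Illustrative sketch (not the official code) of dual averaging
    with a D-Adaptation-style distance estimate."""
    dim = len(x0)
    x = list(x0)
    s = [0.0] * dim        # d-weighted sum of subgradients
    d = d0                 # running lower bound on D = ||x0 - x*||
    G = None               # gradient scale, set from the first subgradient
    gamma = 0.0            # dual-averaging step size, set on the first step
    grad_sq_sum = 0.0      # sum of ||g_i||^2
    acc = 0.0              # sum of gamma_i * d_i^2 * ||g_i||^2
    weight_total = 0.0
    x_weighted = [0.0] * dim
    d_history = [d]
    for _ in range(n_steps):
        g = subgrad(x)     # the only oracle call in each step
        g_sq = sum(gi * gi for gi in g)
        if G is None:
            # Assumption: use ||g_0|| as the scale; fall back to 1.0 if it is zero.
            G = math.sqrt(g_sq) or 1.0
            gamma = 1.0 / G
        # Accumulate the d-weighted average of the iterates.
        weight_total += d
        x_weighted = [a + d * xi for a, xi in zip(x_weighted, x)]
        # Use the current gamma and d before they are updated.
        acc += gamma * d * d * g_sq
        s = [si + d * gi for si, gi in zip(s, g)]
        grad_sq_sum += g_sq
        gamma = 1.0 / math.sqrt(G * G + grad_sq_sum)
        s_norm = math.sqrt(sum(si * si for si in s))
        if s_norm > 0.0:
            d_hat = (gamma * s_norm ** 2 - acc) / (2.0 * s_norm)
            d = max(d, d_hat)   # the estimate never decreases
        x = [x0i - gamma * si for x0i, si in zip(x0, s)]
        d_history.append(d)
    x_hat = [a / weight_total for a in x_weighted]
    return x_hat, d_history

# Hypothetical test problem: f(x) = ||x - target||_1, which is convex and Lipschitz.
target = [3.0, -2.0]

def l1_subgrad(x):
    # A valid subgradient of the L1 distance: sign per coordinate, 0 at a kink.
    return [0.0 if xi == ti else math.copysign(1.0, xi - ti)
            for xi, ti in zip(x, target)]

x_hat, d_hist = dadapt_dual_averaging(l1_subgrad, [0.0, 0.0], 2000)
```

In this sketch `d` starts tiny and only grows when the accumulated subgradient information justifies a larger bound, so no learning rate has to be supplied. Each step makes a single subgradient call, with no line search and no extra function evaluations.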

Code & Models

Repositories

facebookresearch/dadaptation (PyTorch, official)

Videos

Learning-Rate-Free Learning by D-Adaptation · SlidesLive

Taxonomy

Topics: Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent · Adam