MARTHE: Scheduling the Learning Rate Via Online Hypergradients

Michele Donini; Luca Franceschi; Massimiliano Pontil; Orchid Majumder; Paolo Frasconi

arXiv:1910.08525 · cs.LG · May 19, 2020

TL;DR

MARTHE is an online algorithm that optimizes learning rate schedules using hypergradients, leading to more stable training and improved generalization in machine learning models.

Contribution

The paper introduces MARTHE, a novel hypergradient-based method that adapts the learning rate online during training. It interpolates between RTHO and HD to obtain more stable schedules and better generalization.

Findings

01 MARTHE produces more stable learning rate schedules.

02 Models trained with MARTHE generalize better.

03 The method interpolates between two existing hypergradient-based techniques, RTHO and HD.

Abstract

We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate schedule -- the hypergradient. Based on this, we introduce MARTHE, a novel online algorithm guided by cheap approximations of the hypergradient that uses past information from the optimization trajectory to simulate future behaviour. It interpolates between two recent techniques, RTHO (Franceschi et al., 2017) and HD (Baydin et al., 2018), and is able to produce learning rate schedules that are more stable, leading to models that generalize better.
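
To make the interpolation concrete, here is a minimal sketch of one way such an online hypergradient update can look for plain SGD on a toy quadratic. It is not the authors' implementation (see the repository listed under Code & Models for that). The function name `run_sketch`, the discount parameter `mu`, the hyper-learning rate `beta`, and the toy objective are all assumptions made for illustration, and the same quadratic stands in for both the training objective and the validation error. The vector `Z` tracks an approximate derivative of the weights with respect to the learning rate: with `mu = 0` only the most recent step is remembered (an HD-style update), while with `mu = 1` the full forward-mode recursion is kept (an RTHO-style update).

```python
import numpy as np

def run_sketch(mu, steps=50, eta0=0.01, beta=1e-3):
    """Toy online hypergradient schedule for SGD on f(w) = 0.5 * w^T H w.

    Assumption: f plays the role of both training loss and validation error.
    """
    H = np.diag([1.0, 10.0])      # Hessian of the toy quadratic (hypothetical)
    w = np.array([1.0, 1.0])      # model weights
    Z = np.zeros_like(w)          # running estimate of d w / d eta
    eta = eta0
    etas = [eta]
    for _ in range(steps):
        g = H @ w                 # gradient at the current weights
        # Discounted forward-mode recursion for SGD:
        #   Z <- mu * (I - eta * H) Z - g
        Z = mu * (Z - eta * (H @ Z)) - g
        w = w - eta * g           # SGD step
        # Approximate hypergradient: gradient of the error at the new weights
        # contracted with the sensitivity estimate Z.
        hypergrad = (H @ w) @ Z
        # Gradient step on the learning rate, projected onto eta >= 0.
        eta = max(eta - beta * hypergrad, 0.0)
        etas.append(eta)
    loss = 0.5 * w @ H @ w
    return etas, loss

if __name__ == "__main__":
    for mu in (0.0, 0.5, 1.0):
        etas, loss = run_sketch(mu)
        print(f"mu={mu}: final eta={etas[-1]:.4f}, final loss={loss:.2e}")
```

Intermediate values of `mu` trade off how much of the past trajectory influences each learning rate update. On a real model the Hessian is not formed explicitly; the product `H @ Z` would be computed as a Hessian-vector product with automatic differentiation.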

Code & Models

Repositories

awslabs/adatune (PyTorch, official)
