TL;DR
MARTHE is an online algorithm that adapts the learning rate schedule using cheap approximations of the hypergradient (the gradient of a validation error with respect to the schedule), producing more stable schedules and models that generalize better.
Contribution
The paper introduces MARTHE, a novel hypergradient-based method that adapts learning rates online. It interpolates between RTHO and HD, aiming for more stable schedules and better generalization.
Findings
MARTHE produces more stable learning rate schedules.
Models trained with MARTHE generalize better.
MARTHE interpolates between two existing hypergradient-based techniques, RTHO (Franceschi et al., 2017) and HD (Baydin et al., 2018).
Abstract
We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate schedule -- the hypergradient. Based on this, we introduce MARTHE, a novel online algorithm guided by cheap approximations of the hypergradient that uses past information from the optimization trajectory to simulate future behaviour. It interpolates between two recent techniques, RTHO (Franceschi et al., 2017) and HD (Baydin et al., 2018), and is able to produce learning rate schedules that are more stable, leading to models that generalize better.
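To make the interpolation idea concrete, here is a minimal sketch on a one-dimensional quadratic. It is not the authors' implementation. It assumes plain gradient descent and a discount factor `mu` in [0, 1] that damps the accumulated derivative of the weights with respect to the learning rate: `mu = 0` keeps only the latest step (HD-like, but measured on the validation loss) and `mu = 1` keeps the full accumulation (RTHO-like). The function name, the toy losses, and all default values are hypothetical.

```python
def marthe_like_schedule(mu, beta, eta0=0.01, steps=20,
                         w0=5.0, train_opt=0.0, val_opt=0.5):
    """Adapt a scalar learning rate online with a discounted hypergradient.

    Toy setting (an assumption for illustration):
      training loss   L(w) = 0.5 * (w - train_opt)**2
      validation loss E(w) = 0.5 * (w - val_opt)**2
      update map      Phi(w, eta) = w - eta * L'(w)
    """
    w, eta, z = w0, eta0, 0.0   # z approximates d w_t / d eta
    etas = []
    for _ in range(steps):
        g = w - train_opt       # L'(w)
        a = 1.0 - eta           # dPhi/dw = 1 - eta * L''(w), with L'' = 1
        b = -g                  # dPhi/deta
        z = mu * a * z + b      # discounted accumulation of past steps
        w = w - eta * g         # ordinary gradient step on the training loss
        hypergrad = (w - val_opt) * z           # E'(w) * dw/deta
        eta = max(eta - beta * hypergrad, 0.0)  # keep the rate non-negative
        etas.append(eta)
    val_loss = 0.5 * (w - val_opt) ** 2
    return etas, val_loss


if __name__ == "__main__":
    for mu in (0.0, 0.5, 1.0):
        etas, val_loss = marthe_like_schedule(mu=mu, beta=0.01)
        print(f"mu={mu}: final eta={etas[-1]:.4f}, val loss={val_loss:.4f}")
```

Setting `beta=0.0` disables adaptation and recovers a constant learning rate, which gives a simple baseline to compare against.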
