What do near-optimal learning rate schedules look like?
Hiroki Naganuma, Atish Agarwala, Priya Kasimbeg, George E. Dahl

TL;DR
This paper introduces a search method for identifying near-optimal learning rate schedules for neural network training. It finds that warmup and decay are key features of good schedules and that common schedule families are suboptimal on the tested workloads.
Contribution
The authors develop a schedule search procedure that isolates shape from base learning rate, providing the most comprehensive analysis of near-optimal schedules to date.
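The idea of separating schedule shape from base learning rate can be illustrated with a minimal sketch. This is not the authors' procedure: the function names (`warmup_cosine_shape`, `learning_rate`, `final_loss`, `score_shape`), the warmup-plus-cosine shape family, and the toy quadratic objective are all assumptions chosen for illustration. The sketch treats a schedule as a base learning rate multiplied by a shape function with peak value 1, and scores each shape by its best result over a grid of base learning rates, so that two shapes are not compared at a base learning rate that happens to favor one of them.

```python
import math


def warmup_cosine_shape(frac, warmup_frac=0.1, final_scale=0.0):
    """Hypothetical shape: maps training progress in [0, 1] to a multiplier.

    Linear warmup from 0 to 1, then cosine decay from 1 to final_scale.
    """
    if warmup_frac > 0 and frac < warmup_frac:
        return frac / warmup_frac
    progress = (frac - warmup_frac) / max(1.0 - warmup_frac, 1e-12)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_scale + (1.0 - final_scale) * cosine


def learning_rate(step, total_steps, base_lr, shape_fn):
    """Schedule = base learning rate times shape evaluated at training progress."""
    return base_lr * shape_fn(step / total_steps)


def final_loss(shape_fn, base_lr, total_steps=50, w0=1.0):
    """Toy stand-in for a training run: gradient descent on 0.5 * w**2."""
    w = w0
    for step in range(total_steps):
        lr = learning_rate(step, total_steps, base_lr, shape_fn)
        w -= lr * w  # the gradient of 0.5 * w**2 is w
    return 0.5 * w * w


def score_shape(shape_fn, base_lrs):
    """Score a shape by its best final loss over a grid of base learning rates."""
    return min(final_loss(shape_fn, b) for b in base_lrs)


if __name__ == "__main__":
    grid = [0.01, 0.03, 0.1, 0.3]
    for warmup in (0.0, 0.1, 0.3):
        shape = lambda f, w=warmup: warmup_cosine_shape(f, warmup_frac=w)
        print(f"warmup_frac={warmup}: best final loss {score_shape(shape, grid):.3e}")
```

In a real search, `final_loss` would be replaced by an actual training run, and the shape parameters would be sampled from whichever parameterized schedule family is being studied.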
Findings
Warmup and decay are robust features of good schedules.
Common schedule families are suboptimal for tested workloads.
Weight decay significantly influences optimal schedule shape.
Abstract
A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of the training process, but beyond having some kind of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure to find the best shapes within a parameterized schedule family. Our approach factors out the schedule shape from the base learning rate, which otherwise would dominate cross-schedule comparisons. We applied our search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on Wikitext103. We showed that our search procedure indeed generally found near-optimal schedules. We found that warmup and decay are…
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
