Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
Yuanzhi Li, Colin Wei, Tengyu Ma

TL;DR
This paper explains why a large initial learning rate, once annealed, yields better generalization than a small learning rate used from the start: the learning rate changes the order in which a network learns different types of patterns. The effect is also demonstrated on image classification.
Contribution
It provides a theoretical and empirical analysis showing that a large initial learning rate promotes better generalization by influencing the order in which a neural network learns different types of patterns.
Findings
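A large initial learning rate improves generalization soon after the learning rate is annealed.
A small learning rate makes the model first memorize easy-to-generalize, hard-to-fit patterns, so it generalizes worse on hard-to-generalize, easier-to-fit patterns.
Adding a small patch to images in an image classification task exposes the difference in learning dynamics between small and large initial learning rates.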
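The patch idea in the last finding can be sketched in a few lines. This is an illustration only: the source does not give the patch's size, position, or values, so `add_patch`, `class_patch`, the top-left placement, and the label-encoding block are all hypothetical choices, and images are plain nested lists rather than real image tensors.

```python
def class_patch(label, size=2):
    """Hypothetical patch: a size x size block whose value encodes the label.

    This stands in for any artificial pattern that is trivially easy to fit.
    """
    return [[float(label)] * size for _ in range(size)]


def add_patch(image, patch, top=0, left=0):
    """Return a copy of `image` (a list of rows) with `patch` written at (top, left)."""
    height, width = len(image), len(image[0])
    patch_h, patch_w = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + patch_h > height or left + patch_w > width:
        raise ValueError("patch does not fit inside the image")
    patched = [row[:] for row in image]  # copy rows so the input stays untouched
    for i in range(patch_h):
        for j in range(patch_w):
            patched[top + i][left + j] = patch[i][j]
    return patched


if __name__ == "__main__":
    blank = [[0.0] * 4 for _ in range(4)]  # toy 4x4 grayscale image
    for row in add_patch(blank, class_patch(label=3)):
        print(row)
```

Training one model with a small learning rate and another with a large, later annealed, learning rate on such patched data is what lets the two learning orders be compared.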
Abstract
Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This…
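The two training regimes compared in the abstract differ only in their learning rate schedule. Below is a minimal sketch of both, assuming a single step-wise annealing event. The function names and the specific values (0.1, 0.01, annealing at step 60, decay factor 0.1) are illustrative assumptions, not settings taken from the paper.

```python
def annealed_lr(step, initial_lr=0.1, anneal_at=60, factor=0.1):
    """Large initial learning rate, multiplied by `factor` from `anneal_at` onward.

    All default values are illustrative assumptions.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return initial_lr * factor if step >= anneal_at else initial_lr


def constant_lr(step, lr=0.01):
    """Small learning rate used from the start (value is an assumption)."""
    return lr


def sgd_step(weight, grad, lr):
    """One plain SGD update on a scalar parameter."""
    return weight - lr * grad


if __name__ == "__main__":
    # Compare the two schedules around the annealing point.
    for step in (0, 59, 60, 100):
        print(step, annealed_lr(step), constant_lr(step))
```

After the annealing step both schedules use roughly the same learning rate, so any difference in final generalization comes from what each model learned beforehand.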
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM
