Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Yuanzhi Li; Colin Wei; Tengyu Ma

arXiv:1907.04595·cs.LG·April 28, 2020·124 cites

TL;DR

This paper explains why training with a large initial learning rate generalizes better after annealing than training with a small learning rate from the start: the order in which different types of patterns are learned matters. It also demonstrates the effect on image classification.

Contribution

It provides a theoretical and empirical analysis showing large initial learning rates promote better generalization by influencing pattern learning order in neural networks.

Findings

01

Large initial learning rate improves generalization after annealing.

02

With a small learning rate, the model first memorizes easy-to-generalize, hard-to-fit patterns, so it generalizes worse on hard-to-generalize, easier-to-fit patterns.

03

Adding a patch to images reveals differences in learning dynamics.

Abstract

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two-layer network trained with a large initial learning rate and annealing generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This…
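
The two training regimes the abstract compares can be sketched as learning rate schedules. This is a minimal illustration only, not the authors' code: the function names, the initial rate of 0.1, the annealing epoch of 30, and the decay factor of 0.1 are all hypothetical values chosen for the example.

```python
def annealed_lr(epoch, initial_lr=0.1, anneal_epoch=30, factor=0.1):
    """Large initial learning rate, scaled down once at `anneal_epoch`.

    All default values are hypothetical; the source does not state them.
    """
    if epoch < anneal_epoch:
        return initial_lr
    return initial_lr * factor


def constant_small_lr(epoch, lr=0.01):
    """Small learning rate used from the start (hypothetical value)."""
    return lr


def schedules(num_epochs=60):
    """Return (epoch, annealed rate, constant rate) for each epoch."""
    return [(e, annealed_lr(e), constant_small_lr(e)) for e in range(num_epochs)]


if __name__ == "__main__":
    # Print the rates around the annealing point to compare the two regimes.
    for epoch, large_then_annealed, small in schedules()[28:32]:
        print(epoch, large_then_annealed, small)
```

With these assumed values, both schedules use roughly the same rate after the annealing epoch, so any difference in generalization would come from the earlier phase of training, which is the comparison the abstract describes.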

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM