The Break-Even Point on Optimization Trajectories of Deep Neural Networks
Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof Geras

TL;DR
This paper investigates how the SGD hyperparameters used in the early phase of training deep neural networks shape the rest of the optimization trajectory, and argues for a break-even point beyond which SGD implicitly regularizes the curvature of the loss surface and the noise in the gradient.
Contribution
It introduces the concept of the break-even point on the optimization trajectory and demonstrates how the SGD hyperparameters used early in training, in particular the learning rate, affect the curvature of the loss surface and the gradient noise.
Findings
Large initial learning rates reduce gradient variance.
Large initial learning rates improve the conditioning of the covariance of gradients.
Low learning rates lead to poor loss surface conditioning.
Abstract
The early phase of training of deep neural networks is critical for their final performance. In this work, we study how the hyperparameters of stochastic gradient descent (SGD) used in the early phase of training affect the rest of the optimization trajectory. We argue for the existence of the "break-even" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD. In particular, we demonstrate on multiple classification tasks that using a large learning rate in the initial phase of training reduces the variance of the gradient, and improves the conditioning of the covariance of gradients. These effects are beneficial from the optimization perspective and become visible after the break-even point. Complementing prior work, we also show that using a low learning rate results in bad conditioning of the loss surface…
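The two quantities the abstract refers to, the variance of the gradient and the covariance of gradients, can be measured directly on a small model. The following is a minimal NumPy sketch under stated assumptions, not the authors' experimental setup: it uses a toy logistic-regression problem rather than a deep network, and the helper names (`per_example_grads`, `gradient_covariance`, `covariance_summary`, `run_sgd`) and the summary statistics (largest eigenvalue and trace of the covariance) are illustrative choices. A convex toy problem is not expected to reproduce the paper's findings; the sketch only shows how such quantities might be tracked along an SGD trajectory for different learning rates.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Per-example gradients of the logistic loss for a linear model."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities, shape (N,)
    return (p - y)[:, None] * X          # one gradient per row, shape (N, D)

def gradient_covariance(w, X, y):
    """Covariance of per-example gradients around the full-batch gradient."""
    G = per_example_grads(w, X, y)
    centered = G - G.mean(axis=0, keepdims=True)
    return centered.T @ centered / len(X)

def covariance_summary(K):
    """Largest eigenvalue and trace of K (illustrative summary statistics)."""
    eigvals = np.linalg.eigvalsh(K)      # ascending order for symmetric K
    return eigvals[-1], eigvals.sum()

def run_sgd(X, y, lr, steps=200, batch_size=8, seed=0):
    """Plain mini-batch SGD; records the summary of K after every step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        g = per_example_grads(w, X[idx], y[idx]).mean(axis=0)
        w -= lr * g
        history.append(covariance_summary(gradient_covariance(w, X, y)))
    return w, np.array(history)

# Synthetic data (an assumption for illustration only).
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(256, 5))
y = (X @ data_rng.normal(size=5) > 0).astype(float)

for lr in (0.01, 1.0):
    _, hist = run_sgd(X, y, lr)
    print(f"lr={lr}: peak lambda_max(K)={hist[:, 0].max():.4f}, "
          f"final Tr(K)={hist[-1, 1]:.4f}")
```

Here the trace of the covariance matrix serves as a scalar measure of gradient variance, and the largest eigenvalue indicates how strongly that variance concentrates in a single direction. Other conditioning measures could be substituted in `covariance_summary` without changing the rest of the sketch.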
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
Methods: Batch Normalization · Stochastic Gradient Descent
