Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities
Lawrence Wang, Stephen J. Roberts

TL;DR
This paper shows that gradient descent instabilities caused by large learning rates can lead to better generalization by exploring flatter regions of the loss landscape through Hessian eigenvector rotations.
Contribution
It reveals that instabilities in gradient descent induce Hessian eigenvector rotations, promoting exploration of flatter, more generalizable regions of the loss landscape.
Findings
Large learning rates induce Hessian eigenvector rotations.
Instabilities lead to exploration of flatter loss landscape regions.
Gradient descent with large learning rates improves generalization.
Abstract
Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian - often referred to as the sharpness - is below a critical learning-rate threshold, then training is 'stable' and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientations of the Hessian eigenvectors rotate. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a…
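The classical stability threshold mentioned in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss f(x) = 0.5 * sharpness * x^2, whose Hessian is the constant `sharpness`, and the names `run_gd` and `is_monotone_decreasing` are hypothetical helpers. For this loss, each gradient descent step multiplies x by (1 - lr * sharpness), so the loss decreases monotonically only when sharpness < 2 / lr.

```python
def run_gd(sharpness, lr, steps=20, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Returns the loss recorded before each update step.
    """
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * x * x)
        grad = sharpness * x          # f'(x) for the quadratic
        x = x - lr * grad             # equivalent to x *= (1 - lr * sharpness)
    return losses


def is_monotone_decreasing(losses):
    """True if every recorded loss is strictly smaller than the previous one."""
    return all(b < a for a, b in zip(losses, losses[1:]))


lr = 0.1
threshold = 2.0 / lr                  # classical stability threshold on sharpness

# Sharpness below the threshold: the loss shrinks at every step.
stable = run_gd(sharpness=15.0, lr=lr)

# Sharpness above the threshold: the iterates oscillate and the loss grows.
unstable = run_gd(sharpness=25.0, lr=lr)

print("threshold:", threshold)
print("stable run monotone:", is_monotone_decreasing(stable))
print("unstable run monotone:", is_monotone_decreasing(unstable))
```

On a pure quadratic the Hessian never changes, so an unstable run simply diverges. The behavior the abstract describes, where instabilities move the parameters toward flatter regions, depends on the curvature and eigenvectors changing along the trajectory, which this toy example does not capture.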
Taxonomy
Topics: Model Reduction and Neural Networks
