Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Lawrence Wang; Stephen J. Roberts

arXiv:2412.17613·cs.LG·December 24, 2024

TL;DR

This paper shows that gradient descent instabilities caused by large learning rates can lead to better generalization by exploring flatter regions of the loss landscape through Hessian eigenvector rotations.

Contribution

The paper shows that instabilities in gradient descent rotate the Hessian eigenvectors, which the authors conjecture promotes exploration of flatter, more generalizable regions of the loss landscape.

Findings

01

Large learning rates induce Hessian eigenvector rotations.

02

Instabilities lead to exploration of flatter loss landscape regions.

03

Gradient descent with large learning rates improves generalization.
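One way to make the notion of "eigenvector rotation" in the findings above concrete is to compare the top Hessian eigenvector at two points in training. The sketch below is illustrative only: the alignment metric (absolute cosine similarity) and the helper names `top_eigenvector`, `eigenvector_alignment`, and `toy_hessian` are assumptions made for this example, not the measure or code used by the authors, and the 2x2 matrices stand in for real loss Hessians.

```python
import numpy as np

def top_eigenvector(hessian):
    """Return the unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(hessian)  # eigenvalues come back in ascending order
    return eigenvectors[:, -1]

def eigenvector_alignment(hessian_a, hessian_b):
    """Absolute cosine similarity between the two top eigenvectors (1.0 means no rotation)."""
    u = top_eigenvector(hessian_a)
    v = top_eigenvector(hessian_b)
    return float(abs(u @ v))  # abs() because an eigenvector's sign is arbitrary

def toy_hessian(angle, eigenvalues=(1.0, 10.0)):
    """Hypothetical 2x2 Hessian: fixed eigenvalues, eigenvectors rotated by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    return rotation @ np.diag(eigenvalues) @ rotation.T

before = toy_hessian(0.0)
after = toy_hessian(np.pi / 6)  # same sharpness (10.0), eigenvectors rotated by 30 degrees
alignment = eigenvector_alignment(before, after)
print(f"top-eigenvector alignment: {alignment:.3f}")
```

Both toy Hessians have the same largest eigenvalue, so a sharpness measurement alone would not distinguish them; only the alignment score registers the change in orientation.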

Abstract

Traditional analyses of gradient descent optimization show that when the largest eigenvalue of the loss Hessian, often referred to as the sharpness, is below a critical learning-rate threshold, training is 'stable' and the training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight is that, during these instabilities, the orientation of the Hessian eigenvectors rotates. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a…
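The classical stability condition mentioned in the abstract can be illustrated on a one-dimensional quadratic. The sketch below assumes the standard textbook result that gradient descent with learning rate `lr` on L(x) = 0.5 * sharpness * x^2 multiplies x by (1 - lr * sharpness) at every step, so the loss decreases monotonically only while sharpness < 2 / lr. The function name `run_gd` and the specific values are chosen for illustration and do not come from the paper; a quadratic also cannot show the paper's main point, since its Hessian never changes.

```python
import numpy as np

def run_gd(sharpness, lr, steps=30, x0=1.0):
    """Gradient descent on L(x) = 0.5 * sharpness * x**2; returns the loss after every step."""
    x = x0
    losses = [0.5 * sharpness * x**2]
    for _ in range(steps):
        grad = sharpness * x          # dL/dx for the quadratic
        x = x - lr * grad             # equivalent to x *= (1 - lr * sharpness)
        losses.append(0.5 * sharpness * x**2)
    return np.array(losses)

lr = 0.1
threshold = 2.0 / lr                  # classical stability threshold on the sharpness

stable = run_gd(sharpness=15.0, lr=lr)    # below the threshold: loss shrinks every step
unstable = run_gd(sharpness=25.0, lr=lr)  # above the threshold: loss grows every step

print(f"threshold on sharpness: {threshold:.1f}")
print(f"stable run monotone decreasing: {bool(np.all(np.diff(stable) < 0))}")
print(f"unstable run final/initial loss ratio: {unstable[-1] / unstable[0]:.3e}")
```

On this toy problem the unstable run simply diverges. The abstract's claim concerns deep networks, where the curvature changes along the trajectory and training can continue despite exceeding the threshold.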

Taxonomy

Topics: Model Reduction and Neural Networks