Training Instabilities Induce Flatness Bias in Gradient Descent
Lawrence Wang, Stephen J. Roberts

TL;DR
This paper reveals that training instabilities in gradient descent can lead to flatter minima and better generalization in deep networks, challenging traditional stability-based views.
Contribution
It introduces the Rotational Polarity of Eigenvectors (RPE) as a geometric mechanism by which instabilities promote flatter minima, extending the theory to stochastic GD and Adam.
Findings
Instabilities induce a bias toward flatter regions of the loss landscape.
Rotational Polarity of Eigenvectors (RPE), the rotation of the leading Hessian eigenvectors during training instabilities, is the geometric mechanism behind this bias.
Restoring instabilities in Adam improves generalization.
Abstract
Classical analyses of gradient descent (GD) define a stability threshold based on the largest eigenvalue of the loss Hessian, often termed sharpness. When the learning rate lies below this threshold, training is stable and the loss decreases monotonically. Yet, modern deep networks often achieve their best performance beyond this regime. We demonstrate that such instabilities induce an implicit bias in GD, driving parameters toward flatter regions of the loss landscape and thereby improving generalization. The key mechanism is the Rotational Polarity of Eigenvectors (RPE), a geometric phenomenon in which the leading eigenvectors of the Hessian rotate during training instabilities. These rotations, which increase with learning rates, promote exploration and provably lead to flatter minima. This theoretical framework extends to stochastic GD, where instability-driven flattening…
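The classical stability threshold mentioned in the abstract can be illustrated on a toy quadratic loss. The sketch below is not taken from the paper: it assumes the loss L(x) = 0.5 xᵀHx with a fixed Hessian H, for which the standard threshold is 2 / λ_max, and the names `sharpness` and `run_gd` are hypothetical helpers introduced only for this illustration. Because the Hessian of a quadratic is constant, this example shows only the stable and unstable regimes; it cannot reproduce the eigenvector rotations or the flattening effect the paper attributes to instabilities in deep networks.

```python
import numpy as np

def sharpness(hessian):
    """Largest eigenvalue of a symmetric Hessian (the 'sharpness')."""
    return float(np.linalg.eigvalsh(hessian)[-1])

def run_gd(hessian, x0, lr, steps=50):
    """Plain gradient descent on L(x) = 0.5 * x^T H x; returns the loss per step."""
    x = np.asarray(x0, dtype=float)
    losses = [0.5 * x @ hessian @ x]
    for _ in range(steps):
        x = x - lr * (hessian @ x)  # gradient of the quadratic is H x
        losses.append(0.5 * x @ hessian @ x)
    return np.array(losses)

# Toy Hessian (an assumption for illustration): eigenvalues 4 and 1.
H = np.diag([4.0, 1.0])
threshold = 2.0 / sharpness(H)  # classical stability threshold 2 / lambda_max

stable = run_gd(H, [1.0, 1.0], lr=0.9 * threshold)    # below the threshold
unstable = run_gd(H, [1.0, 1.0], lr=1.1 * threshold)  # above the threshold

print("threshold:", threshold)
print("stable run monotone decreasing:", bool(np.all(np.diff(stable) < 0)))
print("unstable run final loss exceeds initial:", bool(unstable[-1] > unstable[0]))
```

Below the threshold every eigen-direction contracts, so the loss falls at each step. Above it, the component along the top eigenvector grows in magnitude and flips sign each step, which is the oscillatory instability the classical analysis rules out.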
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Generative Adversarial Networks and Image Synthesis
