Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
Benjamin Gess, Daniel Heydecker

TL;DR
This paper uses large-deviations theory to analyze large loss spikes in stochastic gradient descent for a shallow, fully connected network in the NTK scaling, showing that spikes drive escape from sharp minima and favour flatter solutions.
Contribution
It introduces a large-deviations framework for understanding SGD spikes, distinguishing inflationary and deflationary regimes, and links spikes to minima escape and curvature reduction.
Findings
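Large spikes are at least polynomially likely in SGD, in both the inflationary and deflationary regimes.
Spikes are the dominant mechanism by which sharp minima are escaped and curvature is reduced.
Corresponding results hold for certain ReLU networks, with implications for curriculum learning.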
Abstract
Large loss spikes in stochastic gradient descent are studied through a rigorous large-deviations analysis for a shallow, fully connected network in the NTK scaling. In contrast to full-batch gradient descent, the catapult phase is shown to split into inflationary and deflationary regimes, determined by an explicit log-drift criterion. In both cases, large spikes are shown to be at least polynomially likely. In addition, these spikes are shown to be the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favouring flatter solutions. Corresponding results are also obtained for certain ReLU networks, and implications for curriculum learning are derived.
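To make "log-drift" and "polynomially likely" concrete, here is a minimal toy sketch. It is not the paper's model or its criterion. It assumes, purely for illustration, that a loss proxy is multiplied at each step by a factor drawn uniformly from a finite list, and that the regime can be read off from the sign of the mean log-multiplier. The function names `log_drift`, `regime`, and `spike_tail_exponent` are hypothetical.

```python
import math


def log_drift(multipliers):
    """Mean of log|a| over equally likely multipliers a (toy log-drift)."""
    return sum(math.log(abs(a)) for a in multipliers) / len(multipliers)


def regime(multipliers):
    """Label the toy recursion by the sign of its log-drift (an assumption)."""
    return "inflationary" if log_drift(multipliers) > 0 else "deflationary"


def spike_tail_exponent(multipliers):
    """Exponent alpha for the toy spike probability.

    A run of k draws of the largest multiplier a_max grows the proxy by
    M = a_max**k and has probability p_max**k = M**(-alpha), where
    alpha = -log(p_max) / log(a_max).
    """
    a_max = max(abs(a) for a in multipliers)
    if a_max <= 1:
        raise ValueError("no growth is possible unless some |a| exceeds 1")
    p_max = sum(1 for a in multipliers if abs(a) == a_max) / len(multipliers)
    return -math.log(p_max) / math.log(a_max)
```

With multipliers `[0.5, 1.8]` the log-drift is negative, so the proxy shrinks on typical paths, yet a growth factor of at least `M = 1.8**k` still occurs with probability at least `M**(-alpha)`, a power law in `M` rather than an exponentially small quantity. This is the sense in which a spike can be unlikely but only polynomially so in this toy setting.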
