Simplicity Bias via Global Convergence of Sharpness Minimization
Khashayar Gatmiry, Zhiyuan Li, Sashank J. Reddi, Stefanie Jegelka

TL;DR
This paper demonstrates that label noise SGD converges to simple, low-rank solutions in two-layer neural networks by minimizing sharpness, revealing a geometric link between flatness and model simplicity.
Contribution
It establishes that, for certain activations and a small enough step size, label noise SGD on a class of two-layer networks converges to rank-one solutions by minimizing sharpness on the zero-loss manifold. The analysis introduces a novel local geodesic convexity property.
Findings
Label noise SGD converges to rank-one solutions.
Sharpness minimization occurs on the zero-loss manifold.
A new geometric property of the Hessian trace, local geodesic convexity, is identified.
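To make the objects named in the findings concrete, the following is a minimal sketch of label noise SGD on a small two-layer network, together with the trace-of-Hessian sharpness measure. It is an illustration under stated assumptions, not the authors' implementation: the model form f(x) = sum_j phi(theta_j . x), the tanh activation, the ±sigma label perturbation, and all function names and default values are hypothetical choices, since this summary does not specify them.

```python
import numpy as np

# Assumed activation for illustration only; the paper's conditions on the
# activation are not stated in this summary.
def phi(z):
    return np.tanh(z)

def dphi(z):
    return 1.0 - np.tanh(z) ** 2

def predict(theta, X):
    """Two-layer network with unit output weights (an assumption).

    theta: (m, d) hidden-layer weights, X: (n, d) inputs. Returns shape (n,).
    """
    return phi(X @ theta.T).sum(axis=1)

def gauss_newton_trace(theta, X):
    """Sharpness proxy: sum_i ||grad_theta f(x_i)||^2.

    For the loss 0.5 * sum_i (f(x_i) - y_i)^2 this equals the trace of the
    Hessian at any point where every residual is zero.
    """
    per_sample = (dphi(X @ theta.T) ** 2).sum(axis=1) * (X ** 2).sum(axis=1)
    return float(per_sample.sum())

def label_noise_sgd(theta, X, y, lr=0.01, sigma=0.1, steps=1000, seed=0):
    """SGD in which the sampled label is perturbed by +/- sigma at each step."""
    rng = np.random.default_rng(seed)
    theta = theta.copy()  # leave the caller's array untouched
    n = X.shape[0]
    for _ in range(steps):
        i = rng.integers(n)
        noisy_y = y[i] + sigma * rng.choice([-1.0, 1.0])
        z = theta @ X[i]                      # pre-activations, shape (m,)
        residual = phi(z).sum() - noisy_y
        # Gradient of 0.5 * residual^2 with respect to theta, shape (m, d).
        grad = residual * dphi(z)[:, None] * X[i][None, :]
        theta -= lr * grad
    return theta
```

One way to use the sketch is to generate labels from a known set of weights so that a zero-loss solution exists, run `label_noise_sgd` from a random start, and track `gauss_newton_trace` along the trajectory. Whether the trace decreases in a given run depends on the step size, noise level, and activation chosen.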
Abstract
The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are 'simple', the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a…
Taxonomy
Topics: Multi-Criteria Decision Making · Optimization and Variational Analysis · Risk and Portfolio Optimization
Methods: Stochastic Gradient Descent
