Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model
Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin

TL;DR
This paper introduces an exactly solvable model to analyze SGD's behavior, revealing that its preference for flatness or sharpness depends on data distribution and label noise anisotropy, clarifying conflicting prior evidence.
Contribution
The work provides an analytically solvable model that explains when SGD prefers flat or sharp minima based on label noise properties, connecting theory with empirical observations.
Findings
Within the model, SGD prefers a flat minimum if and only if the label noise is isotropic.
Under anisotropic label noise, SGD tends to converge to sharper minima.
The model's behavior is reproduced in MLP, RNN, and transformer architectures.
Abstract
A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, the SGD training has no a priori preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across…
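The flattening half of this claim can be illustrated with a much simpler toy than the paper's model. The sketch below is an assumption made for illustration only and is not the authors' construction: a scalar two-parameter model f(x) = u·w·x trained by single-sample SGD on labels y = a·x + noise. Every (u, w) with u·w = a is a minimum, the trace of the Hessian there is u² + w² (for unit-variance inputs), and the gradient-noise variance at a minimum is proportional to the same quantity, so in this toy "minimal gradient fluctuations" and "flatness" coincide. Because the label is scalar, the toy cannot exhibit the anisotropic, sharpening regime described in the abstract. All names (`run_label_noise_sgd`, `a`, `noise_std`) are hypothetical.

```python
import numpy as np

def run_label_noise_sgd(u0=3.0, w0=0.2, a=0.6, lr=0.05,
                        noise_std=0.5, steps=20000, seed=0):
    """Single-sample SGD on 0.5 * (u*w*x - y)^2 with noisy labels.

    Toy illustration only (not the paper's model). Starts on the
    manifold of minima u*w = a at an unbalanced, sharp point.
    """
    rng = np.random.default_rng(seed)
    u, w = u0, w0
    # |u^2 - w^2| measures how far the point is from the flattest
    # minimum on the manifold u*w = a, where |u| = |w|.
    imbalance = [abs(u * u - w * w)]
    for _ in range(steps):
        x = rng.choice([-1.0, 1.0])                    # unit-variance input
        y = a * x + noise_std * rng.standard_normal()  # noisy scalar label
        g = (u * w * x - y) * x                        # dLoss/d(u*w)
        # Simultaneous update; one step multiplies (u^2 - w^2)
        # exactly by (1 - lr^2 * g^2).
        u, w = u - lr * g * w, w - lr * g * u
        imbalance.append(abs(u * u - w * w))
    return u, w, np.array(imbalance)

if __name__ == "__main__":
    u, w, imbalance = run_label_noise_sgd()
    print("initial sharpness proxy u^2 + w^2:", 3.0**2 + 0.2**2)
    print("final sharpness proxy u^2 + w^2:  ", u * u + w * w)
    print("initial / final |u^2 - w^2|:      ", imbalance[0], imbalance[-1])
```

In this toy, each update shrinks u² − w² by the factor 1 − lr²·g², so the label noise alone drives the iterate along the manifold of minima toward the balanced, flattest point. Without noise (g = 0 on the manifold) the iterate would not move at all.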
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Neural Networks and Applications
