Shift-Curvature, SGD, and Generalization
Arwen V. Bradley, Carlos Alberto Gomez-Uribe, Manish Reddy Vuyyuru

TL;DR
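This paper examines how loss curvature affects generalization. It identifies two mechanisms, the shift-curvature and the bias-curvature, through which curvature harms test performance, shows that SGD noise mediates a trade-off between deep and low-curvature regions, and argues that the shift-curvature can be mitigated by minimizing overall curvature.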
Contribution
It introduces the shift-curvature and bias-curvature mechanisms affecting generalization, and derives a new SGD steady-state distribution to explain the role of noise in curvature optimization.
Findings
Shift-curvature impacts test loss significantly.
Minimizing overall curvature can mitigate the shift-curvature, even though the shift is unknown at training time.
SGD noise mediates a trade-off between deep and low-curvature regions.
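The shift-curvature findings above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's derivation: it assumes a purely quadratic test loss with a fixed Hessian, so that evaluating the test loss at the train minimum costs an extra 0.5 * s^T H s, where s is the shift between train and test minima. The function names, the shift vector, and the two Hessians are hypothetical values chosen for illustration.

```python
import numpy as np


def quadratic_loss(theta, minimum, hessian):
    """Quadratic loss with the given minimum and Hessian (toy assumption)."""
    d = theta - minimum
    return 0.5 * d @ hessian @ d


def shift_curvature_penalty(hessian, shift):
    """Second-order excess test loss 0.5 * s^T H s along the shift s."""
    return 0.5 * shift @ hessian @ shift


# Hypothetical test minimum and train/test shift (unknown at training time).
theta_test = np.array([0.0, 0.0])
shift = np.array([0.1, -0.2])
theta_train = theta_test + shift

# Two hypothetical minima with the same shift but different curvature.
sharp_hessian = np.diag([10.0, 10.0])
flat_hessian = np.diag([1.0, 1.0])

# Excess test loss incurred by sitting at the train minimum instead of the test minimum.
excess_sharp = quadratic_loss(theta_train, theta_test, sharp_hessian)
excess_flat = quadratic_loss(theta_train, theta_test, flat_hessian)

print(f"sharp minimum excess test loss: {excess_sharp:.3f}")
print(f"flat minimum excess test loss:  {excess_flat:.3f}")
```

Under these assumptions the same shift costs ten times more at the sharper minimum. Because the penalty scales with the Hessian in every direction, reducing overall curvature lowers it without needing to know the direction of the shift.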
Abstract
A longstanding debate surrounds the related hypotheses that low-curvature minima generalize better, and that SGD discourages curvature. We offer a more complete and nuanced view in support of both. First, we show that curvature harms test performance through two new mechanisms, the shift-curvature and bias-curvature, in addition to a known parameter-covariance mechanism. The three curvature-mediated contributions to test performance are reparametrization-invariant although curvature is not. The shift in the shift-curvature is the line connecting train and test local minima, which differ due to dataset sampling or distribution shift. Although the shift is unknown at training time, the shift-curvature can still be mitigated by minimizing overall curvature. Second, we derive a new, explicit SGD steady-state distribution showing that SGD optimizes an effective potential related to but…
Taxonomy
Topics: Model Reduction and Neural Networks · Statistical and numerical algorithms · Gaussian Processes and Bayesian Inference
Methods: Stochastic Gradient Descent
