The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems
Vivak Patel

TL;DR
This paper investigates why stochastic gradient descent tends to find flatter minima in nonconvex optimization, challenging existing stochastic explanations and proposing a new deterministic mechanism supported by theoretical analysis and experiments.
Contribution
It introduces a deterministic mechanism explaining SGD's preference for flat minima, based on analysis of stochastic quadratic problems and validation on nonconvex tasks.
Findings
SGD prefers flatter minima over sharper ones.
The proposed deterministic mechanism accurately predicts SGD behavior.
Experimental results support the new explanation.
Abstract
In several experimental reports on nonconvex optimization problems in machine learning, stochastic gradient descent (SGD) was observed to prefer minimizers with flat basins in comparison to more deterministic methods, yet there is very little rigorous understanding of this phenomenon. In fact, the lack of such work has led to an unverified, but widely accepted, stochastic mechanism describing why SGD prefers flatter minimizers to sharper minimizers. However, as we demonstrate, the stochastic mechanism fails to explain this phenomenon. Here, we propose an alternative deterministic mechanism that can accurately explain why SGD prefers flatter minimizers to sharper minimizers. We derive this mechanism based on a detailed analysis of a generic stochastic quadratic problem, which generalizes known results for classical gradient descent. Finally, we verify the predictions of our deterministic…
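The abstract refers to known results for classical gradient descent on quadratics. As a rough illustration of that classical result only (not the paper's stochastic analysis), the sketch below runs plain gradient descent on the one-dimensional quadratic f(x) = 0.5 * curvature * x^2. With a fixed step size, the iterates contract when curvature < 2 / step_size and grow without bound when curvature > 2 / step_size, so a sharp basin can be unstable at a step size for which a flat basin is stable. The function name, the curvature values, and the step size are hypothetical choices made for this example.

```python
def gradient_descent_on_quadratic(curvature, step_size, steps=200, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final iterate.

    Each step multiplies x by (1 - step_size * curvature), so the iterates
    shrink only when |1 - step_size * curvature| < 1.
    """
    x = x0
    for _ in range(steps):
        grad = curvature * x      # exact gradient of the quadratic
        x = x - step_size * grad  # classical gradient descent update
    return x


# Illustrative values (assumptions, not taken from the paper).
step_size = 0.1  # stability threshold: curvature < 2 / step_size = 20

flat_result = gradient_descent_on_quadratic(curvature=1.0, step_size=step_size)
sharp_result = gradient_descent_on_quadratic(curvature=25.0, step_size=step_size)

print(f"flat basin  (curvature 1):  |x| = {abs(flat_result):.3e}")
print(f"sharp basin (curvature 25): |x| = {abs(sharp_result):.3e}")
```

Under these assumptions the flat case converges toward the minimizer at 0, while the sharp case diverges from the same starting point. The paper's contribution, per the abstract, is a generalization of this kind of result to a stochastic quadratic problem; this toy example does not model batch size or sampling noise.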
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
