The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems

Vivak Patel

arXiv:1709.04718 · math.OC · May 8, 2018 · 6 cites

TL;DR

This paper investigates why stochastic gradient descent tends to find flatter minima in nonconvex optimization, challenging existing stochastic explanations and proposing a new deterministic mechanism supported by theoretical analysis and experiments.

Contribution

It introduces a deterministic mechanism explaining SGD's preference for flat minima, based on analysis of stochastic quadratic problems and validation on nonconvex tasks.

Findings

01 SGD prefers flatter minima over sharper ones.

02 The proposed deterministic mechanism accurately predicts SGD behavior.

03 Experimental results support the new explanation.

Abstract

In several experimental reports on nonconvex optimization problems in machine learning, stochastic gradient descent (SGD) was observed to prefer minimizers with flat basins in comparison to more deterministic methods, yet there is very little rigorous understanding of this phenomenon. In fact, the lack of such work has led to an unverified but widely accepted stochastic mechanism describing why SGD prefers flatter minimizers to sharper minimizers. However, as we demonstrate, the stochastic mechanism fails to explain this phenomenon. Here, we propose an alternative deterministic mechanism that can accurately explain why SGD prefers flatter minimizers to sharper minimizers. We derive this mechanism based on a detailed analysis of a generic stochastic quadratic problem, which generalizes known results for classical gradient descent. Finally, we verify the predictions of our deterministic…
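The abstract does not spell out the mechanism, so the following is only a rough intuition, not the paper's analysis. It is a minimal Python sketch of mini-batch SGD on a hypothetical one-dimensional stochastic quadratic, f(x) = E[0.5 * a * x^2], where the per-sample curvature `a` is drawn uniformly around a chosen mean. The toy model, the function name `run_sgd`, and all parameter values are assumptions made for illustration. It relies on the well-known fact that classical gradient descent on a quadratic with curvature `a` is stable only when the step size is below 2/a, so a fixed step size that is safe in a flat basin can push iterates away from a sharp one.

```python
import random


def run_sgd(curvature, step_size, batch_size, n_steps=200, x0=1.0, seed=0):
    """Mini-batch SGD on a toy 1-D stochastic quadratic (hypothetical model).

    Each sample has objective 0.5 * a * x**2 with a drawn uniformly from
    [0.5 * curvature, 1.5 * curvature]; the minimizer is x = 0 for every sample.
    Returns the distance |x| from the minimizer after n_steps updates.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        # Sample a mini-batch of per-sample curvatures.
        batch = [rng.uniform(0.5 * curvature, 1.5 * curvature)
                 for _ in range(batch_size)]
        # Mini-batch gradient of 0.5 * a * x**2 is mean(a) * x.
        grad = (sum(batch) / batch_size) * x
        x -= step_size * grad
    return abs(x)


if __name__ == "__main__":
    step = 0.5  # assumed fixed step size shared by both basins
    flat = run_sgd(curvature=1.0, step_size=step, batch_size=4)
    sharp = run_sgd(curvature=10.0, step_size=step, batch_size=4)
    print(f"flat basin  (curvature 1):  |x| = {flat:.3e}")
    print(f"sharp basin (curvature 10): |x| = {sharp:.3e}")
```

With these assumed values, every sampled curvature in the flat basin stays below 2/step_size = 4, so each update contracts the iterate, while every sampled curvature in the sharp basin exceeds it, so each update expands the iterate. Changing `batch_size` narrows or widens the spread of the mini-batch curvature, which is one simple way to explore how batch size interacts with local geometry in this toy setting.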

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent