SGD with a Constant Large Learning Rate Can Converge to Local Maxima

Liu Ziyin; Botao Li; James B. Simon; Masahito Ueda

arXiv:2107.11774 · cs.LG · May 30, 2023 · 1 citation

Open Access

TL;DR

This paper shows that, in constructed worst-case problems, stochastic gradient descent (SGD) with a constant large learning rate can converge to local maxima and can prefer sharp minima over flat ones, contrary to the common expectations that SGD avoids maxima and favors flat minima.

Contribution

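The paper constructs worst-case landscapes and data distributions in which SGD converges to local maxima, escapes saddle points arbitrarily slowly, and favors sharp minima over flat ones. It argues that understanding SGD in deep learning requires analyzing minibatch sampling, discrete-time update rules, and realistic landscapes together.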

Findings

01  SGD can converge to local maxima in suitably constructed landscapes and data distributions.

02  SGD can escape saddle points arbitrarily slowly.

03  SGD can prefer sharp minima over flat ones.

Abstract

Previous works on stochastic gradient descent (SGD) often focus on its success. In this work, we construct worst-case optimization problems illustrating that, when not in the regimes that the previous works often assume, SGD can exhibit many strange and potentially undesirable behaviors. Specifically, we construct landscapes and data distributions such that (1) SGD converges to local maxima, (2) SGD escapes saddle points arbitrarily slowly, (3) SGD prefers sharp minima over flat ones, and (4) AMSGrad converges to local maxima. We also realize results in a minimal neural network-like example. Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes to understand the role of SGD in deep learning.
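To make claim (1) concrete, the following is a minimal one-dimensional sketch and not necessarily the authors' exact construction. It assumes a per-sample loss x·w²/2 in which the curvature x is +1 with probability 0.45 and -1 with probability 0.55, so the expected loss is -0.05·w² and w = 0 is a local maximum. Each SGD step multiplies w by (1 - lr·x), so w shrinks toward 0 whenever the expected value of log|1 - lr·x| is negative, even though full-batch gradient descent moves away from 0. The learning rate 0.9, the probabilities, and all function names are illustrative assumptions.

```python
import math
import random

# Assumed toy problem (illustrative, not taken from the paper):
# per-sample loss l(w, x) = x * w**2 / 2, with random curvature x.
VALUES = [1.0, -1.0]   # possible curvatures
PROBS = [0.45, 0.55]   # E[x] = -0.1 < 0, so w = 0 is a maximum of E[l]
LR = 0.9               # constant, large learning rate


def expected_log_factor(lr, values, probs):
    """E[log|1 - lr*x|]; a negative value means |w| shrinks on average."""
    return sum(p * math.log(abs(1.0 - lr * x)) for x, p in zip(values, probs))


def run_sgd(lr, values, probs, w0=1.0, steps=500, seed=0):
    """SGD with one sample per step: w <- w - lr * x * w."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        x = rng.choices(values, weights=probs, k=1)[0]
        w -= lr * x * w
    return w


def run_gd(lr, values, probs, w0=1.0, steps=500):
    """Full-batch gradient descent on the expected loss E[x] * w**2 / 2."""
    mean_x = sum(p * x for x, p in zip(values, probs))
    w = w0
    for _ in range(steps):
        w -= lr * mean_x * w
    return w


if __name__ == "__main__":
    print("E[log|1 - lr*x|] =", expected_log_factor(LR, VALUES, PROBS))
    print("SGD final |w|    =", abs(run_sgd(LR, VALUES, PROBS)))
    print("GD  final |w|    =", abs(run_gd(LR, VALUES, PROBS)))
```

In this sketch the per-step factors are 0.1 and 1.9. The rare strong contractions outweigh the frequent mild expansions on a log scale, which is why the discrete-time, single-sample dynamics behave differently from the averaged gradient flow.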

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Markov Chains and Monte Carlo Methods

Methods: Stochastic Gradient Descent · AMSGrad