Escaping Saddles with Stochastic Gradients

Hadi Daneshmand; Jonas Kohler; Aurelien Lucchi; Thomas Hofmann

arXiv:1803.05999 · cs.LG · September 18, 2018 · 57 cites

Open Access

TL;DR

This paper shows that, in certain non-convex machine learning models, stochastic gradients carry a strong component along negative curvature directions, so a plain SGD step can replace the explicit isotropic noise usually injected to help gradient descent escape saddle points.

Contribution

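It proposes a new assumption, motivated by the variance of stochastic gradients along negative curvature directions, under which a simple SGD step can replace explicit isotropic noise injection. Under the same assumption it derives the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.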

Findings

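01. In certain non-convex machine learning models, stochastic gradients exhibit a strong component along negative curvature directions.

02. Unlike isotropic noise, the variance along these directions is proportional to the magnitude of the corresponding eigenvalues and does not decrease with the dimensionality.

03. Under the proposed assumption, plain SGD converges to a second-order stationary point without explicit noise injection, in a number of iterations independent of the problem dimension.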
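For readers unfamiliar with the term, a second-order stationary point is commonly formalized as a point where the gradient is small and the Hessian has no strongly negative eigenvalue. The tolerances \epsilon_g and \epsilon_h below are notation assumed here for illustration; they are not taken from this page.

```latex
% Approximate second-order stationarity of w for an objective f
% (\epsilon_g, \epsilon_h > 0 are illustrative tolerances):
\|\nabla f(w)\| \le \epsilon_g
\quad \text{and} \quad
\lambda_{\min}\!\left(\nabla^2 f(w)\right) \ge -\epsilon_h .
```

A strict saddle point satisfies the first condition but violates the second, which is why escaping it requires some movement along a negative curvature direction.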

Abstract

We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.
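The comparison described in the abstract can be explored numerically. The following is a minimal sketch, not the authors' experiment: it assumes a toy objective f(w) = mean_i sigmoid(z_i . w) on Gaussian data, and the function names are hypothetical. It measures the second moment of per-sample gradients projected onto the Hessian's minimum-eigenvalue direction and compares it with the same quantity for isotropic unit-norm noise, which is 1/d in expectation and therefore shrinks as the dimension d grows.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def per_sample_gradients(w, Z):
    """Gradients of f_i(w) = sigmoid(z_i . w), one row per sample (toy model)."""
    s = sigmoid(Z @ w)
    return (s * (1.0 - s))[:, None] * Z


def full_hessian(w, Z):
    """Hessian of the average objective f(w) = mean_i sigmoid(z_i . w)."""
    s = sigmoid(Z @ w)
    curv = s * (1.0 - s) * (1.0 - 2.0 * s)  # second derivative of the sigmoid
    return (Z * curv[:, None]).T @ Z / Z.shape[0]


def projected_second_moments(w, Z, rng, n_noise=20000):
    """Return (lambda_min, SGD moment, isotropic moment) along the min-eigenvalue direction."""
    H = full_hessian(w, Z)
    eigvals, eigvecs = np.linalg.eigh(H)  # eigenvalues in ascending order
    lam_min, v = eigvals[0], eigvecs[:, 0]

    # Second moment of the stochastic gradients projected onto v.
    G = per_sample_gradients(w, Z)
    sgd_moment = np.mean((G @ v) ** 2)

    # Same quantity for isotropic noise drawn uniformly from the unit sphere.
    xi = rng.standard_normal((n_noise, w.size))
    xi /= np.linalg.norm(xi, axis=1, keepdims=True)
    iso_moment = np.mean((xi @ v) ** 2)
    return lam_min, sgd_moment, iso_moment


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (10, 50, 200):
        Z = rng.standard_normal((2000, d))
        w = rng.standard_normal(d) / np.sqrt(d)
        lam_min, sgd_m, iso_m = projected_second_moments(w, Z, rng)
        print(f"d={d:4d}  lambda_min={lam_min:+.4f}  sgd={sgd_m:.4f}  isotropic={iso_m:.4f}")
```

The isotropic baseline is normalized to unit norm so that its projection onto any fixed direction has an expected square of exactly 1/d; the per-sample gradients are left unnormalized, so the two columns are best read as trends across d rather than as directly comparable magnitudes.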

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent