On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

Chi Jin; Praneeth Netrapalli; Rong Ge; Sham M. Kakade; Michael I. Jordan

arXiv:1902.04811·cs.LG·September 5, 2019·58 cites

Open Access

TL;DR

This paper analyzes how perturbed versions of gradient descent (GD) and stochastic gradient descent (SGD) efficiently find second-order stationary points in the high-dimensional nonconvex optimization problems typical of machine learning.

Contribution

It demonstrates that perturbed GD and SGD can avoid saddle points with only polylogarithmic dimension dependence, improving upon previous polynomial bounds.

Findings

01. Perturbed GD and SGD converge efficiently to second-order stationary points.

02. The dimension dependence of these algorithms is polylogarithmic, not polynomial.

03. The perturbed algorithms reach second-order stationary points in essentially the same time they take to reach first-order stationary points.
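
For reference, the "second-order stationary point" named in the findings is commonly formalized as below. This is a sketch of the standard definition, not text taken from this page: the tolerance \epsilon and the Hessian Lipschitz constant \rho are symbols introduced here as assumptions.

```latex
% x is an \epsilon-second-order stationary point of f (with \rho-Lipschitz Hessian) if
\|\nabla f(x)\| \le \epsilon
\quad\text{and}\quad
\lambda_{\min}\!\left(\nabla^2 f(x)\right) \ge -\sqrt{\rho\,\epsilon}
```

The first condition alone defines an \epsilon-first-order stationary point, which a saddle point can satisfy. The second condition excludes points where the Hessian has a significantly negative eigenvalue, that is, strict saddle points.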

Abstract

Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient---their…
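
The saddle-point issue described in the abstract can be illustrated with a minimal sketch. The toy function, step size `eta`, noise scale `radius`, and the helper name `gradient_descent` are all illustrative assumptions, not the paper's algorithm or parameters. The sketch shows one simple form of perturbation: adding small isotropic Gaussian noise to each gradient step.

```python
import numpy as np

def grad(x):
    # Gradient of the toy function (an assumption for illustration):
    #   f(x) = 0.5*x[0]**2 - 0.5*x[1]**2 + 0.25*x[1]**4
    # It has a saddle point at (0, 0) and minima at (0, 1) and (0, -1).
    return np.array([x[0], -x[1] + x[1] ** 3])

def gradient_descent(x0, eta=0.1, steps=500, radius=0.0, seed=0):
    """Gradient descent; radius > 0 adds Gaussian noise to every gradient."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    for _ in range(steps):
        if radius > 0:
            # Isotropic noise with per-coordinate std radius / sqrt(d).
            noise = rng.normal(scale=radius / np.sqrt(d), size=d)
        else:
            noise = np.zeros(d)
        x = x - eta * (grad(x) + noise)
    return x

# Starting on the axis x[1] = 0, plain GD never leaves it and
# slides into the saddle point at the origin.
plain = gradient_descent([1.0, 0.0])

# The same start with a small perturbation leaves the saddle and
# settles near one of the two minima.
perturbed = gradient_descent([1.0, 0.0], radius=0.01)

print("plain GD:    ", plain)
print("perturbed GD:", perturbed)
```

In this two-dimensional toy the escape is easy; the paper's contribution concerns how the cost of such escapes scales with dimension, which this sketch does not measure.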


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Methods: Stochastic Gradient Descent