Gradient Descent Can Take Exponential Time to Escape Saddle Points
Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh

TL;DR
This paper demonstrates that standard gradient descent can take exponential time to escape saddle points, whereas perturbed gradient descent can do so efficiently, highlighting the importance of perturbations in non-convex optimization.
Contribution
The paper constructs a theoretical example in which gradient descent takes exponential time to escape saddle points, in contrast to perturbed gradient descent, which finds an approximate local minimizer in polynomial time.
Findings
Gradient descent can take exponential time to escape saddle points.
Perturbed gradient descent escapes saddle points in polynomial time.
Experiments support the theoretical results.
Abstract
Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points - it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
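The role of perturbations can be illustrated with a minimal sketch. This is not the paper's construction: the toy function f(x, y) = x^2/2 - y^2/2 + y^4/4, the step size, the gradient threshold, and the perturbation radius are all assumptions chosen for illustration, and the perturbation rule is a simplified stand-in for the algorithms of Ge et al. and Jin et al. The function has a saddle at (0, 0) and minima at (0, ±1). Plain GD is started exactly on the saddle's stable manifold (y = 0), a degenerate initialization used here only to make the stall visible, whereas the paper's result concerns natural random initializations.

```python
import math
import random


def f(x, y):
    # Toy objective (an assumption for illustration): saddle at (0, 0),
    # local minima at (0, 1) and (0, -1) with value -0.25.
    return 0.5 * x * x - 0.5 * y * y + 0.25 * y ** 4


def grad(x, y):
    # Gradient of f: (df/dx, df/dy).
    return x, -y + y ** 3


def gradient_descent(x, y, eta=0.1, steps=1000):
    # Plain GD with a fixed step size.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


def perturbed_gradient_descent(x, y, eta=0.1, steps=1000,
                               g_thresh=1e-3, radius=1e-2, seed=0):
    # Simplified perturbed GD: whenever the gradient is small, add a
    # small random perturbation before taking the gradient step.
    # The threshold and radius are hypothetical values, not the
    # parameter choices analysed in the literature.
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < g_thresh:
            x += rng.uniform(-radius, radius)
            y += rng.uniform(-radius, radius)
            gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


if __name__ == "__main__":
    gd_x, gd_y = gradient_descent(1.0, 0.0)
    pgd_x, pgd_y = perturbed_gradient_descent(1.0, 0.0)
    print("GD  final point:", (gd_x, gd_y), "f =", f(gd_x, gd_y))
    print("PGD final point:", (pgd_x, pgd_y), "f =", f(pgd_x, pgd_y))
```

In this sketch, plain GD keeps y at exactly zero and converges to the saddle, while the perturbed variant is pushed off the stable manifold and settles near one of the minima. The exponential slow-down proved in the paper arises from a more elaborate construction than this single saddle, so the example conveys only the mechanism, not the rate.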
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Topological and Geometric Data Analysis
