Escaping Saddle Points with Adaptive Gradient Methods

Matthew Staib; Sashank J. Reddi; Satyen Kale; Sanjiv Kumar; Suvrit Sra

arXiv:1901.09149 · cs.LG · February 4, 2020 · 23 cites

TL;DR

This paper gives a theoretical analysis of adaptive gradient methods such as Adam and RMSProp. It shows that they escape saddle points faster than SGD because they estimate, in an online manner, a preconditioner that rescales the stochastic gradient noise to be isotropic near stationary points.

Contribution

The paper recasts adaptive methods as preconditioned SGD with an online-estimated preconditioner, and provides what is, to the authors' knowledge, the first second-order convergence result for any adaptive method in nonconvex optimization.

Findings

01

Adaptive methods rescale the stochastic gradient noise to be isotropic near stationary points, which helps them escape saddle points.

02

Adaptive methods can estimate this preconditioner efficiently in an online manner.

03

Adaptive methods converge faster to second-order stationary points than SGD.
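
The noise-rescaling idea in the first finding can be illustrated with a small sketch. It assumes a diagonal preconditioner built from the per-coordinate second moment of the gradient, in the style of RMSProp, and it samples pure zero-mean noise to mimic a stationary point. The helper names (`second_moment_diag`, `precondition`), the noise scales, and the `eps` value are illustrative assumptions, not taken from the paper.

```python
import random

def second_moment_diag(samples):
    """Per-coordinate mean of squared entries, i.e. the diagonal of E[g g^T]."""
    n = len(samples)
    dim = len(samples[0])
    return [sum(s[i] ** 2 for s in samples) / n for i in range(dim)]

def precondition(g, second_moment, eps=1e-8):
    """Scale each coordinate of g by 1 / (sqrt(second moment) + eps)."""
    return [gi / (vi ** 0.5 + eps) for gi, vi in zip(g, second_moment)]

rng = random.Random(0)

# Assumed toy setting: the true gradient is zero (a stationary point), so each
# stochastic gradient is pure noise, with very different scales per coordinate.
noise_std = [10.0, 0.1]
samples = [[rng.gauss(0.0, s) for s in noise_std] for _ in range(20000)]

raw_var = second_moment_diag(samples)                 # strongly anisotropic
rescaled = [precondition(g, raw_var) for g in samples]
rescaled_var = second_moment_diag(rescaled)           # close to 1 in each coordinate

print("raw second moments:     ", raw_var)
print("rescaled second moments:", rescaled_var)
```

After rescaling, every coordinate carries noise of comparable magnitude, so a direction of negative curvature is not starved of noise merely because the raw gradient noise happens to be small along it.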

Abstract

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can…
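
The "preconditioned SGD" view described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it uses the common textbook form of RMSProp with a diagonal preconditioner, and the function name `rmsprop_step`, the default hyperparameters, and the placement of `eps` are all assumptions made for the example.

```python
def rmsprop_step(x, g, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp-style update, written as preconditioned SGD.

    x: current iterate, g: stochastic gradient at x,
    v: running estimate of the per-coordinate second moment of g.
    """
    # Online estimate of the diagonal second moment (exponential moving average).
    v_new = [beta * vi + (1.0 - beta) * gi * gi for vi, gi in zip(v, g)]
    # Diagonal preconditioner: roughly the inverse square root of that estimate.
    precond = [1.0 / (vi ** 0.5 + eps) for vi in v_new]
    # Preconditioned SGD step: x <- x - lr * A * g, with A = diag(precond).
    x_new = [xi - lr * pi * gi for xi, pi, gi in zip(x, precond, g)]
    return x_new, v_new

# Usage with made-up numbers: one step from x = (1, 1) with gradient (2, 0).
x, v = [1.0, 1.0], [0.0, 0.0]
x, v = rmsprop_step(x, [2.0, 0.0], v, lr=0.1, beta=0.9)
print(x, v)
```

Writing the update this way separates the two components the abstract discusses: the preconditioner itself (`precond`) and the online procedure that estimates it (`v_new`).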

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Methods: Stochastic Gradient Descent · RMSProp · Adam