A Sufficient Condition for Convergences of Adam and RMSProp

Fangyu Zou; Li Shen; Zequn Jie; Weizhong Zhang; Wei Liu

arXiv:1811.09358 · cs.LG · June 26, 2019 · 28 cites

TL;DR

This paper introduces a simple, verifiable condition based on algorithm parameters that guarantees the convergence of Adam and RMSProp in non-convex stochastic optimization, providing new insights into their behavior and divergence.

Contribution

It proposes an easy-to-check sufficient condition for convergence of Adam and RMSProp, and offers a novel perspective by relating Adam to weighted AdaGrad with momentum.
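The link between Adam's second-moment estimate and AdaGrad can be illustrated numerically. The sketch below is not taken from the paper. It assumes the standard recursion v_t = theta_t * v_{t-1} + (1 - theta_t) * g_t^2 with v_0 = 0, and the names `second_moment`, `theta_schedule`, and the synthetic `grads` array are hypothetical. With theta_t = 1 - 1/t the recursion reduces to the uniform average of squared gradients (AdaGrad's accumulator divided by t), while a constant theta yields an exponentially weighted average of the same squared gradients.

```python
import numpy as np


def second_moment(grads, theta_schedule):
    """Return v_T from v_t = theta_t * v_{t-1} + (1 - theta_t) * g_t**2, v_0 = 0.

    `theta_schedule` is a hypothetical callable mapping the step index t >= 1
    to the decay parameter theta_t used at that step.
    """
    v = np.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        theta = theta_schedule(t)
        v = theta * v + (1.0 - theta) * g ** 2
    return v


# Synthetic gradients (an assumption for illustration): 50 steps, 3 dimensions.
rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 3))
T = len(grads)

# Case 1: theta_t = 1 - 1/t gives uniform weights, i.e. the running mean of
# squared gradients, which is AdaGrad's accumulated sum divided by t.
v_uniform = second_moment(grads, lambda t: 1.0 - 1.0 / t)
adagrad_mean = np.mean(grads ** 2, axis=0)

# Case 2: constant theta (as in RMSProp or Adam without bias correction)
# gives weights (1 - theta) * theta**(T - i) on the i-th squared gradient.
theta = 0.9
v_ema = second_moment(grads, lambda t: theta)
weights = (1.0 - theta) * theta ** (T - np.arange(1, T + 1))
v_weighted = weights @ (grads ** 2)

print(np.allclose(v_uniform, adagrad_mean))
print(np.allclose(v_ema, v_weighted))
```

Both checks compare the recursive estimate against an explicit weighted sum of squared gradients, so the only difference between the two cases is the choice of weights.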

Findings

1. The sufficient condition guarantees convergence in large-scale non-convex problems.

2. Several Adam variants also satisfy the convergence condition.

3. Numerical experiments confirm the theoretical predictions.

Abstract

Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, which have been pointed out to be divergent even in the convex setting via a few simple counterexamples. Many attempts, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been tried to promote Adam/RMSProp-type algorithms to converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization. Moreover, we show that the convergences of several variants of Adam, such as AdamNC, AdaEMA,…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: AdaGrad · RMSProp · Adam