AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

Zhiming Zhou; Qingru Zhang; Guansong Lu; Hongwei Wang; Weinan Zhang; Yong Yu

arXiv:1810.00143·cs.LG·June 25, 2019·29 cites

Open Access · 3 Repos

TL;DR

This paper identifies the cause of Adam's non-convergence as correlation between gradients and second-moment estimates, and proposes AdaShift, a new method that decorrelates them to ensure convergence without sacrificing performance.

Contribution

The paper introduces AdaShift, a novel adaptive learning rate algorithm that decorrelates gradient and second-moment estimates to guarantee convergence.

Findings

01

AdaShift effectively addresses Adam's non-convergence issue.

02

AdaShift maintains competitive training speed and generalization.

03

Decorrelating the second-moment term $v_{t}$ from the gradient $g_{t}$ yields an unbiased step size for each gradient.

Abstract

Adam has been shown to fail to converge to the optimal solution in certain cases. Researchers have recently proposed several algorithms to avoid Adam's non-convergence, but their efficiency turns out to be unsatisfactory in practice. In this paper, we provide new insight into the non-convergence issue of Adam as well as other adaptive learning rate methods. We argue that there exists an inappropriate correlation between the gradient $g_{t}$ and the second-moment term $v_{t}$ in Adam ($t$ is the timestep), as a result of which a large gradient is likely to have a small step size while a small gradient may have a large step size. We demonstrate that such biased step sizes are the fundamental cause of the non-convergence of Adam, and we further prove that decorrelating $v_{t}$ and $g_{t}$ leads to an unbiased step size for each gradient, thus solving the non-convergence problem of Adam. Finally,…
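The biased-step-size argument in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's AdaShift algorithm: the gradient distribution (`big`, `p_big`), the small `beta2`, the absence of momentum and bias correction, and the function name `mean_scaled_gradients` are all hypothetical choices made for illustration. Gradients are drawn i.i.d., so the previous second-moment estimate $v_{t-1}$ is independent of $g_{t}$, whereas the Adam-style $v_{t}$ contains $g_{t}^2$ and is therefore correlated with it.

```python
import math
import random


def mean_scaled_gradients(num_steps=20000, big=10.0, p_big=0.2, beta2=0.1, seed=0):
    """Compare average scaled gradients with correlated vs. decorrelated v.

    Toy setup (assumption, not from the paper): each gradient is `big` with
    probability `p_big` and -1 otherwise, so the expected gradient is positive.
    """
    rng = random.Random(seed)
    v_prev = None          # second-moment estimate before seeing the current gradient
    grad_sum = 0.0         # sum of raw gradients
    adam_sum = 0.0         # sum of g_t / sqrt(v_t), where v_t includes g_t
    shifted_sum = 0.0      # sum of g_t / sqrt(v_{t-1}), independent of g_t here
    count = 0
    for _ in range(num_steps):
        g = big if rng.random() < p_big else -1.0
        if v_prev is None:
            # Use the first gradient only to initialise v, so v stays positive.
            v_prev = g * g
            continue
        v_curr = beta2 * v_prev + (1.0 - beta2) * g * g
        grad_sum += g
        adam_sum += g / math.sqrt(v_curr)
        shifted_sum += g / math.sqrt(v_prev)
        count += 1
        v_prev = v_curr
    return grad_sum / count, adam_sum / count, shifted_sum / count


mean_grad, adam_mean, shifted_mean = mean_scaled_gradients()
print(f"mean gradient:              {mean_grad:+.3f}")
print(f"mean g_t / sqrt(v_t):       {adam_mean:+.3f}")
print(f"mean g_t / sqrt(v_(t-1)):   {shifted_mean:+.3f}")
```

In this toy setting the average raw gradient is positive, yet the average Adam-style scaled gradient comes out negative because the rare large gradient is damped by its own contribution to $v_{t}$. Scaling by $v_{t-1}$ instead keeps the sign of the average gradient. Using the previous estimate is just one simple way to remove the correlation in an i.i.d. setting and should not be read as the method the paper proposes.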

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · AdaShift · Adam