On the Convergence Proof of AMSGrad and a New Version

Tran Thi Phuong; Le Trieu Phong

arXiv:1904.03590·cs.LG·November 1, 2019

TL;DR

This paper critically examines the convergence proofs of AMSGrad, identifies issues with hyper-parameter handling, and proposes fixes including a new version called AdamX, supported by theoretical analysis and experiments.

Contribution

It reveals flaws in the convergence proof of AMSGrad, provides a corrected proof, and introduces a new variant AdamX with empirical validation.

Findings

01 The convergence proof of AMSGrad is flawed: it treats the hyper-parameters $\beta_{1,t}$ and $\beta_{1}$ as equal when they are not.

02 A corrected convergence proof for AMSGrad is provided.

03 The new AdamX algorithm behaves similarly to AMSGrad on CIFAR-10, in line with the convergence results for both.

Abstract

The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad is in handling the hyper-parameters, treating them as equal while they are not. This is also the neglected issue in the convergence proof of Adam. We provide an explicit counter-example of a simple convex optimization setting to show this neglected issue. Depending on manipulating the hyper-parameters, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our…

Equations

$$R(T)$$

$$\frac{\sqrt{\hat{v}_{t+1}}}{\alpha_{t+1}}-\frac{\sqrt{\hat{v}_{t}}}{\alpha_{t}}$$

$$\frac{\sqrt{\hat{v}_{t+1}}}{\alpha_{t+1}(1-\beta_{1,t+1})}-\frac{\sqrt{\hat{v}_{t}}}{\alpha_{t}(1-\beta_{1,t})}$$

$$R(T)=\sum_{t=1}^{T}[f_{t}(x_{t})-f_{t}(x^{*})],$$

$$\lambda f(x)+(1-\lambda)f(y)\geq f(\lambda x+(1-\lambda)y).$$

$$f(y)\geq f(x)+\nabla f(x)^{{\sf T}}(y-x),$$

$$\Big(\sum_{i=1}^{n}u_{i}v_{i}\Big)^{2}\leq\Big(\sum_{i=1}^{n}u_{i}^{2}\Big)\Big(\sum_{i=1}^{n}v_{i}^{2}\Big).$$

$$\sum_{t\geq 0}\alpha^{t}=\frac{1}{1-\alpha}$$

$$\sum_{t\geq 1}t\alpha^{t-1}=\frac{1}{(1-\alpha)^{2}}.$$

$$\sum_{n=1}^{N}\frac{1}{n}\leq\ln N+1.$$

$$\sum_{n=1}^{N}\frac{1}{\sqrt{n}}\leq 2\sqrt{N}.$$

$$\frac{\sum_{i=1}^{n}a_{i}}{\sum_{j=1}^{n}b_{j}}\leq\sum_{i=1}^{n}\frac{a_{i}}{b_{i}}.$$

$$\lVert Q^{1/2}(u_{1}-u_{2})\rVert\leq\lVert Q^{1/2}(z_{1}-z_{2})\rVert.$$

$$x_{t+1}=\prod_{\mathcal{F},\sqrt{\hat{V}_{t}}}(x_{t}-\alpha_{t}\cdot\hat{V}_{t}^{-1/2}m_{t})=\min_{x\in\mathcal{F}}\lVert\hat{V}_{t}^{1/4}(x-(x_{t}-\alpha_{t}\hat{V}_{t}^{-1/2}m_{t}))\rVert$$

$$\lVert\hat{V}_{t}^{1/4}(x_{t+1}-x^{*})\rVert^{2}$$

$$\langle g_{t},x_{t}-x^{*}\rangle$$

$$\sum_{i=1}^{d}g_{t,i}(x_{t,i}-x^{*}_{,i})$$

$$f_{t}(x_{t})-f_{t}(x^{*})\leq g_{t}^{{\sf T}}(x_{t}-x^{*})=\sum_{i=1}^{d}g_{t,i}(x_{t,i}-x^{*}_{,i}).$$

$$R(T)=\sum_{t=1}^{T}[f_{t}(x_{t})-f_{t}(x^{*})]\leq\sum_{t=1}^{T}g_{t}^{{\sf T}}(x_{t}-x^{*})=\sum_{t=1}^{T}\sum_{i=1}^{d}g_{t,i}(x_{t,i}-x^{*}_{,i}).$$

$$m_{t-1,i}(x^{*}_{,i}-x_{t,i})$$

$$\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}m_{t-1,i}(x^{*}_{,i}-x_{t,i})$$

$$\sum_{i=1}^{d}\sum_{t=2}^{T}\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,i}}}{2\alpha_{t-1}(1-\beta_{1,t})}(x_{t,i}-x^{*}_{,i})^{2}$$

$$\sum_{i=1}^{d}\sum_{t=2}^{T}\frac{\beta_{1,t}\alpha_{t-1}}{2(1-\beta_{1,t})}\frac{m_{t-1,i}^{2}}{\sqrt{\hat{v}_{t-1,i}}}$$

$$\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\alpha_{t}}{2(1-\beta_{1,t})}\frac{m_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}}+\sum_{i=1}^{d}\sum_{t=2}^{T}\frac{\beta_{1,t}\alpha_{t-1}}{2(1-\beta_{1,t})}\frac{m_{t-1,i}^{2}}{\sqrt{\hat{v}_{t-1,i}}}$$

$$\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\sqrt{\hat{v}_{t,i}}}{2\alpha_{t}(1-\beta_{1,t})}\big((x_{t,i}-x^{*}_{,i})^{2}-(x_{t+1,i}-x^{*}_{,i})^{2}\big),$$

$$\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\alpha_{t}}{1-\beta_{1}}\frac{m_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}},$$

$$\sum_{i=1}^{d}\sum_{t=2}^{T}\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,i}}}{2\alpha_{t-1}(1-\beta_{1})}(x_{t,i}-x^{*}_{,i})^{2}.$$

Taxonomy

Methods: Adam

Full text

On the Convergence Proof of AMSGrad and a New Version⋆

Tran Thi Phuong (∗, ∗∗, ∗∗∗)

Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam. Postal address: 19 Nguyen Huu Tho street, Tan Phong ward, District 7, Ho Chi Minh City, Vietnam.

Meiji University. Postal address: 1-1-1 Higashi-Mita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan.

[email protected]

and

Le Trieu Phong (∗∗∗)

National Institute of Information and Communications Technology (NICT). Postal address: 4-2-1, Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan.

[email protected]

Abstract.

The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad is in handling the hyper-parameters, treating them as equal while they are not. This is also the neglected issue in the convergence proof of Adam. We provide an explicit counter-example of a simple convex optimization setting to show this neglected issue. Depending on manipulating the hyper-parameters, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our experiments on the benchmark dataset also support our theoretical results.

Key words and phrases. Optimizer, adaptive moment estimation, Adam, AMSGrad, deep neural networks.

⋆A version of this paper appears at IEEE Access DOI: 10.1109/ACCESS.2019.2916341

1. Introduction and our contributions

One of the most popular algorithms for training deep neural networks is stochastic gradient descent (SGD) [1] and its variants. Among the various variants of SGD, the algorithm with the adaptive moment estimation Adam [2] is widely used in practice. However, Reddi et al. [3] have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad to solve this issue.

Our contribution. In this paper, we point out a flaw in the convergence proof of AMSGrad. We then fix this flaw by providing a new convergence proof for AMSGrad in the case of special parameters. In addition, in the case of general parameters, we propose a new and slightly modified version of AMSGrad.

To provide more details, let us recall AMSGrad in Algorithm 1, in which the mathematical notation can be fully found in Section 2.
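For orientation, the update that Algorithm 1 refers to can be sketched as follows. This is a minimal sketch, assuming the AMSGrad update as stated by Reddi et al. [3], a box-shaped feasible set (so that the weighted projection reduces to coordinate-wise clipping), and the schedule $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ used later in Example 3.2. The function name `amsgrad`, its parameters, and the zero-denominator guard are illustrative choices, not part of the paper.

```python
import math

def amsgrad(grad_fn, x1, T, alpha=0.001, beta1=0.9, lam=0.001,
            beta2=0.999, lo=-1.0, hi=1.0):
    """Run T AMSGrad steps on the box [lo, hi]^d; return iterates x_1..x_{T+1}."""
    d = len(x1)
    x = list(x1)
    m = [0.0] * d        # first-moment estimate, m_0 = 0
    v = [0.0] * d        # second-moment estimate, v_0 = 0
    v_hat = [0.0] * d    # running maximum of v, v_hat_0 = 0
    xs = [list(x)]
    for t in range(1, T + 1):
        g = grad_fn(t, x)                    # g_t = gradient of f_t at x_t
        beta1_t = beta1 * lam ** (t - 1)     # assumed schedule for beta_{1,t}
        alpha_t = alpha / math.sqrt(t)       # step size alpha / sqrt(t)
        for i in range(d):
            m[i] = beta1_t * m[i] + (1.0 - beta1_t) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            v_hat[i] = max(v_hat[i], v[i])
            # Guard added for this sketch only: no step while v_hat is zero.
            step = alpha_t * m[i] / math.sqrt(v_hat[i]) if v_hat[i] > 0 else 0.0
            # With diagonal weights and a box, the weighted projection is a clip.
            x[i] = min(max(x[i] - step, lo), hi)
        xs.append(list(x))
    return xs
```

For example, `amsgrad(lambda t, x: [1010.0 if t == 1 else -10.0], [1.0], 2)` runs two steps on the one-dimensional gradients that appear in Example 3.2.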

The main theorem for the convergence of AMSGrad in [3] is as follows. To simplify the notation, we define $g_{t}\overset{\Delta}{=}\nabla_{x}f_{t}(x_{t})$ , $g_{t,i}$ as the $i^{\text{th}}$ element of $g_{t}$ and $g_{1:t,i}\in\mathbb{R}^{t}$ as a vector that contains the $i^{\text{th}}$ dimension of the gradients over all iterations up to $t$ , namely, $g_{1:t,i}=[g_{1,i},g_{2,i},...,g_{t,i}]$ .

Theorem A (Theorem 4 in [3], problematic).

Let $x_{t}$ and $v_{t}$ be the sequences obtained from Algorithm 1, $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$ , $\beta_{1}=\beta_{1,1}$ , $\beta_{1,t}\leq\beta_{1}$ for all $t\in[T]$ and $\frac{\beta_{1}}{\sqrt{\beta_{2}}}\leq 1$ . Assume that $\mathcal{F}$ has bounded diameter $D_{\infty}$ and $\lVert{\nabla f_{t}(x)}\rVert_{\infty}\leq G_{\infty}$ for all $t\in[T]$ and $x\in\mathcal{F}$ . For $x_{t}$ generated using AMSGrad (Algorithm 1), we have the following bound on the regret:

[TABLE]

In their proof for Theorem A, Reddi et al. resolved an issue on the so-called telescopic sum in the convergence proof of Adam ([2, Theorem 10.5]). Specifically, Reddi et al. adjusted $\hat{v}_{t}$ such that all components in the vector

$$\frac{\sqrt{\hat{v}_{t+1}}}{\alpha_{t+1}}-\frac{\sqrt{\hat{v}_{t}}}{\alpha_{t}}\qquad(1.0.1)$$

are always positive. However, there is another issue (shown in Section 3) in the convergence proof of Adam that AMSGrad unfortunately neglects. The issue affects both the correctness of Reddi et al.’s proof and the upper bound for the regret in Theorem A. To deal with the issue in a general way, we propose to modify Algorithm 1 such that all components in the vector

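$$\frac{\sqrt{\hat{v}_{t+1}}}{\alpha_{t+1}\boxed{(1-\beta_{1,t+1})}}-\frac{\sqrt{\hat{v}_{t}}}{\alpha_{t}\boxed{(1-\beta_{1,t})}}$$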

are always positive. The differences with (1.0.1) are highlighted in the boxes for clarity.

Paper roadmap. We begin with preliminaries in Section 2. We show where the proof of Theorem A becomes invalid in Section 3. After that, we suggest two ways to resolve the issue in Sections 4 and 5.

Subsequent works. The first version of this paper publicly appeared on arXiv on 7 April 2019. On 19 April 2019, Reddi et al. revised their original proofs (https://arxiv.org/abs/1904.09237). The revised proof does not suffer from the issue pointed out in Section 3 of this paper, although it yields a constant factor missing in the original claims.

2. Preliminaries

Notation. Given a sequence of vectors $\{x_{t}\}_{1\leq t\leq T}$ $(1\leq T\in\mathbb{N})$ in $\mathbb{R}^{d}$, we denote its $i^{\text{th}}$ coordinate by $x_{t,i}$, use $x_{t}^{k}$ to denote the elementwise power of $k$, and use $\lVert x_{t}\rVert_{2}$, resp. $\lVert x_{t}\rVert_{\infty}$, to denote its $\ell_{2}$-norm, resp. $\ell_{\infty}$-norm. Let $\mathcal{F}\subseteq\mathbb{R}^{d}$ be a feasible set of points such that $\mathcal{F}$ has bounded diameter $D_{\infty}$, that is, $\lVert x-y\rVert_{\infty}\leq D_{\infty}$ for all $x,y\in\mathcal{F}$, and let $\mathcal{S}^{d}_{+}$ denote the set of all positive definite $d\times d$ matrices. For a matrix $A\in\mathcal{S}^{d}_{+}$, we write $A^{1/2}$ for the square root of $A$. The projection operation $\prod_{\mathcal{F},A}(y)$ for $A\in\mathcal{S}^{d}_{+}$ is defined as $\mathrm{argmin}_{x\in\mathcal{F}}\lVert A^{1/2}(x-y)\rVert_{2}$ for all $y\in\mathbb{R}^{d}$. When $d=1$ and $\mathcal{F}\subset\mathbb{R}$, the positive definite matrix $A$ is a positive number, so that the projection $\prod_{\mathcal{F},A}(y)$ becomes $\mathrm{argmin}_{x\in\mathcal{F}}|x-y|$. We use $\langle x,y\rangle$ to denote the inner product between $x$ and $y\in\mathbb{R}^{d}$. The gradient of a function $f$ evaluated at $x\in\mathbb{R}^{d}$ is denoted by $\nabla f(x)$. For vectors $x,y\in\mathbb{R}^{d}$, we use $\sqrt{x}$ or $x^{1/2}$ for element-wise square root, $x^{2}$ for element-wise square, and $x/y$ for element-wise division. For an integer $n\in\mathbb{N}$, we denote by $[n]$ the set of integers $\{1,2,...,n\}$.

Optimization setup. Let $f_{1},f_{2},...,f_{T}:\mathbb{R}^{d}\to\mathbb{R}$ be an arbitrary sequence of convex cost functions and $x_{1}\in\mathbb{R}^{d}$ . At each time $t\geq 1$ , the goal is to predict the parameter $x_{t}$ and evaluate it on a previously unknown cost function $f_{t}$ . Since the nature of the sequence is unknown in advance, the algorithm is evaluated by using the regret, that is, the sum of all the previous differences between the online prediction $f_{t}(x_{t})$ and the best fixed-point parameter $f_{t}(x^{*})$ from a feasible set $\mathcal{F}$ for all the previous steps. Concretely, the regret is defined as

$$R(T)=\sum_{t=1}^{T}[f_{t}(x_{t})-f_{t}(x^{*})],$$

where $x^{*}=\text{argmin}_{x\in\mathcal{F}}\sum_{t=1}^{T}f_{t}(x)$ .
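The regret definition can be illustrated with a short computation. The sketch below assumes scalar parameters and losses given as Python callables; the helper name `regret` is hypothetical and only mirrors the sum in the definition of $R(T)$.

```python
def regret(losses, xs, x_star):
    """R(T) = sum over t of f_t(x_t) - f_t(x*), with losses[t-1] = f_t."""
    return sum(f(x) - f(x_star) for f, x in zip(losses, xs))

# Three linear losses on F = [-1, 1]; their sum is 990 x, minimized at x* = -1.
losses = [lambda x: 1010 * x, lambda x: -10 * x, lambda x: -10 * x]
print(regret(losses, [1.0, 1.0, 1.0], -1.0))
```

A learner that stays at $x_{t}=1$ for all three rounds pays $2020-20-20$ relative to the best fixed point $x^{*}=-1$.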

Definition 2.1.

A function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex if for all $x,y\in\mathbb{R}^{d}$ , and all $\lambda\in[0,1]$ ,

$$\lambda f(x)+(1-\lambda)f(y)\geq f(\lambda x+(1-\lambda)y).$$

Lemma 2.2.

If a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex, then for all $x,y\in\mathbb{R}^{d}$ ,

$$f(y)\geq f(x)+\nabla f(x)^{{\sf T}}(y-x),$$

where $(-)^{{\sf T}}$ denotes the transpose of $(-)$ .

Lemma 2.3 (Cauchy–Schwarz inequality).

For all $n\geq 1$ , $u_{i},v_{i}\in\mathbb{R}(1\leq i\leq n)$ ,

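$$\Big(\sum_{i=1}^{n}u_{i}v_{i}\Big)^{2}\leq\Big(\sum_{i=1}^{n}u_{i}^{2}\Big)\Big(\sum_{i=1}^{n}v_{i}^{2}\Big).$$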

Lemma 2.4 (Taylor series).

For $\alpha\in\mathbb{R}$ and $0<\alpha<1$ ,

$$\sum_{t\geq 0}\alpha^{t}=\frac{1}{1-\alpha}$$

and

$$\sum_{t\geq 1}t\alpha^{t-1}=\frac{1}{(1-\alpha)^{2}}.$$

Lemma 2.5 (Upper bound for the harmonic series).

For $N\in\mathbb{N}$ ,

$$\sum_{n=1}^{N}\frac{1}{n}\leq\ln N+1.$$

Lemma 2.6.

For $N\in\mathbb{N}$ ,

$$\sum_{n=1}^{N}\frac{1}{\sqrt{n}}\leq 2\sqrt{N}.$$

Lemma 2.7.

For all $n\in\mathbb{N}$ and $a_{i},b_{i}\in\mathbb{R}$ such that $a_{i}\geq 0$ and $b_{i}>0$ for all $i\in[n]$ ,

$$\frac{\sum_{i=1}^{n}a_{i}}{\sum_{j=1}^{n}b_{j}}\leq\sum_{i=1}^{n}\frac{a_{i}}{b_{i}}.$$

Lemma 2.8.

([3, Lemma 3 in arXiv version]) For any $Q\in\mathcal{S}^{d}_{+}$ and convex feasible set $\mathcal{F}\subseteq\mathbb{R}^{d}$, suppose $u_{1}=\min_{x\in\mathcal{F}}\lVert Q^{1/2}(x-z_{1})\rVert$ and $u_{2}=\min_{x\in\mathcal{F}}\lVert Q^{1/2}(x-z_{2})\rVert$. Then, we have

$$\lVert Q^{1/2}(u_{1}-u_{2})\rVert\leq\lVert Q^{1/2}(z_{1}-z_{2})\rVert.$$

3. Issue in the convergence proof of AMSGrad

Before showing the issue in the convergence proof of AMSGrad, let us recall and prove the following inequality, which also appears in [3].

Lemma 3.1.

Algorithm 1 achieves the following guarantee, for all $T\geq 1$ :

[TABLE]

Proof.

We note that

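$$x_{t+1}=\prod_{\mathcal{F},\sqrt{\hat{V}_{t}}}(x_{t}-\alpha_{t}\cdot\hat{V}_{t}^{-1/2}m_{t})=\min_{x\in\mathcal{F}}\lVert\hat{V}_{t}^{1/4}(x-(x_{t}-\alpha_{t}\hat{V}_{t}^{-1/2}m_{t}))\rVert$$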

and $\prod_{\mathcal{F},\sqrt{\hat{V}_{t}}}(x^{*})=x^{*}$ for all $x^{*}\in\mathcal{F}$ . For all $1\leq t\leq T$ , put $g_{t}=\nabla_{x}f_{t}(x_{t})$ . Using Lemma 2.8 with $u_{1}=x_{t+1}$ and $u_{2}=x^{*}$ , we have

[TABLE]

This yields

[TABLE]

Therefore, we obtain

[TABLE]

Moreover, by Lemma 2.2, we have $f_{t}(x^{*})-f_{t}(x_{t})\geq g_{t}^{\text{T}}(x^{*}-x_{t})$ , where $g_{t}^{{\sf T}}$ denotes the transpose of vector $g_{t}$ . This means that

[TABLE]

Hence,

[TABLE]

Combining (3) with (3.1.2), we obtain

[TABLE]

On the other hand, for all $t\geq 2$ , we have

[TABLE]

where the inequality is from the fact that $ab\leq a^{2}/2+b^{2}/2$ for any $a,b$ . Hence,

[TABLE]

Therefore, we obtain

[TABLE]

Since $\beta_{1,t}\leq\beta_{1}(1\leq t\leq T)$ , we obtain

[TABLE]

Moreover, we have

[TABLE]

where the last inequality is from the assumption that $\beta_{1,t}\leq\beta_{1}<1(1\leq t\leq T)$ . Therefore,

[TABLE]

and we obtain the desired bound for $R(T)$ . ∎

Issue in the convergence proof of AMSGrad. We denote the terms on the right-hand side of the upper bound for $R(T)$ in Lemma 3.1 as

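$$\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\sqrt{\hat{v}_{t,i}}}{2\alpha_{t}(1-\beta_{1,t})}\big((x_{t,i}-x^{*}_{,i})^{2}-(x_{t+1,i}-x^{*}_{,i})^{2}\big),\qquad(3.1.3)$$

$$\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\alpha_{t}}{1-\beta_{1}}\frac{m_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}},\qquad(3.1.4)$$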

and

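$$\sum_{i=1}^{d}\sum_{t=2}^{T}\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,i}}}{2\alpha_{t-1}(1-\beta_{1})}(x_{t,i}-x^{*}_{,i})^{2}.\qquad(3.1.5)$$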

The issue in the proof of the convergence theorem of AMSGrad [3, Theorem 4] becomes apparent on examining the term (3.1.3). Indeed, in [3, page 18], Reddi et al. used the property that $\beta_{1,t}\leq\beta_{1}$ (concretely, on page 18 of [3], it is stated that “The […] inequality use the fact that $\beta_{1,t}\leq\beta_{1}$.”), and hence

[TABLE]

to replace all $\beta_{1,t}$ by $\beta_{1}$ as

[TABLE]

However, the first inequality (in red) is not guaranteed because the quantity

[TABLE]

in (3.1.3) may be both negative and positive as shown in Example 3.2. This is also a neglected issue in the convergence proofs in Kingma and Ba [2, Theorem 10.5], Luo et al. [5, Theorem 4], Bock et al. [6, Theorem 4.4], and Chen and Gu [7, Theorem 4.2].

Example 3.2 (for AMSGrad convergence proof).

We use the function in the Synthetic Experiment of Reddi et al. [3, Page 6]

[TABLE]

with the constraint set $\mathcal{F}=[-1,1]$. The optimal solution is $x^{*}=-1$. By the proof of [3, Theorem 1], the initial point is $x_{1}=1$. By Algorithm 1, $m_{0}=0$, $v_{0}=0$, and $\hat{v}_{0}=0$. We choose $\beta_{1}=0.9$, $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ with $\lambda=0.001$, $\beta_{2}=0.999$, and $\alpha_{t}=\alpha/\sqrt{t}$ with $\alpha=0.001$. Under this setting, we have $f_{1}(x_{1})=1010x_{1}$, $f_{2}(x_{2})=-10x_{2}$, $f_{3}(x_{3})=-10x_{3}$, and hence

[TABLE]

Therefore,

[TABLE]

Since $x_{1}-\alpha_{1}m_{1}/\sqrt{\hat{v}_{1}}>0$ , we have

[TABLE]

Hence,

[TABLE]

At $t=2$ , we have

[TABLE]

Therefore,

[TABLE]

Since $x_{2}-\alpha_{2}m_{2}/\sqrt{\hat{v}_{2}}>0$ , we obtain

[TABLE]

Hence,

[TABLE]
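The sign change in Example 3.2 can be checked numerically. The sketch below assumes the hyper-parameters stated in the example and the gradients $1010,-10,-10$ of the three linear losses, and tracks $q_{t}=\sqrt{\hat{v}_{t}}/(\alpha_{t}(1-\beta_{1,t}))$; the function name is hypothetical. Because the losses are linear, $q_{t}$ does not depend on the iterates, so no projection step is needed here.

```python
import math

def amsgrad_quantity_differences(grads, alpha=0.001, beta1=0.9,
                                 lam=0.001, beta2=0.999):
    """Return q_t - q_{t-1} for t >= 2, where
    q_t = sqrt(v_hat_t) / (alpha_t * (1 - beta_{1,t}))."""
    v = 0.0
    v_hat = 0.0
    q_prev = None
    diffs = []
    for t, g in enumerate(grads, start=1):
        beta1_t = beta1 * lam ** (t - 1)
        alpha_t = alpha / math.sqrt(t)
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v)              # AMSGrad running maximum
        q = math.sqrt(v_hat) / (alpha_t * (1.0 - beta1_t))
        if q_prev is not None:
            diffs.append(q - q_prev)
        q_prev = q
    return diffs

diffs = amsgrad_quantity_differences([1010.0, -10.0, -10.0])
print(diffs)  # first entry is negative, second is positive
```

The first difference is negative because $1-\beta_{1,t}$ jumps from $0.1$ at $t=1$ to about $0.9991$ at $t=2$, which outweighs the growth of $1/\alpha_{t}$.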

Outline of our solution. Let us rewrite (3.1.3) as

[TABLE]

Omitting the term $\sum_{i=1}^{d}\frac{\sqrt{\hat{v}_{T,i}}}{2\alpha_{T}(1-\beta_{1,T})}(x_{T+1,i}-x^{*}_{,i})^{2}$ , we obtain

[TABLE]

in which the differences with Reddi et al. [3] are highlighted in the boxes, namely, $\boxed{\beta_{1,t}}$ and $\boxed{\beta_{1,t-1}}$ instead of $\beta_{1}$ .

We suggest two ways to overcome these differences depending on the setting of $\beta_{1,t}(1\leq t\leq T)$ :

• In Section 4: If either $\beta_{1,t}\overset{\Delta}{=}\beta_{1}\lambda^{t-1}$ or $\beta_{1,t}\overset{\Delta}{=}1/t$ $(1\leq t\leq T)$, where $0\leq\beta_{1}<1$ and $0<\lambda<1$, then we give a new convergence theorem for AMSGrad.

• In Section 5: If the setting for $\beta_{1,t}$ $(1\leq t\leq T)$ is general, as in the statement of Theorem A, then we suggest a new (slightly modified) version of AMSGrad.

4. New convergence theorem for AMSGrad

When either $\beta_{1,t}\overset{\Delta}{=}\beta_{1}\lambda^{t-1}$ or $\beta_{1,t}\overset{\Delta}{=}1/t$ , $(1\leq t\leq T)$ , where $0\leq\beta_{1}<1$ and $0<\lambda<1$ , Theorem A can be fixed as follows, in which the upper bounds of the regret $R(T)$ are changed.

Theorem 4.1 (Fixes for Theorem A).

Let $x_{t}$ and $v_{t}$ be the sequences obtained from Algorithm 1, $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$, either $\beta_{1,t}=\beta_{1}\lambda^{t-1}$, where $\lambda\in(0,1)$, or $\beta_{1,t}=\frac{\beta_{1}}{t}$ for all $t\in[T]$, and $\gamma=\frac{\beta_{1}}{\sqrt{\beta_{2}}}\leq 1$. Assume that $\mathcal{F}$ has bounded diameter $D_{\infty}$ and $\lVert{\nabla f_{t}(x)}\rVert_{\infty}\leq G_{\infty}$ for all $t\in[T]$ and $x\in\mathcal{F}$. Then, for $x_{t}$ generated using AMSGrad (Algorithm 1), there is some $1\leq t_{0}\leq T$ such that the following bound on the regret holds for all $T\geq 1$:

[TABLE]

provided $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ , and

[TABLE]

provided $\beta_{1,t}=\frac{\beta_{1}}{t}$ .

To prove Theorem 4.1, we need the following Lemmas 4.2, 4.3, and 4.4.

Lemma 4.2.

$\sqrt{\hat{v}_{t}}\leq G_{\infty}$ .

Proof.

From the definition of $\hat{v}_{t}$ in AMSGrad’s algorithm, it is implied that $\hat{v}_{t}=\max\{v_{1},...,v_{t}\}$ . Therefore, there is some $1\leq s\leq t$ such that $\hat{v}_{t}=v_{s}$ . Hence,

[TABLE]

where the last inequality is by Lemma 2.4. ∎
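One way to see the bound, assuming the recursion $v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2}$ with $v_{0}=0$ from Algorithm 1, reading the inequality coordinate-wise, and using $|g_{k,i}|\leq G_{\infty}$:

```latex
v_{s,i} = (1-\beta_{2})\sum_{k=1}^{s}\beta_{2}^{\,s-k}\, g_{k,i}^{2}
\;\le\; (1-\beta_{2})\, G_{\infty}^{2}\sum_{k=1}^{s}\beta_{2}^{\,s-k}
\;=\; G_{\infty}^{2}\,\bigl(1-\beta_{2}^{\,s}\bigr)
\;\le\; G_{\infty}^{2},
\qquad\text{so}\qquad
\sqrt{\hat{v}_{t,i}} = \sqrt{v_{s,i}} \;\le\; G_{\infty}.
```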

Lemma 4.3.

If either $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ or $\beta_{1,t}=\beta_{1}/t$ , then there exists some $t_{0}$ such that for every $t>t_{0}$ ,

[TABLE]

Proof.

Since $\hat{v}_{t,i}\geq\hat{v}_{t-1,i}$ , it is sufficient to prove that there exists some $t_{0}$ such that for every $t>t_{0}$ ,

[TABLE]

In other words,

[TABLE]

When $\beta_{1,t}=\beta_{1}/t$ , from (4.3.1) we have

[TABLE]

When $\beta_{1,t}=\beta_{1}\lambda^{t-1}$, (4.3.1) has the following form

[TABLE]

Since $\beta_{1}$ and $\lambda$ are smaller than $1$ , it is easy to see that when $t$ is sufficiently large, meaning that $t>t_{0}$ for some $t_{0}$ , the left-hand side of (4.3.2) is $1-O(1/t^{2})$ and the left-hand side of (4.3.3) is larger than $1-\beta_{1}\lambda^{t-2}=1-O(\lambda^{t-2})$ . Therefore, (4.3.2) and (4.3.3) hold when $t$ is sufficiently large. ∎
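The threshold $t_{0}$ in Lemma 4.3 can be located numerically for a given schedule. The sketch below assumes that the sufficient condition in the proof reads $\sqrt{t}/(1-\beta_{1,t})\geq\sqrt{t-1}/(1-\beta_{1,t-1})$, which is inferred from the inequality quoted later in the proof of Theorem 4.1 together with $\hat{v}_{t,i}\geq\hat{v}_{t-1,i}$; the helper `find_t0` and the horizon $T=100$ are illustrative.

```python
import math

def find_t0(beta1_schedule, T):
    """Largest t in [2, T] where sqrt(t)/(1-beta_{1,t}) falls below
    sqrt(t-1)/(1-beta_{1,t-1}); returns 1 if that never happens."""
    t0 = 1
    for t in range(2, T + 1):
        lhs = math.sqrt(t) / (1.0 - beta1_schedule(t))
        rhs = math.sqrt(t - 1) / (1.0 - beta1_schedule(t - 1))
        if lhs < rhs:
            t0 = t
    return t0

beta1 = 0.9
geometric = lambda t: beta1 * 0.001 ** (t - 1)   # beta_{1,t} = beta_1 * lambda^(t-1)
harmonic = lambda t: beta1 / t                   # beta_{1,t} = beta_1 / t

print(find_t0(geometric, 100), find_t0(harmonic, 100))
```

With these settings the inequality fails only at the first one or two steps, which matches the claim that it holds once $t$ is sufficiently large.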

Lemma 4.4.

For the parameter settings and conditions assumed in Theorem 4.1, we have

[TABLE]

Proof.

The proof is almost identical to that of [3, Lemma 2]. Since for all $t\geq 1$ , $\hat{v}_{t,i}\geq v_{t,i}$ , we have

[TABLE]

where the second inequality is by Lemma 2.3, the third inequality is from the properties of $\beta_{1,k}\leq 1$ and $\beta_{1,k}\leq\beta_{1}$ for all $1\leq k\leq T$ , and the fourth inequality is obtained by applying Lemma 2.4 to $\sum_{k=1}^{t}\beta_{1}^{t-k}$ . Therefore,

[TABLE]

where the second inequality is by Lemma 2.7. Therefore

[TABLE]

It is sufficient to consider $\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\sum_{k=1}^{t}\gamma^{t-k}|{g_{k,i}}|$ . Firstly, $\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\sum_{k=1}^{t}\gamma^{t-k}|{g_{k,i}}|$ can be expanded as

[TABLE]

Changing the role of $|g_{1,i}|$ as the common factor, we obtain

[TABLE]

In other words,

[TABLE]

Moreover, since

[TABLE]

where the last inequality is by Lemma 2.4, we obtain

[TABLE]

Furthermore, since

[TABLE]

where the first inequality is by Lemma 2.3 and the last inequality is by Lemma 2.5, we obtain

[TABLE]

Hence, by (4.4.1),

[TABLE]

which ends the proof. ∎

Let us now prove Theorem 4.1.

Proof of Theorem 4.1.

To prove Theorem 4.1, by Lemma 3.1, we need to bound the terms (3.1.3), (3.1.4), and (3.1.5). First, we consider (3.1.4). We have

[TABLE]

where the equality is by the assumption that $\alpha_{t}=\alpha/\sqrt{t}$ and the last inequality is by Lemma 4.4. Next, we consider (3.1.5). The bound for (3.1.5) depends on either $\beta_{1,t}=\beta_{1}\lambda^{t-1}(0<\lambda<1)$ or $\beta_{1,t}=\frac{\beta_{1}}{t}$ . Recall that by assumption, $\lVert{x_{m}-x_{n}}\rVert_{\infty}\leq D_{\infty}$ for any $m,n\in\{1,...,T\}$ , $\alpha_{t}=\alpha/\sqrt{t}$ . If $\beta_{1,t}=\beta_{1}\lambda^{t-1}(0<\lambda<1)$ , then,

[TABLE]

where the first inequality is from Lemma 4.2 and the assumption that $\beta_{1}\leq 1$ , the last inequality is by Lemma 2.4. If $\beta_{1,t}=\frac{\beta_{1}}{t}$ , then,

[TABLE]

where the first inequality is from Lemma 4.2 and the assumption that $\beta_{1}\leq 1$ , and the last inequality is by Lemma 2.6.

Finally, we will bound (3.1.3). From the inequality (3) and replacing $\alpha_{t}$ with $\frac{\alpha}{\sqrt{t}}(1\leq t\leq T)$ , we obtain

[TABLE]

By Lemma 4.3, there is some $t_{0}(1\leq t_{0}\leq T)$ such that $\frac{\sqrt{t\hat{v}_{t,i}}}{1-\beta_{1,t}}\geq\frac{\sqrt{(t-1)\hat{v}_{t-1,i}}}{1-\beta_{1,t-1}}$ for all $t>t_{0}$ . Therefore,

[TABLE]

Since

[TABLE]

we have

[TABLE]

where the second inequality is obtained by omitting the term $\frac{1}{2\alpha}\sum_{i=1}^{d}\sum_{t=2}^{t_{0}}(x_{t,i}-x^{*}_{,i})^{2}\frac{\sqrt{(t-1)\hat{v}_{t-1,i}}}{1-\beta_{1,t-1}}$ , and the last inequality is by Lemma 4.2 and the assumption that $\beta_{1,t}\leq\beta_{1}(1\leq t\leq T)$ . Summing up, if $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ , then, from (4), (4), and (4), we obtain

[TABLE]

If $\beta_{1,t}=\frac{\beta_{1}}{t}$, then, from (4), (4), and (4), we obtain

[TABLE]

which ends the proof. ∎

The following corollary shows that, when either $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ or $\beta_{1,t}=1/t$ , $(1\leq t\leq T)$ , where $0\leq\beta_{1}<1$ and $0<\lambda<1$ , the average regret of AMSGrad converges.

Corollary 4.5.

With the same assumption as in Theorem 4.1, AMSGrad achieves the following guarantee:

[TABLE]

Proof.

The result is obtained by using Theorem 4.1 and the following fact:

[TABLE]

where the inequality is from the assumption that $\lVert{g_{t}}\rVert_{\infty}\leq G_{\infty}$ for all $t\in[T]$ . ∎

5. New version of AMSGrad optimizer: AdamX

Let $f_{1},f_{2},...,f_{T}:\mathcal{F}\to\mathbb{R}$ be an arbitrary sequence of convex cost functions. If the system $\{\beta_{1,t}\}_{1\leq t\leq T}$ is kept arbitrary, as in the setting of Theorem A, to ensure that the regret $R(T)$ satisfies $R(T)/T\to 0$, we suggest a new algorithm as follows.
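The part of AdamX that differs from AMSGrad can be sketched as follows. This is a minimal scalar sketch, assuming the $\hat{v}_{t}$ update rule recalled in the proof of Lemma 5.2 and the usual recursion for $v_{t}$; the function names, the gradient sequence, and the schedule are illustrative, and the parameter update itself is omitted.

```python
import math

def adamx_v_hat(grads, beta1_schedule, beta2=0.999):
    """Return v_hat_1..v_hat_T under the AdamX rule
    v_hat_t = max(((1-beta_{1,t})/(1-beta_{1,t-1}))^2 * v_hat_{t-1}, v_t)."""
    v = 0.0
    v_hats = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
        if t == 1:
            v_hat = v                       # v_hat_1 = v_1
        else:
            ratio = (1.0 - beta1_schedule(t)) / (1.0 - beta1_schedule(t - 1))
            v_hat = max(ratio ** 2 * v_hats[-1], v)
        v_hats.append(v_hat)
    return v_hats

def q_values(v_hats, beta1_schedule, alpha=0.001):
    """q_t = sqrt(v_hat_t) / (alpha_t * (1 - beta_{1,t})), alpha_t = alpha/sqrt(t)."""
    return [math.sqrt(vh) / ((alpha / math.sqrt(t)) * (1.0 - beta1_schedule(t)))
            for t, vh in enumerate(v_hats, start=1)]

schedule = lambda t: 0.9 * 0.001 ** (t - 1)
q = q_values(adamx_v_hat([1010.0, -10.0, -10.0], schedule), schedule)
print(q)  # increasing in t
```

The rescaling factor makes $\sqrt{\hat{v}_{t}}/(1-\beta_{1,t})$ non-decreasing, so dividing by the decreasing step size $\alpha_{t}$ gives an increasing sequence on the same gradients for which the AMSGrad quantity decreases at $t=2$.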

With this Algorithm 2, the regret is bounded as follows.

Theorem 5.1.

Let $x_{t}$ and $v_{t}$ be the sequences obtained from Algorithm 2, $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$, $\beta_{1}=\beta_{1,1}$, $\beta_{1,t}\leq\beta_{1}$ for all $t\in[T]$ and $\frac{\beta_{1}}{\sqrt{\beta_{2}}}\leq 1$. Assume that $\mathcal{F}$ has bounded diameter $D_{\infty}$ and $\lVert{\nabla f_{t}(x)}\rVert_{\infty}\leq G_{\infty}$ for all $t\in[T]$ and $x\in\mathcal{F}$. For $x_{t}$ generated using AdamX (Algorithm 2), we have the following bound on the regret:

[TABLE]

To prove Theorem 5.1, we need the following Lemmas 5.2, 5.3, and 5.4.

Lemma 5.2.

For all $t\geq 1$ , we have

[TABLE]

where $\hat{v}_{t}$ is in Algorithm 2.

Proof.

We will prove (5.2.1) by induction on $t$ . Recall that by the update rule on $\hat{v}_{t}$ , we have $\hat{v}_{1}\overset{\Delta}{=}v_{1}$ and $\hat{v}_{t}\overset{\Delta}{=}\max\{\frac{(1-\beta_{1,t})^{2}}{(1-\beta_{1,t-1})^{2}}\hat{v}_{t-1},v_{t}\}$ if $t\geq 2$ . Therefore,

[TABLE]

Assume that

[TABLE]

and that (5.2.1) holds for all $1\leq j\leq t-1$. Since

[TABLE]

we have

[TABLE]

which ends the proof. ∎

Lemma 5.3.

For all $t\geq 1$ , we have $\sqrt{\hat{v}_{t}}\leq\frac{G_{\infty}}{1-\beta_{1}}$ , where $\hat{v}_{t}$ is in Algorithm 2.

Proof.

By Lemma 5.2,

[TABLE]

Therefore, there is some $1\leq s\leq t$ such that $\hat{v}_{t}=\frac{(1-\beta_{1,t})^{2}}{(1-\beta_{1,s})^{2}}v_{s}$ . Hence,

[TABLE]

which ends the proof. ∎

Lemma 5.4.

For the parameter settings and conditions assumed in Theorem 5.1, we have

[TABLE]

Proof.

Since for all $t\geq 1$

[TABLE]

by Lemma 5.2, we have $\hat{v}_{t,i}\geq v_{t,i}$ , and hence the proof is the same as that of Lemma 4.4. ∎

Proof of Theorem 5.1.

Similarly to the proof of Theorem 4.1, we need to bound (3.1.3), (3.1.4), and (3.1.5). By using Lemma 5.4, we obtain the same bound for (3.1.4) as in the proof of Theorem 4.1, that is,

[TABLE]

where the last inequality is by Lemma 5.4. Now we bound (3.1.5). By the assumption that $\lVert{x_{m}-x_{n}}\rVert_{\infty}\leq D_{\infty}$ for any $m,n\in\{1,...,T\}$ , $\alpha_{t}=\alpha/\sqrt{t}$ , and $\beta_{1,t}=\beta_{1}\lambda^{t-1}\leq\beta_{1}\leq 1$ , we obtain

[TABLE]

Therefore, from Lemma 5.3, we obtain

[TABLE]

Finally, we will bound (3.1.3). By the inequality (3) and replacing $\alpha_{t}=\frac{\alpha}{\sqrt{t}}(1\leq t\leq T)$ , we obtain

[TABLE]

Moreover, by the update rule of Algorithm 2, we have

[TABLE]

Therefore, $\hat{v}_{t,i}\geq\frac{(1-\beta_{1,t})^{2}}{(1-\beta_{1,t-1})^{2}}\hat{v}_{t-1,i}$ , and hence

[TABLE]

Now by the positivity of the essential formula $\frac{\sqrt{t\hat{v}_{t,i}}}{1-\beta_{1,t}}-\frac{\sqrt{(t-1)\hat{v}_{t-1,i}}}{1-\beta_{1,t-1}}$ , we obtain

[TABLE]

where the last inequality is by Lemma 5.3. Hence we obtain the desired upper bound for $R(T)$ . ∎

Corollary 5.5.

With the same assumption as in Theorem 5.1, and for all $0\leq\beta_{1,t}<1$ satisfying

[TABLE]

AdamX achieves the following guarantee:

[TABLE]

Proof.

By Theorem 5.1, it is sufficient to consider the term

[TABLE]

on the right-hand side of the upper bound for $R(T)$ in Theorem 5.1. Because $\frac{dD_{\infty}^{2}G_{\infty}}{2\alpha(1-\beta_{1})^{2}}$ is bounded and does not depend on $T$, the statement follows. ∎

When either $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ for some $\lambda\in(0,1)$ , or $\beta_{1,t}=\frac{1}{t}$ in Theorem 5.1, we obtain the following guarantee that the average regret of AdamX converges.

Corollary 5.6.

With the same assumption as in Theorem 5.1, and either $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ for some $\lambda\in(0,1)$ , or $\beta_{1,t}=\frac{1}{t}$ , AdamX achieves the following guarantee:

[TABLE]

Proof.

By Corollary 5.5, it is sufficient to consider the term

[TABLE]

When $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ for some $\lambda\in(0,1)$ , we have

[TABLE]

where the first inequality is from the property that $\beta_{1}\leq 1$ , and the last inequality is from Lemma 2.4. When $\beta_{1,t}=\frac{1}{t}$ , we obtain

[TABLE]

where the last inequality is from Lemma 2.6. Now, by combining (5) and (5) with Corollary 5.5, we obtain the desired result. ∎

6. Experiments

While we consider our main contributions to be the theoretical analyses of AMSGrad and AdamX in the previous sections, we provide experimental results for both in this section. Concretely, we use the PyTorch code for AMSGrad (https://pytorch.org/docs/stable/_modules/torch/optim/adam.html) by setting the boolean flag amsgrad = True. The code for AdamX is based on that of AMSGrad, with corresponding modifications as in Algorithm 2. The parameters for AMSGrad and AdamX are identical in our experiments, namely $(\beta_{1},\beta_{2})=(0.9,0.999)$, the term added to the denominator to improve numerical stability is $\epsilon=10^{-8}$, and additionally we set $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ with $\lambda=0.001$ to make use of Corollary 5.6 on the convergence of AdamX.

The learning rate is scheduled for both optimizers AMSGrad and AdamX as follows: $10^{-3}$, $10^{-4}$, $10^{-5}$, $10^{-6}$, $10^{-6}/2$ if the epoch is correspondingly in the ranges $[0,80]$, $[81,120]$, $[121,160]$, $[161,180]$, $[181,200]$. We use CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), containing 50000 training images and 10000 test images of size $32\times 32$, as the dataset and the residual networks ResNet18 [8] and PreActResNet18 [9] for training with batch size 128. The testing result is given in Figure 1, where one can see that AMSGrad and AdamX behave similarly, which supports our theoretical results on the convergence of both AMSGrad (Section 4) and AdamX (Section 5).
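The schedules described above can be written down directly. The sketch below only encodes the epoch ranges and the $\beta_{1,t}$ decay stated in the text; the function names are hypothetical, and how the values are fed into an optimizer is left open.

```python
def learning_rate(epoch):
    """Piecewise-constant learning rate for epochs 0..200, as listed in the text."""
    if not 0 <= epoch <= 200:
        raise ValueError("schedule is only specified for epochs 0..200")
    if epoch <= 80:
        return 1e-3
    if epoch <= 120:
        return 1e-4
    if epoch <= 160:
        return 1e-5
    if epoch <= 180:
        return 1e-6
    return 1e-6 / 2

def beta1_t(t, beta1=0.9, lam=0.001):
    """Decayed first-moment coefficient beta_{1,t} = beta_1 * lambda^(t-1), t >= 1."""
    return beta1 * lam ** (t - 1)
```

With $\lambda=0.001$ the coefficient $\beta_{1,t}$ is already below $10^{-3}$ at $t=2$, so the momentum term is effectively active only at the first step.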

7. Conclusion

We have shown that the convergence proof of AMSGrad [3] is problematic, and presented various fixes for it, which include a new and slightly modified version called AdamX. Along the lines, we also observe that the issue has been neglected in various works such as in [2, Theorem 10.5], [5, Theorem 4], [6, Theorem 4.4], [7, Theorem 4.2]. Our work helps ensure the theoretical foundation of those optimizers.

Bibliography

[1] Herbert Robbins and Sutton Monro, “A stochastic approximation method”, The Annals of Mathematical Statistics, vol. 22, no. 3, 1951, pp. 400–407.
[2] Diederik P. Kingma and Jimmy Ba. (2015). Adam: A method for stochastic optimization. Presented at International Conference on Learning Representations (ICLR). [Online]. Available: https://arxiv.org/pdf/1412.6980.pdf
[3] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. (2018). On the convergence of Adam and beyond. Presented at International Conference on Learning Representations (ICLR). [Online]. Available: https://openreview.net/pdf?id=ryQu7f-RZ
[4] H. Brendan McMahan and Matthew Streeter, “Adaptive Bound Optimization for Online Convex Optimization”, Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010, pp. 244–256. Also in CoRR, abs/1002.4908, 2010. [Online]. Available: https://arxiv.org/abs/1002.4908
[5] Liangchen Luo, Yuanhao Xiong, and Yan Liu. (2019). Adaptive gradient methods with dynamic bound of learning rate. Presented at International Conference on Learning Representations (ICLR). [Online]. Available: https://openreview.net/pdf?id=Bkg3g2R9FX
[6] Sebastian Bock, Josef Goppold, and Martin Weiß, “An improvement of the convergence proof of the Adam-optimizer”, CoRR, abs/1804.10587, 2018. [Online]. Available: https://arxiv.org/pdf/1804.10587.pdf
[7] Jinghui Chen and Quanquan Gu, “Closing the generalization gap of adaptive gradient methods in training deep neural networks”, CoRR, abs/1806.06763, 2018. [Online]. Available: https://arxiv.org/pdf/1806.06763.pdf
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition”, CVPR 2016, pp. 770–778, 2016. [Online]. Available: https://arxiv.org/abs/1512.03385
