On the Convergence Proof of AMSGrad and a New Version
Tran Thi Phuong, Le Trieu Phong

TL;DR
This paper critically examines the convergence proofs of AMSGrad, identifies issues with hyper-parameter handling, and proposes fixes including a new version called AdamX, supported by theoretical analysis and experiments.
Contribution
It reveals flaws in the convergence proof of AMSGrad, provides a corrected proof, and introduces a new variant AdamX with empirical validation.
Findings
Convergence proof of AMSGrad is flawed due to hyper-parameter handling.
A corrected convergence proof for AMSGrad is provided.
The new AdamX algorithm behaves similarly to AMSGrad in experiments on CIFAR-10.
On the Convergence Proof of AMSGrad and a New Version⋆
Tran Thi Phuong (∗, ∗∗, ∗∗∗)
Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam. Postal address: 19 Nguyen Huu Tho street, Tan Phong ward, District 7, Ho Chi Minh City, Vietnam.
Meiji University. Postal address: 1-1-1 Higashi-Mita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan.
and
Le Trieu Phong (∗∗∗)
National Institute of Information and Communications Technology (NICT). Postal address: 4-2-1, Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan.
Abstract.
The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad is in handling the hyper-parameters, treating them as equal while they are not. This is also the neglected issue in the convergence proof of Adam. We provide an explicit counter-example in a simple convex optimization setting to show this neglected issue. Depending on how the hyper-parameters are handled, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our experiments on a benchmark dataset also support our theoretical results.
Key words and phrases. Optimizer, adaptive moment estimation, Adam, AMSGrad, deep neural networks.
⋆A version of this paper appears at IEEE Access DOI: 10.1109/ACCESS.2019.2916341
Contents
- 1 Introduction and our contributions
- 2 Preliminaries
- 3 Issue in the convergence proof of AMSGrad
- 4 New convergence theorem for AMSGrad
- 5 New version of AMSGrad optimizer: AdamX
- 6 Experiments
- 7 Conclusion
1. Introduction and our contributions
One of the most popular algorithms for training deep neural networks is stochastic gradient descent (SGD) [1] and its variants. Among the various variants of SGD, the algorithm with the adaptive moment estimation Adam [2] is widely used in practice. However, Reddi et al. [3] have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad to solve this issue.
Our contribution. In this paper, we point out a flaw in the convergence proof of AMSGrad. We then fix this flaw by providing a new convergence proof for AMSGrad in the case of special parameters. In addition, in the case of general parameters, we propose a new and slightly modified version of AMSGrad.
To provide more details, let us recall AMSGrad in Algorithm 1; the mathematical notation is defined in Section 2.
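As a reading aid, the following is a minimal Python sketch of the AMSGrad update as described by Reddi et al. [3], under stated assumptions: the feasible set is taken to be a box, for which the projection weighted by a diagonal matrix reduces to coordinate-wise clipping; the step size is $\alpha_t = \alpha/\sqrt{t}$; and $\beta_{1t} = \beta_1 \cdot \texttt{beta1\_decay}^{t-1}$. The function name `amsgrad`, the parameter `beta1_decay`, the default values, and the quadratic test loss are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def amsgrad(grad, x1, T, alpha=0.1, beta1=0.9, beta2=0.99,
            beta1_decay=1.0, lower=-1.0, upper=1.0):
    """Run T AMSGrad steps on the box [lower, upper]^d; return x_1, ..., x_{T+1}.

    grad(t, x) must return the gradient of f_t at x as a list of floats.
    """
    d = len(x1)
    x = list(x1)
    m = [0.0] * d        # first-moment estimate m_t
    v = [0.0] * d        # second-moment estimate v_t
    v_hat = [0.0] * d    # running maximum of v_t (the AMSGrad modification)
    iterates = [list(x)]
    for t in range(1, T + 1):
        g = grad(t, x)
        alpha_t = alpha / math.sqrt(t)
        beta1_t = beta1 * beta1_decay ** (t - 1)
        for i in range(d):
            m[i] = beta1_t * m[i] + (1 - beta1_t) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            v_hat[i] = max(v_hat[i], v[i])
            if v_hat[i] > 0:
                step = alpha_t * m[i] / math.sqrt(v_hat[i])
                # On a box, the weighted projection is coordinate-wise clipping.
                x[i] = min(upper, max(lower, x[i] - step))
        iterates.append(list(x))
    return iterates

# Hypothetical usage: f_t(x) = 0.5 * ||x - c||^2 for every t, so grad f_t(x) = x - c.
c = [0.5, -0.25]
iterates = amsgrad(lambda t, x: [xi - ci for xi, ci in zip(x, c)],
                   x1=[-1.0, 1.0], T=200)
```

The box constraint is chosen only because it keeps the projection trivial; a general convex feasible set would need a proper weighted projection.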
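The main theorem for the convergence of AMSGrad in [3] is as follows. To simplify the notation, we define $g_t := \nabla f_t(x_t)$, $g_{t,i}$ as the $i$-th element of $g_t$, and $g_{1:t,i} \in \mathbb{R}^t$ as a vector that contains the $i$-th dimension of the gradients over all iterations up to $t$, namely, $g_{1:t,i} = [g_{1,i}, g_{2,i}, \ldots, g_{t,i}]$.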
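Theorem A (Theorem 4 in [3], problematic).
Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 1, $\alpha_t = \alpha/\sqrt{t}$, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t \in [T]$ and $\gamma = \beta_1/\sqrt{\beta_2} < 1$. Assume that $\mathcal{F}$ has bounded diameter $D_\infty$ and $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t \in [T]$ and $x \in \mathcal{F}$. For $x_t$ generated using AMSGrad (Algorithm 1), we have the following bound on the regret: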
[TABLE]
In their proof for Theorem A, Reddi et al. resolved an issue on the so-called telescopic sum in the convergence proof of Adam ([2, Theorem 10.5]). Specifically, Reddi et al. adjusted $\hat{v}_t$ such that all components in the vector
\[ \frac{\sqrt{\hat{v}_t}}{\alpha_t} - \frac{\sqrt{\hat{v}_{t-1}}}{\alpha_{t-1}} \tag{1.0.1} \]
are always positive. However, there is another issue (shown in Section 3) in the convergence proof of Adam that AMSGrad unfortunately neglects. The issue affects both the correctness of Reddi et al.’s proof and the upper bound for the regret in Theorem A. To deal with the issue in a general way, we propose to modify Algorithm 1 such that all components in the vector
\[ \frac{\sqrt{\hat{v}_t}}{\alpha_t \boxed{(1-\beta_{1t})}} - \frac{\sqrt{\hat{v}_{t-1}}}{\alpha_{t-1} \boxed{(1-\beta_{1,t-1})}} \]
are always positive. The differences with (1.0.1) are highlighted in the boxes for clarity.
Paper roadmap. We begin with preliminaries in Section 2. We show where the proof of Theorem A becomes invalid in Section 3. After that, we suggest two ways to resolve the issue in Sections 4 and 5.
Subsequent works. The first version of this paper publicly appeared on arXiv on 7 April 2019. On 19 April 2019, Reddi et al. revised their original proofs (https://arxiv.org/abs/1904.09237). The revised proof does not suffer from the issue pointed out in Section 3 of this paper, although it introduces a constant factor that was missing from the original claims.
2. Preliminaries
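Notation. Given a sequence of vectors $\{x_t\}_{1 \le t \le T}$ in $\mathbb{R}^d$, we denote its $i$-th coordinate by $x_{t,i}$ and use $x_t^k$ to denote the elementwise $k$-th power of $x_t$ and $\|x_t\|_2$, resp. $\|x_t\|_\infty$, to denote its $\ell_2$-norm, resp. $\ell_\infty$-norm. Let $\mathcal{F} \subseteq \mathbb{R}^d$ be a feasible set of points such that $\mathcal{F}$ has bounded diameter $D_\infty$, that is, $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$, and let $\mathcal{S}_+^d$ denote the set of all positive definite $d \times d$ matrices. For a matrix $A \in \mathcal{S}_+^d$, we denote by $A^{1/2}$ the square root of $A$. The projection operation $\Pi_{\mathcal{F},A}(y)$ for $A \in \mathcal{S}_+^d$ is defined as $\operatorname{argmin}_{x \in \mathcal{F}} \|A^{1/2}(x - y)\|_2$ for all $y \in \mathbb{R}^d$. When $d = 1$ and $\mathcal{F} \subset \mathbb{R}$, the positive definite matrix $A$ is a positive number, so that the projection $\Pi_{\mathcal{F},A}(y)$ becomes $\operatorname{argmin}_{x \in \mathcal{F}} |x - y|$. We use $\langle x, y \rangle$ to denote the inner product between $x$ and $y \in \mathbb{R}^d$. The gradient of a function $f$ evaluated at $x \in \mathbb{R}^d$ is denoted by $\nabla f(x)$. For vectors $x, y \in \mathbb{R}^d$, we use $\sqrt{x}$ or $x^{1/2}$ for element-wise square root, $x^2$ for element-wise square, and $x/y$ to denote element-wise division. For an integer $n \in \mathbb{N}$, we denote by $[n]$ the set of integers $\{1, 2, \ldots, n\}$.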
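Optimization setup. Let $f_1, f_2, \ldots, f_T : \mathbb{R}^d \to \mathbb{R}$ be an arbitrary sequence of convex cost functions and $x_1 \in \mathbb{R}^d$. At each time $t \ge 1$, the goal is to predict the parameter $x_t$ and evaluate it on a previously unknown cost function $f_t$. Since the nature of the sequence is unknown in advance, the algorithm is evaluated by using the regret, that is, the sum of all the previous differences between the online prediction $f_t(x_t)$ and the best fixed-point parameter $f_t(x^*)$ from a feasible set $\mathcal{F}$ for all the previous steps. Concretely, the regret is defined as
\[ R(T) = \sum_{t=1}^{T} \left[ f_t(x_t) - f_t(x^*) \right], \]
where $x^* = \operatorname{argmin}_{x \in \mathcal{F}} \sum_{t=1}^{T} f_t(x)$.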
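To make the regret concrete, here is a small Python sketch that evaluates the cumulative loss of a sequence of predictions against the best fixed point in hindsight. It assumes scalar predictions and approximates the best fixed point by searching a finite grid of candidates; the helper names `regret` and `best_fixed_point_on_grid` and the toy quadratic losses are hypothetical and serve only as an illustration.

```python
def regret(losses, iterates, x_star):
    """Sum over t of f_t(x_t) - f_t(x_star) for a fixed comparator x_star."""
    return sum(f(x) - f(x_star) for f, x in zip(losses, iterates))

def best_fixed_point_on_grid(losses, candidates):
    """Candidate minimizing the total loss (a grid stand-in for the argmin over the feasible set)."""
    return min(candidates, key=lambda x: sum(f(x) for f in losses))

# Toy sequence of convex losses f_t(x) = (x - c_t)^2 with targets alternating between 0 and 1.
losses = [lambda x, c=c: (x - c) ** 2 for c in (0.0, 1.0, 0.0, 1.0)]
candidates = [i / 100 for i in range(-100, 101)]   # grid on the feasible set [-1, 1]
x_star = best_fixed_point_on_grid(losses, candidates)

# A learner that always predicts 0 pays total loss 2, while x_star = 0.5 pays total loss 1.
iterates = [0.0, 0.0, 0.0, 0.0]
r = regret(losses, iterates, x_star)
```

An algorithm is considered to converge in this setting when the average regret, that is, the regret divided by the number of steps, tends to zero.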
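Definition 2.1.
A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^d$, and all $\lambda \in [0, 1]$,
\[ \lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y). \]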
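Lemma 2.2.
If a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then for all $x, y \in \mathbb{R}^d$,
\[ f(y) \ge f(x) + \nabla f(x)^{\mathsf{T}} (y - x), \]
where $\nabla f(x)^{\mathsf{T}}$ denotes the transpose of $\nabla f(x)$.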
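Lemma 2.3 (Cauchy–Schwarz inequality).
For all $n \ge 1$, $u_i, v_i \in \mathbb{R}$ $(1 \le i \le n)$,
\[ \left( \sum_{i=1}^{n} u_i v_i \right)^2 \le \left( \sum_{i=1}^{n} u_i^2 \right) \left( \sum_{i=1}^{n} v_i^2 \right). \]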
Lemma 2.4 (Taylor series).
For and ,
[TABLE]
and
[TABLE]
Lemma 2.5 (Upper bound for the harmonic series).
For $N \in \mathbb{N}$,
\[ \sum_{t=1}^{N} \frac{1}{t} \le \log N + 1. \]
Lemma 2.6.
For ,
[TABLE]
Lemma 2.7.
For all and such that and for all ,
[TABLE]
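Lemma 2.8 ([3, Lemma 3 in arXiv version]).
For any $Q \in \mathcal{S}_+^d$ and convex feasible set $\mathcal{F} \subset \mathbb{R}^d$, suppose $u_1 = \operatorname{argmin}_{x \in \mathcal{F}} \|Q^{1/2}(x - z_1)\|$ and $u_2 = \operatorname{argmin}_{x \in \mathcal{F}} \|Q^{1/2}(x - z_2)\|$. Then, we have
\[ \|Q^{1/2}(u_1 - u_2)\| \le \|Q^{1/2}(z_1 - z_2)\|. \]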
3. Issue in the convergence proof of AMSGrad
Before showing the issue in the convergence proof of AMSGrad, let us recall and prove the following inequality, which also appears in [3].
Lemma 3.1.
Algorithm 1 achieves the following guarantee, for all :
[TABLE]
Proof.
We note that
[TABLE]
and for all . For all , put . Using Lemma 2.8 with and , we have
[TABLE]
This yields
[TABLE]
Therefore, we obtain
[TABLE]
Moreover, by Lemma 2.2, we have , where denotes the transpose of vector . This means that
[TABLE]
Hence,
[TABLE]
Combining (3) with (3.1.2), we obtain
[TABLE]
On the other hand, for all , we have
[TABLE]
where the inequality is from the fact that for any . Hence,
[TABLE]
Therefore, we obtain
[TABLE]
Since , we obtain
[TABLE]
Moreover, we have
[TABLE]
where the last inequality is from the assumption that . Therefore,
[TABLE]
and we obtain the desired bound for $R(T)$. ∎
Issue in the convergence proof of AMSGrad. We denote the terms on the right-hand side of the upper bound for $R(T)$ in Lemma 3.1 as
[TABLE]
[TABLE]
and
[TABLE]
The issue in the proof of the convergence theorem of AMSGrad [3, Theorem 4] becomes apparent on examining the term (3.1.3). Indeed, in [3, page 18], Reddi et al. used the property that $\beta_{1t} \le \beta_1$ (concretely, on page 18 of [3], it is stated that “The […] inequality use the fact that $\beta_{1t} \le \beta_1$.”), and hence
\[ \frac{1}{1 - \beta_{1t}} \le \frac{1}{1 - \beta_1}, \]
to replace all $\beta_{1t}$ by $\beta_1$ as
[TABLE]
However, the first inequality (in red) is not guaranteed because the quantity
\[ (x_{t,i} - x_i^*)^2 - (x_{t+1,i} - x_i^*)^2 \]
in (3.1.3) may be either negative or positive, as shown in Example 3.2. This is also a neglected issue in the convergence proofs in Kingma and Ba [2, Theorem 10.5], Luo et al. [5, Theorem 4], Bock et al. [6, Theorem 4.4], and Chen and Gu [7, Theorem 4.2].
Example 3.2 (for AMSGrad convergence proof).
We use the function in the Synthetic Experiment of Reddi et al. [3, Page 6]
\[ f_t(x) = \begin{cases} 1010x, & \text{if } t \bmod 101 = 1, \\ -10x, & \text{otherwise,} \end{cases} \]
with the constraint set . The optimal solution is . By the proof of [3, Theorem 1], the initial point . By Algorithm 1, , , and . We choose , , where , , and , where . Under this setting, we have , , and hence
[TABLE]
Therefore,
[TABLE]
Since , we have
[TABLE]
Hence,
[TABLE]
At , we have
[TABLE]
Therefore,
[TABLE]
Since , we obtain
[TABLE]
Hence,
[TABLE]
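The sign change discussed in this section can also be observed numerically. The sketch below runs scalar AMSGrad on the synthetic function of Reddi et al. [3], $f_t(x) = 1010x$ if $t \bmod 101 = 1$ and $f_t(x) = -10x$ otherwise, on the constraint set $[-1, 1]$ with $x^* = -1$ and $x_1 = 1$, and records the quantity $(x_t - x^*)^2 - (x_{t+1} - x^*)^2$ at every step. The hyper-parameter values ($\alpha = 0.1$, constant $\beta_{1t} = 0.9$, $\beta_2 = 0.99$) and the helper names are illustrative assumptions; they are not necessarily the values chosen in Example 3.2.

```python
import math

def f_grad(t):
    """Gradient of the synthetic loss: 1010 when t mod 101 == 1, otherwise -10."""
    return 1010.0 if t % 101 == 1 else -10.0

def amsgrad_1d(T, x1=1.0, alpha=0.1, beta1=0.9, beta2=0.99):
    """Scalar AMSGrad on the constraint set [-1, 1]; returns x_1, ..., x_{T+1}."""
    x, m, v, v_hat = x1, 0.0, 0.0, 0.0
    xs = [x]
    for t in range(1, T + 1):
        g = f_grad(t)
        m = beta1 * m + (1 - beta1) * g          # constant beta_{1t} (assumption)
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = max(v_hat, v)
        x = x - (alpha / math.sqrt(t)) * m / math.sqrt(v_hat)
        x = min(1.0, max(-1.0, x))               # projection onto [-1, 1]
        xs.append(x)
    return xs

x_star = -1.0
xs = amsgrad_1d(T=100)
# diffs[t - 1] holds (x_t - x*)^2 - (x_{t+1} - x*)^2 for t = 1, ..., 100.
diffs = [(xs[t] - x_star) ** 2 - (xs[t + 1] - x_star) ** 2
         for t in range(len(xs) - 1)]
positive_steps = [t + 1 for t, d in enumerate(diffs) if d > 0]
negative_steps = [t + 1 for t, d in enumerate(diffs) if d < 0]
```

Under these assumed settings both lists are non-empty: the large gradient at $t = 1$ moves the iterate toward $x^*$, and once the momentum term turns negative the iterate drifts away from $x^*$ again. Multiplying the quantity by a larger factor therefore does not always yield an upper bound.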
Outline of our solution. Let us rewrite (3.1.3) as
[TABLE]
Omitting the term , we obtain
[TABLE]
in which the differences with Reddi et al. [3] are highlighted in the boxes, namely, and instead of .
We suggest two ways to overcome these differences depending on the setting of :
- In Section 4: If either or , , where and , then we give a new convergence theorem for AMSGrad in Section 4.
- In Section 5: If the setting for is general, as in the statement of Theorem A, then we suggest a new (slightly modified) version for AMSGrad in Section 5.
4. New convergence theorem for AMSGrad
When either or , , where and , Theorem A can be fixed as follows, in which the upper bounds of the regret are changed.
Theorem 4.1 (Fixes for Theorem A).
Let and be the sequences obtained from Algorithm 1, , either , where , or for all and . Assume that has bounded diameter and for all and . For generated using AMSGrad (Algorithm 1), we have the following bound on the regret. Then, there is some such that AMSGrad achieves the following guarantee for all :
[TABLE]
provided , and
[TABLE]
provided .
To prove Theorem 4.1, we need the following Lemmas 4.2, 4.3, and 4.4.
Lemma 4.2.
.
Proof.
From the definition of in AMSGrad’s algorithm, it is implied that . Therefore, there is some such that . Hence,
[TABLE]
where the last inequality is by Lemma 2.4. ∎
Lemma 4.3.
If either or , then there exists some such that for every ,
[TABLE]
Proof.
Since , it is sufficient to prove that there exists some such that for every ,
[TABLE]
In other words,
[TABLE]
When , from (4.3.1) we have
[TABLE]
When , (4.3.1) has the following form
[TABLE]
Since and are smaller than , it is easy to see that when is sufficiently large, meaning that for some , the left-hand side of (4.3.2) is and the left-hand side of (4.3.3) is larger than . Therefore, (4.3.2) and (4.3.3) hold when is sufficiently large. ∎
Lemma 4.4.
For the parameter settings and conditions assumed in Theorem 4.1, we have
[TABLE]
Proof.
The proof is almost identical to that of [3, Lemma 2]. Since for all , , we have
[TABLE]
where the second inequality is by Lemma 2.3, the third inequality is from the properties of and for all , and the fourth inequality is obtained by applying Lemma 2.4 to . Therefore,
[TABLE]
where the second inequality is by Lemma 2.7. Therefore
[TABLE]
It is sufficient to consider . Firstly, can be expanded as
[TABLE]
Changing the role of as the common factor, we obtain
[TABLE]
In other words,
[TABLE]
Moreover, since
[TABLE]
where the last inequality is by Lemma 2.4, we obtain
[TABLE]
Furthermore, since
[TABLE]
where the first inequality is by Lemma 2.3 and the last inequality is by Lemma 2.5, we obtain
[TABLE]
Hence, by (4.4.1),
[TABLE]
which ends the proof. ∎
Let us now prove Theorem 4.1.
Proof of Theorem 4.1.
To prove Theorem 4.1, by Lemma 3.1, we need to bound the terms (3.1.3), (3.1.4), and (3.1.5). First, we consider (3.1.4). We have
[TABLE]
where the equality is by the assumption that and the last inequality is by Lemma 4.4. Next, we consider (3.1.5). The bound for (3.1.5) depends on either or . Recall that by assumption, for any , . If , then,
[TABLE]
where the first inequality is from Lemma 4.2 and the assumption that , the last inequality is by Lemma 2.4. If , then,
[TABLE]
where the first inequality is from Lemma 4.2 and the assumption that , and the last inequality is by Lemma 2.6.
Finally, we will bound (3.1.3). From the inequality (3) and replacing with , we obtain
[TABLE]
By Lemma 4.3, there is some such that for all . Therefore,
[TABLE]
Since
[TABLE]
we have
[TABLE]
where the second inequality is obtained by omitting the term , and the last inequality is by Lemma 4.2 and the assumption that . Summing up, if , then, from (4), (4), and (4), we obtain
[TABLE]
If , then, from (4), (4), and (4), we obtain
[TABLE]
which ends the proof. ∎
The following corollary shows that, when either or , , where and , the average regret of AMSGrad converges.
Corollary 4.5.
With the same assumption as in Theorem 4.1, AMSGrad achieves the following guarantee:
[TABLE]
Proof.
The result is obtained by using Theorem 4.1 and the following fact:
[TABLE]
where the inequality is from the assumption that for all . ∎
5. New version of AMSGrad optimizer: AdamX
Let be an arbitrary sequence of convex cost functions. If the system is kept arbitrary, as in the setting of Theorem A, to ensure that the regret satisfies , we suggest a new algorithm as follows.
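The following Python sketch illustrates the idea behind the modification, under stated assumptions. It assumes that AdamX differs from AMSGrad only in the update of $\hat{v}_t$, namely $\hat{v}_1 = v_1$ and $\hat{v}_t = \max\{ \frac{(1-\beta_{1t})^2}{(1-\beta_{1,t-1})^2} \hat{v}_{t-1}, v_t \}$ for $t \ge 2$, so that $\sqrt{\hat{v}_t} / (\alpha_t (1-\beta_{1t}))$ never decreases in $t$. The scalar setting, the function name `adamx_1d`, the schedule $\beta_{1t} = \beta_1 \lambda^{t-1}$ with the parameter `lam`, and all default values are hypothetical choices for illustration.

```python
import math

def adamx_1d(grads, x1=0.0, alpha=0.1, beta1=0.9, lam=0.5, beta2=0.99,
             lower=-1.0, upper=1.0):
    """Scalar AdamX sketch on [lower, upper].

    Returns the iterates and the values sqrt(v_hat_t) / (alpha_t * (1 - beta_1t)).
    """
    x, m, v, v_hat = x1, 0.0, 0.0, 0.0
    beta1_prev = None
    xs, ratios = [x], []
    for t, g in enumerate(grads, start=1):
        alpha_t = alpha / math.sqrt(t)
        beta1_t = beta1 * lam ** (t - 1)          # assumed schedule for beta_1t
        m = beta1_t * m + (1 - beta1_t) * g
        v = beta2 * v + (1 - beta2) * g * g
        if t == 1:
            v_hat = v
        else:
            # Rescale the previous maximum by the ratio of the (1 - beta_1t) factors.
            scale = (1 - beta1_t) ** 2 / (1 - beta1_prev) ** 2
            v_hat = max(scale * v_hat, v)
        if v_hat > 0:
            x = min(upper, max(lower, x - alpha_t * m / math.sqrt(v_hat)))
        ratios.append(math.sqrt(v_hat) / (alpha_t * (1 - beta1_t)))
        beta1_prev = beta1_t
        xs.append(x)
    return xs, ratios

# Gradients of the synthetic linear losses of Reddi et al. [3] (they do not depend on x).
grads = [1010.0 if t % 101 == 1 else -10.0 for t in range(1, 201)]
xs, ratios = adamx_1d(grads)
```

When $\beta_{1t}$ is constant the rescaling factor equals 1 and the update coincides with that of AMSGrad, which is one way to see the modification as a slight one.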
With Algorithm 2, the regret is bounded as follows.
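Theorem 5.1.
Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 2, $\alpha_t = \alpha/\sqrt{t}$, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t \in [T]$ and $\gamma = \beta_1/\sqrt{\beta_2} < 1$. Assume that $\mathcal{F}$ has bounded diameter $D_\infty$ and $\|\nabla f_t(x)\|_\infty \le G_\infty$ for all $t \in [T]$ and $x \in \mathcal{F}$. For $x_t$ generated using AdamX (Algorithm 2), we have the following bound on the regret: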
[TABLE]
To prove Theorem 5.1, we need the following Lemmas 5.2, 5.3, and 5.4.
Lemma 5.2.
For all , we have
[TABLE]
where is in Algorithm 2.
Proof.
We will prove (5.2.1) by induction on . Recall that by the update rule on , we have and if . Therefore,
[TABLE]
Assume that
[TABLE]
and that (5.2.1) holds for all . Since
[TABLE]
we have
[TABLE]
which ends the proof. ∎
Lemma 5.3.
For all , we have , where is in Algorithm 2.
Proof.
By Lemma 5.2,
[TABLE]
Therefore, there is some such that . Hence,
[TABLE]
which ends the proof. ∎
Lemma 5.4.
For the parameter settings and conditions assumed in Theorem 5.1, we have
[TABLE]
Proof.
Since for all
[TABLE]
by Lemma 5.2, we have , and hence the proof is the same as that of Lemma 4.4. ∎
Proof of Theorem 5.1.
Similarly to the proof of Theorem 4.1, we need to bound (3.1.3), (3.1.4), and (3.1.5). By using Lemma 5.4, we obtain the same bound for (3.1.4) as in the proof of Theorem 4.1, that is,
[TABLE]
where the last inequality is by Lemma 5.4. Now we bound (3.1.5). By the assumption that for any , , and , we obtain
[TABLE]
Therefore, from Lemma 5.3, we obtain
[TABLE]
Finally, we will bound (3.1.3). By the inequality (3) and replacing , we obtain
[TABLE]
Moreover, by the update rule of Algorithm 2, we have
[TABLE]
Therefore, , and hence
[TABLE]
Now by the positivity of the essential formula , we obtain
[TABLE]
where the last inequality is by Lemma 5.3. Hence we obtain the desired upper bound for $R(T)$. ∎
Corollary 5.5.
With the same assumption as in Theorem 5.1, and for all satisfying
[TABLE]
AdamX achieves the following guarantee:
[TABLE]
Proof.
By Theorem 5.1, it is sufficient to consider the term
[TABLE]
on the right-hand side of the upper bound for $R(T)$ in Theorem 5.1. Because is bounded and does not depend on , the statement follows. ∎
When either for some , or in Theorem 5.1, we obtain the following guarantee that the average regret of AdamX converges.
Corollary 5.6.
With the same assumption as in Theorem 5.1, and either for some , or , AdamX achieves the following guarantee:
[TABLE]
Proof.
By Corollary 5.5, it is sufficient to consider the term
[TABLE]
When for some , we have
[TABLE]
where the first inequality is from the property that , and the last inequality is from Lemma 2.4. When , we obtain
[TABLE]
where the last inequality is from Lemma 2.6. Now, by combining (5) and (5) with Corollary 5.5, we obtain the desired result. ∎
6. Experiments
While we consider our main contributions to be the theoretical analyses of AMSGrad and AdamX in the previous sections, we provide experimental results in this section for both optimizers. Concretely, we use the PyTorch code for AMSGrad (https://pytorch.org/docs/stable/_modules/torch/optim/adam.html) by setting the boolean flag amsgrad = True. The code for AdamX is based on that of AMSGrad, with corresponding modifications as in Algorithm 2. The parameters for AMSGrad and AdamX are identical in our experiments, namely , the term added to the denominator to improve numerical stability is , and additionally we set with to make use of Corollary 5.6 on the convergence of AdamX.
The learning rate is scheduled for both optimizers AMSGrad and AdamX as follows: , , , , if the epoch is correspondingly in the ranges , , , , . We use CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), containing 50000 training images and 10000 test images of size 32 × 32, as the dataset and the residual networks ResNet18 [8] and PreActResNet18 [9] for training with batch size 128. The testing result is given in Figure 1, where one can see that AMSGrad and AdamX behave similarly, which supports our theoretical results on the convergence of both AMSGrad (Section 4) and AdamX (Section 5).
7. Conclusion
We have shown that the convergence proof of AMSGrad [3] is problematic, and presented various fixes for it, which include a new and slightly modified version called AdamX. Along the way, we also observe that the issue has been neglected in various works such as [2, Theorem 10.5], [5, Theorem 4], [6, Theorem 4.4], and [7, Theorem 4.2]. Our work helps ensure the theoretical foundation of those optimizers.
References
- [1] Herbert Robbins and Sutton Monro, "A stochastic approximation method", The Annals of Mathematical Statistics, vol. 22, no. 3, 1951, pp. 400–407.
- [2] Diederik P. Kingma and Jimmy Ba. (2015). Adam: A method for stochastic optimization. Presented at International Conference on Learning Representations (ICLR). [Online]. Available: https://arxiv.org/pdf/1412.6980.pdf
- [3] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. (2018). On the convergence of Adam and beyond. Presented at International Conference on Learning Representations (ICLR). [Online]. Available: https://openreview.net/pdf?id=ryQu7f-RZ
- [4] H. Brendan McMahan and Matthew Streeter, "Adaptive Bound Optimization for Online Convex Optimization", Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010, pp. 244–256. Also in CoRR, abs/1002.4908, 2010. [Online]. Available: https://arxiv.org/abs/1002.4908
- [5] Liangchen Luo, Yuanhao Xiong, and Yan Liu. (2019). Adaptive gradient methods with dynamic bound of learning rate. Presented at International Conference on Learning Representations (ICLR). [Online]. Available: https://openreview.net/pdf?id=Bkg3g2R9FX
- [6] Sebastian Bock, Josef Goppold, and Martin Weiß, "An improvement of the convergence proof of the Adam-optimizer", CoRR, abs/1804.10587, 2018. [Online]. Available: https://arxiv.org/pdf/1804.10587.pdf
- [7] Jinghui Chen and Quanquan Gu, "Closing the generalization gap of adaptive gradient methods in training deep neural networks", CoRR, abs/1806.06763, 2018. [Online]. Available: https://arxiv.org/pdf/1806.06763.pdf
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", CVPR 2016, pp. 770–778, 2016. [Online]. Available: https://arxiv.org/abs/1512.03385
