AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
Wenjie Li, Zhaoyang Zhang, Xinjiang Wang, Ping Luo

TL;DR
AdaX is a new adaptive gradient descent algorithm that improves upon Adam by exponentially accumulating long-term gradient information, leading to better convergence and performance in machine learning tasks.
Contribution
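The paper introduces AdaX, an optimizer that addresses Adam's limitations by incorporating long-term gradient memory, with convergence proofs in both the convex and non-convex settings and empirical results that surpass Adam on computer vision and natural language processing tasks.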
Findings
AdaX outperforms Adam in vision and NLP tasks.
AdaX converges faster and more reliably than Adam.
AdaX matches the performance of SGD in various tasks.
Abstract
Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem with Adam by analyzing its performance on a simple non-convex synthetic problem, showing that Adam's fast convergence can lead the algorithm to local minima. To address this problem, we improve Adam by proposing a novel adaptive gradient descent algorithm named AdaX. Unlike Adam, which ignores past gradients, AdaX exponentially accumulates long-term gradient information during training to adaptively tune the learning rate. We thoroughly prove the convergence of AdaX in both the convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam in various tasks of computer vision and natural language processing and can catch up with Stochastic Gradient Descent.
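To make the "exponential long-term memory" idea concrete, here is a minimal Python sketch. The abstract above does not spell out the update rule, so the form used here is an assumption based on a common reading of AdaX: a second moment v_t = (1 + beta2) * v_{t-1} + beta2 * g_t^2, normalized by (1 + beta2)^t - 1, and a first moment without bias correction. The function names and the hyperparameter values (lr, beta1, beta2, eps) are illustrative, not taken from this page.

```python
import math


def adam_v_hat(grads, beta2=0.999):
    """Bias-corrected Adam second moment: an exponential moving average,
    so the weight of an old squared gradient decays as beta2**age."""
    v, steps = 0.0, 0
    for g in grads:
        steps += 1
        v = beta2 * v + (1.0 - beta2) * g * g
    return v / (1.0 - beta2 ** steps) if steps else 0.0


def adax_v_hat(grads, beta2=1e-4):
    """Assumed AdaX second moment: the accumulator grows by (1 + beta2)
    each step, so old squared gradients are never discounted."""
    v, steps = 0.0, 0
    for g in grads:
        steps += 1
        v = (1.0 + beta2) * v + beta2 * g * g
    # Normalizing by (1 + beta2)**t - 1 makes the weights sum to one.
    return v / ((1.0 + beta2) ** steps - 1.0) if steps else 0.0


def adax_step(x, g, m, v, t, lr=0.005, beta1=0.9, beta2=1e-4, eps=1e-12):
    """One scalar parameter update under the assumed AdaX rule (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * g           # first moment, no bias correction
    v = (1.0 + beta2) * v + beta2 * g * g       # long-term second moment
    v_hat = v / ((1.0 + beta2) ** t - 1.0)      # normalized second moment
    x = x - lr * m / (math.sqrt(v_hat) + eps)   # adaptive step
    return x, m, v


# One large early gradient followed by many small ones.
grads = [10.0] + [0.1] * 4999
print("Adam v_hat:", adam_v_hat(grads))
print("AdaX v_hat:", adax_v_hat(grads))
```

Under these assumptions, the early spike has almost vanished from Adam's estimate after 5000 steps, while the AdaX estimate still reflects it. The denominator of the update therefore stays larger and the effective step size smaller. That is one way to read the abstract's claim that AdaX retains past gradient information that Adam discards.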
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques
Methods: Adam
