Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu

TL;DR
This paper introduces AdaGrad++ and Adam++, simple parameter-free variants of AdaGrad and Adam that come with formal convergence guarantees and perform competitively in deep learning tasks.
Contribution
It proposes novel, simple, parameter-free optimization algorithms with proven convergence guarantees for convex and deep learning applications.
Findings
AdaGrad++ achieves convergence rates comparable to AdaGrad.
Adam++ matches Adam's convergence rate without learning rate conditions.
Experimental results show competitive performance across deep learning tasks.
Abstract
Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad hoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing "learning-rate-free" or "parameter-free" algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly,…
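For readers unfamiliar with the baseline, the following is a minimal sketch of the textbook diagonal AdaGrad update that the abstract refers to. It is not the paper's AdaGrad++, whose update rule is not given in this summary. The function names, the toy quadratic objective, and the default values (`lr=0.5`, `steps=200`, `eps=1e-8`) are assumptions chosen for illustration. The hand-tuned base learning rate `lr` is the quantity that parameter-free variants aim to remove.

```python
import math

def adagrad_minimize(grad, x0, lr=0.5, steps=200, eps=1e-8):
    """Textbook diagonal AdaGrad; `lr` is the hand-tuned base learning rate."""
    x = list(x0)
    s = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            s[i] += g[i] * g[i]
            # Per-coordinate step lr / sqrt(s_i) shrinks as gradients accumulate.
            x[i] -= lr * g[i] / (math.sqrt(s[i]) + eps)
    return x

# Toy convex objective f(x) = 0.5 * sum(c_i * x_i^2) (illustrative assumption).
def quad_grad(x, c=(1.0, 10.0)):
    return [ci * xi for ci, xi in zip(c, x)]

x_final = adagrad_minimize(quad_grad, [1.0, -1.0])
print(x_final)  # both coordinates should be very close to 0
```

Even in this toy setting the result depends on the choice of `lr`, which is the tuning burden the abstract describes.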
Peer Reviews
Decision · Submitted to ICLR 2026
The problem of finding optimizers that work out of the box and do not require tuning is important and relevant to the ICLR community. The resulting algorithm is indeed simple, and the presentation of the algorithms is clear.
My main issues with the submission are that the theoretical guarantees appear weak, and that the motivation for the proposed method is unclear. **Theoretical guarantees are weak,** or I have a very hard time understanding them. The submission claims that the rates in Thm 4.2 and 5.1 are sublinear in the worst case because $\vert s \vert$ grows in $\sqrt{T}$ in the worst case, but this seems to ignore the dependence on other quantities that depend on $t$, such as the maximum distance t
- clear statement and valuable proofs
- generally good empirical results

- no empirical assessment of the robustness of the proposed algorithms against parameter changes (or rescalings).
- proofs say very little about the parameter free nature of the proposed algorithms (too hard maybe?).
- no clear discussion of the optimization effects vs the implicit regularization effects.
- (minor) lots of distracting typos "entry-wisely" "AdaGard"

- Their proposed optimizers can be implemented very easily by slightly modifying the original AdaGrad and Adam.
- Presentation of the paper is clear and easy to read.
- Although existing parameter-free optimizers tend to underperform well-tuned Adam, Adam++ seems to work well, showing competitive performance with the original Adam without learning-rate tuning.
- As far as I checked, there were no mathematical flaws in the proofs of their theorems.
### Experimental settings are limited to image classification
I would like to see results in other settings, such as language modeling, reinforcement learning, generative modeling (e.g., diffusion models), etc.
### Lack of Theoretical Results in Non-convex settings
Theoretical analysis is only provided for convex settings, but there are many convergence analyses of Adam/AdaGrad in smooth non-convex settings. These optimizers are mainly used for deep learning models, whose objective functi
Taxonomy
Topics: Advanced Optimization Algorithms Research · Numerical methods in inverse problems · Radiative Heat Transfer Studies
Methods: Adam · AdaGrad
