Towards Simple and Provable Parameter-Free Adaptive Gradient Methods

Yuanzhe Tao; Huizhuo Yuan; Xun Zhou; Yuan Cao; Quanquan Gu

arXiv:2412.19444·cs.LG·December 30, 2024


Open Access · 3 Reviews

TL;DR

This paper introduces AdaGrad++ and Adam++, simple parameter-free variants of AdaGrad and Adam that come with formal convergence guarantees and perform competitively in deep learning tasks.

Contribution

It proposes novel, simple, parameter-free optimization algorithms with proven convergence guarantees for convex and deep learning applications.

Findings

01

AdaGrad++ achieves convergence rates comparable to AdaGrad.

02

Adam++ matches Adam's convergence rate without learning rate conditions.

03

Experimental results show competitive performance across deep learning tasks.

Abstract

Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad hoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing "learning-rate-free" or "parameter-free" algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly,…
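The abstract as shown here is truncated and does not give the update rules, so the following is only a hedged sketch of the general idea behind learning-rate-free AdaGrad-style methods: keep AdaGrad's per-coordinate normalization, but replace the hand-tuned learning rate with a quantity derived from how far the iterates have moved from the starting point. The specific rule `eta = r / sqrt(d)`, the small initial value `eps`, and the function name `adagrad_distance_scaled` are assumptions made for illustration and should not be read as the paper's exact AdaGrad++ algorithm.

```python
import math

def adagrad_distance_scaled(grad, x0, steps=500, eps=1e-6):
    """AdaGrad-style descent with a distance-based step size (illustrative only)."""
    d = len(x0)
    x = list(x0)
    s = [0.0] * d   # per-coordinate running sum of squared gradients (as in AdaGrad)
    r = eps         # largest distance from x0 seen so far; eps is an assumed floor
    for _ in range(steps):
        g = grad(x)
        s = [si + gi * gi for si, gi in zip(s, g)]
        eta = r / math.sqrt(d)  # assumed rule: no user-supplied learning rate
        x = [xi - eta * gi / (math.sqrt(si) + 1e-12)
             for xi, gi, si in zip(x, g, s)]
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        r = max(r, dist)
    return x

def quadratic_grad(x):
    # Gradient of 0.5 * ||x - target||^2 for a hypothetical target point.
    target = [3.0, -2.0]
    return [xi - ti for xi, ti in zip(x, target)]

x_final = adagrad_distance_scaled(quadratic_grad, [0.0, 0.0])
print(x_final)
```

On this toy convex problem the step size starts tiny and grows on its own as the iterates move away from the initial point, which is the behaviour the "parameter-free" label refers to; no learning rate is passed in by the caller.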

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The problem of finding optimizers that work out of the box and do not require tuning is important and relevant to the ICLR community. The resulting algorithm is indeed simple, and the presentation of the algorithms is clear.

Weaknesses

My main issues with the submission are that the theoretical guarantees appear weak, and that the motivation for the proposed method is unclear.

**Theoretical guarantees are weak,** or I have a very hard time understanding them. The submission claims that the rates in Thm 4.2 and 5.1 are, in the worst case, sublinear because $\vert s \vert$ grows as $\sqrt{T}$ in the worst case, but this seems to ignore the dependence on other quantities that depend on $t$, such as the maximum distance t

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- clear statement and valuable proofs
- generally good empirical results

Weaknesses

- no empirical assessment of the robustness of the proposed algorithms against parameter changes (or rescalings).
- proofs say very little about the parameter-free nature of the proposed algorithms (too hard maybe?).
- no clear discussion of the optimization effects vs the implicit regularization effects.
- (minor) lots of distracting typos, e.g. "entry-wisely", "AdaGard"

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Their proposed optimizers can be implemented very easily by slightly modifying the original AdaGrad and Adam.
- Presentation of the paper is clear and easy to read.
- Although existing parameter-free optimizers tend to underperform well-tuned Adam, Adam++ seems to work well, showing competitive performance with the original Adam without learning-rate tuning.
- As far as I checked, there were no mathematical flaws in the proofs of their theorems.

Weaknesses

### Experimental settings are limited to image classification

I would like to see results in other settings, such as language modeling, reinforcement learning, generative modeling (e.g., diffusion models), etc.

### Lack of Theoretical Results in Non-convex settings

Theoretical analysis is only provided for convex settings, but there are many convergence analyses of Adam/AdaGrad in smooth non-convex settings. These optimizers are mainly used for deep learning models, whose objective functi


Taxonomy

Topics: Advanced Optimization Algorithms Research · Numerical methods in inverse problems · Radiative Heat Transfer Studies

Methods: Adam · AdaGrad