Online Learning-guided Learning Rate Adaptation via Gradient Alignment

Ruichen Jiang; Ali Kavis; Aryan Mokhtari

arXiv:2506.08419·cs.LG·June 11, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces GALA, a gradient alignment-based framework for adaptively tuning learning rates in deep learning, reducing the need for extensive hyperparameter search and improving optimizer robustness.

Contribution

GALA provides a principled, gradient alignment-based method for dynamic learning rate adaptation, supported by convergence analysis and empirical validation across various optimizers.

Findings

1. GALA adapts learning rates effectively across different initial settings.

2. Augmenting optimizers with GALA improves robustness and performance.

3. The method achieves competitive results without extensive hyperparameter tuning.

Abstract

The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate, often requiring an extensive grid search over base learning rates, schedules, and other hyperparameters. In this paper, we propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which dynamically adjusts the learning rate by tracking the alignment between consecutive gradients and using a local curvature estimate. Guided by the convergence analysis, we formulate the problem of selecting the learning rate as a one-dimensional online learning problem. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule that tends to increase when consecutive gradients are aligned and decrease otherwise. We establish a data-adaptive convergence rate…
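To make the abstract's idea concrete, here is a minimal sketch of a learning rate chosen by a one-dimensional Follow-the-Regularized-Leader (FTRL) update. It is not the paper's algorithm: the surrogate loss `-a_t * eta + 0.5 * b_t * eta**2`, the secant-style curvature estimate, the plain gradient-descent base optimizer, and the names `ftrl_learning_rate`, `run_gd_with_alignment_lr`, `delta`, and `eta_max` are all assumptions made for illustration. The sketch only reproduces the qualitative behavior described above: the learning rate tends to grow when consecutive gradients align and shrink otherwise.

```python
import numpy as np


def ftrl_learning_rate(align_sum, curv_sum, delta, eta_max):
    """Closed-form FTRL step for an ASSUMED quadratic surrogate.

    Minimizes sum_s(-a_s * eta + 0.5 * b_s * eta**2) + 0.5 * delta * eta**2
    over eta in [0, eta_max], where align_sum = sum_s a_s and
    curv_sum = sum_s b_s. The minimizer is the clipped ratio below.
    """
    eta = align_sum / (delta + curv_sum)
    return float(np.clip(eta, 0.0, eta_max))


def run_gd_with_alignment_lr(grad_fn, x0, steps=200, eta0=1e-3,
                             delta=1.0, eta_max=10.0):
    """Gradient descent whose step size is set by the FTRL rule above.

    Illustrative only: a_t is the inner product of consecutive gradients,
    and b_t is a local curvature estimate times the squared gradient norm.
    """
    x = np.asarray(x0, dtype=float)
    g = grad_fn(x)
    eta = eta0
    align_sum, curv_sum = 0.0, 0.0
    etas = []
    for _ in range(steps):
        etas.append(eta)
        x_new = x - eta * g
        g_new = grad_fn(x_new)

        # Alignment between consecutive gradients (positive when aligned).
        a = float(g @ g_new)

        # Secant estimate of local curvature along the step just taken.
        step_norm = np.linalg.norm(x_new - x)
        local_curv = np.linalg.norm(g_new - g) / step_norm if step_norm > 0 else 0.0
        b = local_curv * float(g @ g)

        align_sum += a
        curv_sum += b
        eta = ftrl_learning_rate(align_sum, curv_sum, delta, eta_max)
        x, g = x_new, g_new
    return x, etas


if __name__ == "__main__":
    # Toy ill-conditioned quadratic f(x) = 0.5 * sum(A * x**2).
    A = np.array([1.0, 10.0])
    x_final, etas = run_gd_with_alignment_lr(lambda z: A * z, [1.0, 1.0])
    print("final loss:", 0.5 * np.sum(A * x_final**2))
    print("first and last learning rates:", etas[0], etas[-1])
```

On the toy quadratic, the step size starts from a deliberately small `eta0` and is raised after the first step because the first two gradients are almost parallel. The paper's actual surrogate, optimizer, and constants should be taken from the paper itself.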

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper discusses the related work in detail and attributes the ideas properly. 2. The setting for the experiments is reasonable, though limited to vision datasets. At least I appreciate that the authors didn't stop at running CIFAR-10 evaluations. 3. The authors compared to quite a few other adaptive optimizers, which helps show the strength of the evaluations. 4. The practical implementation does not use many extra hyperparameters, and the method doesn't seem to be sensitive to th

Weaknesses

1. The assumption that the averages of Lipschitz constants are lower bounded essentially requires the eigenvalues of the Hessian matrix to be separated from 0, which means no saddle points. I think it is very strong and should be avoided if the authors are truly interested in nonconvex optimization. As a result, I don't think combining Theorem 1 and Lemma 2 is a valid way to compare to the results of Cutkosky and Mehta (2020). Similarly, the results can't be compared to those of AdaGrad.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper presents a principled intuition by casting learning-rate selection as a one-dimensional online learning problem with an explicit surrogate derived from the gradient alignment identity and a local curvature penalty (Eq. (5)), leading to clean, closed-form FTRL updates (Eq. (7)). 2. The theoretical analysis targets normalized SGD with momentum and clearly shows the dependence on an along-trajectory smoothness notion (Theorem 1), complemented by a regret bound (Lemma 2). 3. The algor

Weaknesses

Regarding the theoretical results: 1. Although not explicitly framed as assumptions, strong assumptions are required for Theorem 1's bound to converge. (1) Theorem 1’s convergence requires the online regret on a surrogate whose quadratic term is strongly convex if $L_t$ and $\tilde L_t$ admit uniform lower bounds, which is restrictive in stochastic nonconvex settings and stronger than assuming only $F$ is $L$-smooth. This basically requires the function $f(\cdot; \xi)$ to be strongly-convex, an

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The authors are able to put together an intricate analysis combining many ideas (online-to-nonconvex ideas like Cutkosky et al, surrogate losses inspired by Zhuang et al, a hypergradient-like update rule viewed as an online learning algorithm for learning the step size, as Gao et al and Chu et al do for convex losses...) resulting in provable rates for this kind of consecutive stochastic gradient alignment technique for learning rate tuning, in the stochastic nonconvex smooth setting with boun

Weaknesses

The algorithm requires knowledge of the gradient estimates' variance as well as the final iteration T, via $\alpha$. The extended related work is mostly the same as the one in the main paper. This is too redundant. Also, hypergradient descent literature in the related work section in the appendix was not even mentioned in the main paper. Given how relevant and related this is to this paper, including the theory works of Gao et al (2024) and Chu et al (2025), I'd urge the authors to have this pa

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Bandit Algorithms Research

Methods: Stochastic Gradient Descent · Global-and-Local attention · Balanced Selection · Adam