Online Learning-guided Learning Rate Adaptation via Gradient Alignment
Ruichen Jiang, Ali Kavis, Aryan Mokhtari

TL;DR
This paper introduces GALA, a gradient alignment-based framework for adaptively tuning learning rates in deep learning, reducing the need for extensive hyperparameter search and improving optimizer robustness.
Contribution
GALA provides a principled, gradient alignment-based method for dynamic learning rate adaptation, supported by convergence analysis and empirical validation across various optimizers.
Findings
GALA adapts learning rates effectively across different initial settings.
Augmenting optimizers with GALA improves robustness and performance.
The method achieves competitive results without extensive hyperparameter tuning.
Abstract
The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate, often requiring an extensive grid search over base learning rates, schedules, and other hyperparameters. In this paper, we propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which dynamically adjusts the learning rate by tracking the alignment between consecutive gradients and using a local curvature estimate. Guided by the convergence analysis, we formulate the problem of selecting the learning rate as a one-dimensional online learning problem. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule that tends to increase when consecutive gradients are aligned and decrease otherwise. We establish a data-adaptive convergence rate…
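The mechanism described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the surrogate loss, the secant curvature estimate, the quadratic regularizer weight `delta`, the cap `eta_max`, and the use of plain gradient descent (rather than the normalized SGD with momentum mentioned in the reviews) are all assumptions made here for illustration. The assumed per-step surrogate is $\ell_t(\eta) = -\eta \langle g_{t+1}, g_t \rangle + \tfrac{1}{2} \hat{L}_t \eta^2 \|g_t\|^2$, and Follow-the-Regularized-Leader with regularizer $\tfrac{\delta}{2}\eta^2$ then has a closed-form minimizer, clipped to $[0, \eta_{\max}]$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def ftrl_learning_rate(align_sum, curv_sum, delta=1.0, eta_max=1.0):
    """Minimize sum_s l_s(eta) + (delta/2) * eta^2 over [0, eta_max].

    align_sum: running sum of <g_{s+1}, g_s> (gradient alignment).
    curv_sum:  running sum of L_hat_s * ||g_s||^2 (curvature penalty).
    Both the surrogate and the regularizer are illustrative assumptions.
    """
    eta = align_sum / (delta + curv_sum)
    return min(max(eta, 0.0), eta_max)


def gd_with_alignment_lr(grad, x0, steps=200, eta0=1e-3, delta=1.0, eta_max=1.0):
    """Gradient descent whose step size is chosen by the FTRL rule above."""
    x = list(x0)
    g = grad(x)
    eta = eta0
    align_sum = 0.0
    curv_sum = 0.0
    etas = []
    for _ in range(steps):
        x_new = [xi - eta * gi for xi, gi in zip(x, g)]
        g_new = grad(x_new)

        # Secant estimate of local curvature along the step (an assumption).
        dx = [a - b for a, b in zip(x_new, x)]
        dg = [a - b for a, b in zip(g_new, g)]
        step_norm = math.sqrt(dot(dx, dx))
        curvature = math.sqrt(dot(dg, dg)) / step_norm if step_norm > 0 else 0.0

        # Aligned consecutive gradients raise the numerator and hence eta;
        # misaligned gradients and high curvature lower it.
        align_sum += dot(g_new, g)
        curv_sum += curvature * dot(g, g)
        eta = ftrl_learning_rate(align_sum, curv_sum, delta, eta_max)

        etas.append(eta)
        x, g = x_new, g_new
    return x, etas
```

As a usage example, calling `gd_with_alignment_lr(lambda x: [x[0], 10.0 * x[1]], [1.0, 1.0])` minimizes a simple ill-conditioned quadratic starting from a deliberately small `eta0`; the returned `etas` list shows how the step size moves away from that initial value.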
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper discusses the related work in detail and attributes the ideas properly. 2. The setting for the experiments is reasonable, though limited to vision datasets. At least I appreciate that the authors didn't stop at running CIFAR-10 evaluations. 3. The authors compared to quite a few other adaptive optimizers, which helps show the strength of the evaluations. 4. The practical implementation does not use many extra hyperparameters, and the method doesn't seem to be sensitive to th
1. The assumption that the averages of Lipschitz constants are lower bounded essentially requires the eigenvalues of the Hessian matrix to be separated from 0, which means no saddle points. I think it is very strong and should be avoided if the authors are truly interested in nonconvex optimization. As a result, I don't think combining Theorem 1 and Lemma 2 is a valid way to compare to the results of Cutkosky and Mehta (2020). Similarly, the results can't be compared to those of AdaGrad.
1. The paper presents a principled intuition by casting learning-rate selection as a one-dimensional online learning problem with an explicit surrogate derived from the gradient alignment identity and a local curvature penalty (Eq. (5)), leading to clean, closed-form FTRL updates (Eq. (7)). 2. The theoretical analysis targets normalized SGD with momentum and clearly shows the dependence on an along-trajectory smoothness notion (Theorem 1), complemented by a regret bound (Lemma 2). 3. The algor
Regarding the theoretical results: 1. Although not explicitly framed as assumptions, strong assumptions are required for Theorem 1's bound to converge. (1) Theorem 1's convergence requires the online regret on a surrogate whose quadratic term is strongly convex if $L_t$ and $\tilde L_t$ admit uniform lower bounds, which is restrictive in stochastic nonconvex settings and stronger than assuming only $F$ is $L$-smooth. This basically requires the function $f(\cdot; \xi)$ to be strongly convex, an
The authors are able to put together an intricate analysis combining many ideas (online-to-nonconvex ideas like Cutkosky et al., surrogate losses inspired by Zhuang et al., a hypergradient-like update rule viewed as an online learning algorithm for learning the step size, as Gao et al. and Chu et al. do for convex losses...), resulting in provable rates for this kind of consecutive stochastic gradient alignment technique for learning rate tuning, in the stochastic nonconvex smooth setting with boun
The algorithm requires knowledge of the gradient estimates' variance as well as the final iteration T, via $\alpha$. The extended related work is mostly the same as the one in the main paper. This is too redundant. Also, hypergradient descent literature in the related work section in the appendix was not even mentioned in the main paper. Given how relevant and related this is to this paper, including the theory works of Gao et al (2024) and Chu et al (2025), I'd urge the authors to have this pa
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Bandit Algorithms Research
Methods: Stochastic Gradient Descent · Global-and-Local attention · Balanced Selection · Adam
