AdaGrad under Anisotropic Smoothness
Yuxing Liu, Rui Pan, Tong Zhang

TL;DR
This paper introduces a new anisotropic smoothness assumption to analyze AdaGrad, demonstrating its faster convergence in large-scale deep learning tasks under realistic conditions, bridging the gap between theory and practice.
Contribution
It proposes a novel anisotropic smoothness framework and provides theoretical analysis showing AdaGrad's improved convergence guarantees in large batch settings.
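As a rough illustration of what a coordinate-wise ("anisotropic") smoothness condition can look like, the sketch below replaces the single smoothness constant of the standard analysis with one constant per coordinate. The notation (the vector $L = (L_1, \dots, L_d)$ and the weighted quadratic term) is an assumption made here for illustration; the paper's exact assumption may be stated differently.

```latex
% Illustrative coordinate-wise smoothness bound (assumed notation, not quoted from the paper):
% each coordinate i has its own constant L_i > 0.
\[
  f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle
        + \frac{1}{2} \sum_{i=1}^{d} L_i \,(y_i - x_i)^2
  \qquad \text{for all } x, y \in \mathbb{R}^d .
\]
% Setting L_i = L for every i recovers the usual quadratic upper bound
% implied by standard L-smoothness:
\[
  f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\, \lVert y - x \rVert_2^2 .
\]
```

When the $L_i$ differ by orders of magnitude, a single uniform constant has to be as large as the worst coordinate, which is the situation where per-coordinate step sizes could plausibly help.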
Findings
AdaGrad achieves better dimensional dependence under anisotropic smoothness.
Experimental results support the new smoothness assumption and theoretical findings.
Analysis applies to logistic regression and instruction fine-tuning tasks.
Abstract
Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g., SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate this benefit was obtained in the original AdaGrad paper for convex nonsmooth objective functions, which is insufficient for large-batch algorithms. In this work, we attempt to resolve this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing corresponding analyses of AdaGrad. It is shown that under anisotropic smoothness and noise conditions, AdaGrad can achieve faster…
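The contrast between a uniform step size and per-coordinate adaptive step sizes can be sketched on a toy problem. The example below is a minimal, deterministic illustration and not the paper's experimental setup: the quadratic objective, the curvature values in `L`, the step sizes, and the function names are all assumptions chosen for illustration. It runs plain gradient descent with a single step size limited by the largest curvature, and diagonal AdaGrad, which divides each coordinate's gradient by the square root of its accumulated squared gradients.

```python
import math

def loss(x, L):
    # Toy anisotropic quadratic: f(x) = 0.5 * sum_i L[i] * x[i]^2 (assumed objective).
    return 0.5 * sum(l * xi * xi for l, xi in zip(L, x))

def grad(x, L):
    # Gradient of the quadratic above: component i is L[i] * x[i].
    return [l * xi for l, xi in zip(L, x)]

def gradient_descent(x0, L, lr, steps):
    # Uniform step size: the same lr is applied to every coordinate.
    x = list(x0)
    for _ in range(steps):
        g = grad(x, L)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def adagrad(x0, L, lr, steps, eps=1e-12):
    # Diagonal AdaGrad: each coordinate is scaled by the square root of
    # its own running sum of squared gradients.
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x, L)
        v = [vi + gi * gi for vi, gi in zip(v, g)]
        x = [xi - lr * gi / (math.sqrt(vi) + eps)
             for xi, gi, vi in zip(x, g, v)]
    return x

# Curvatures spanning four orders of magnitude (illustrative values).
L = [100.0, 1.0, 0.01]
x0 = [1.0, 1.0, 1.0]
initial_loss = loss(x0, L)

# Gradient descent uses lr = 1 / max(L), the largest stable choice here
# that still makes progress on the steepest coordinate.
gd_loss = loss(gradient_descent(x0, L, lr=1.0 / max(L), steps=100), L)
ada_loss = loss(adagrad(x0, L, lr=0.5, steps=100), L)

print("initial loss:", initial_loss)
print("gradient descent loss:", gd_loss)
print("AdaGrad loss:", ada_loss)
```

On this toy objective, AdaGrad's normalization cancels each coordinate's curvature, so all coordinates shrink at the same rate, while gradient descent makes slow progress along the low-curvature directions. This is only meant to convey the intuition; it says nothing about stochastic noise or batch size, which are central to the paper's analysis.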
Peer Reviews
Decision: ICLR 2025 Poster
The main strength lies in its novel anisotropic assumptions, which align well with AdaGrad’s observed performance in high-dimensional settings. The experiments effectively validate the theoretical benefits, highlighting AdaGrad’s adaptability to large batch sizes and diverse data structures. Otherwise, it is a standard optimization analysis.
This kind of work always relies on assumptions, which limits its applicability to settings of interest such as neural networks. However, this is common and not really an issue. See also questions.
The work provides an analysis that may be the first of its kind for AdaGrad. This can help others understand the potential of AdaGrad and select optimizers for training tasks.
The numerical results are not sufficient to verify the assumptions and analytic results.
See summary
Taxonomy
Topics: Mathematical Approximation and Integration
Methods: AdaGrad · Logistic Regression · Stochastic Gradient Descent
