AdaGrad under Anisotropic Smoothness

Yuxing Liu; Rui Pan; Tong Zhang

arXiv:2406.15244 · cs.LG · October 15, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a new anisotropic smoothness assumption to analyze AdaGrad, demonstrating its faster convergence in large-scale deep learning tasks under realistic conditions, bridging the gap between theory and practice.

Contribution

It proposes a novel anisotropic smoothness framework and provides theoretical analysis showing AdaGrad's improved convergence guarantees in large batch settings.
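For orientation, a coordinate-wise smoothness condition of this kind is often written in the form sketched below. The vector $L$, the weighted norm $\|\cdot\|_L$, and the exact shape of the inequality are assumptions made here for illustration; the paper's own statement may differ in detail.

```latex
% Illustrative form only; notation is assumed, not quoted from the paper.
\[
  \bigl| f(y) - f(x) - \langle \nabla f(x),\, y - x \rangle \bigr|
  \;\le\; \tfrac{1}{2}\, \| y - x \|_{L}^{2},
  \qquad
  \| v \|_{L}^{2} := \sum_{i=1}^{d} L_i\, v_i^{2},
  \quad
  L = (L_1, \dots, L_d) \in \mathbb{R}^{d}_{+}.
\]
```

When every $L_i$ equals the same constant, the weighted norm reduces to a scaled Euclidean norm and the bound becomes the familiar quadratic bound associated with standard smoothness.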

Findings

01. AdaGrad achieves better dimensional dependence under anisotropic smoothness.

02. Experimental results support the new smoothness assumption and theoretical findings.

03. Analysis applies to logistic regression and instruction fine-tuning tasks.
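As a rough illustration of why per-coordinate step sizes can help when curvature differs across coordinates, the sketch below compares the diagonal AdaGrad update with gradient descent using one shared step size. Everything in it is an assumption for demonstration: the toy quadratic, the curvature values, the step sizes, and the function names are hypothetical, the gradients are exact rather than stochastic, and none of it reproduces the paper's experiments.

```python
import math


def loss(x, curvatures):
    """Separable quadratic 0.5 * sum_i c_i * x_i^2 (toy objective, assumed)."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))


def grad(x, curvatures):
    """Exact gradient of the toy quadratic: c_i * x_i per coordinate."""
    return [c * xi for c, xi in zip(curvatures, x)]


def run_gd(x0, curvatures, lr, steps):
    """Gradient descent with one step size shared by all coordinates."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x, curvatures)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def run_adagrad(x0, curvatures, lr, steps, eps=1e-12):
    """Diagonal AdaGrad: each coordinate is scaled by the square root
    of its own accumulated squared gradients."""
    x = list(x0)
    accum = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x, curvatures)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        x = [
            xi - lr * gi / (math.sqrt(a) + eps)
            for xi, gi, a in zip(x, g, accum)
        ]
    return x


if __name__ == "__main__":
    curvatures = [100.0, 1.0, 0.01]  # strongly anisotropic curvature (assumed)
    x0 = [1.0, 1.0, 1.0]
    # The shared step size is limited by the largest curvature.
    x_gd = run_gd(x0, curvatures, lr=1.0 / max(curvatures), steps=50)
    x_ada = run_adagrad(x0, curvatures, lr=0.5, steps=50)
    print("GD loss:     ", loss(x_gd, curvatures))
    print("AdaGrad loss:", loss(x_ada, curvatures))
```

In this toy setting the shared step size must stay small enough for the steepest coordinate, so the flat coordinates move slowly, while AdaGrad's normalization makes each coordinate's progress largely independent of its curvature. This is only an intuition aid and says nothing about the paper's convergence rates.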

Abstract

Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g., SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate this benefit was obtained in the original AdaGrad paper for convex nonsmooth objective functions, which is insufficient for large-batch algorithms. In this work, we attempt to resolve this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing corresponding analyses of AdaGrad. It is shown that under anisotropic smoothness and noise conditions, AdaGrad can achieve faster…

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper's main strength lies in its novel anisotropic assumptions, which align well with AdaGrad’s observed performance in high-dimensional settings. The experiments effectively validate the theoretical benefits, highlighting AdaGrad’s adaptability to large batch sizes and diverse data structures. Otherwise, it is a standard optimization analysis.

Weaknesses

This kind of work always relies on assumptions, which limits its applicability to settings of interest such as neural networks. However, this is common and not really an issue. See also questions.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The work provides a theoretical result that may be the first of its kind for AdaGrad. This can help others understand the potential of AdaGrad and select optimizers for training tasks.

Weaknesses

The numerical results are not sufficient to verify the assumptions and analytic results.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

See summary

Weaknesses

See summary

Taxonomy

Topics: Mathematical Approximation and Integration

Methods: AdaGrad · Logistic Regression · Stochastic Gradient Descent