Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

Aaron Defazio; Baoyu Zhou; Lin Xiao

arXiv:2206.06900 · cs.LG · June 15, 2022

Open Access

TL;DR

GradaGrad is a non-monotone adaptive stochastic gradient method whose learning rate can both increase and decrease, overcoming AdaGrad's limitation that step sizes can only decrease over time.

Contribution

The paper proposes GradaGrad, an adaptive gradient method in the AdaGrad family. Its denominator uses an accumulation that can both increase and decrease, so the learning rate can grow or shrink during optimization instead of only shrinking.

Findings

01. Obeys a convergence rate similar to that of AdaGrad

02. Demonstrates non-monotone adaptation of the learning rate in experiments

03. Lets the learning rate both grow and shrink, whereas AdaGrad can only decrease step sizes

Abstract

The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
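The monotonicity that the abstract attributes to classical AdaGrad can be illustrated with a minimal sketch. This shows scalar AdaGrad only, not GradaGrad, whose accumulation rule is not given on this page. The function name `adagrad_step_sizes`, the `lr` and `eps` parameters, and the synthetic Gaussian gradients are assumptions made for illustration.

```python
import math
import random


def adagrad_step_sizes(grads, lr=1.0, eps=1e-8):
    """Return the scalar AdaGrad step size after each gradient is seen.

    Hypothetical helper for illustration: step_t = lr / (sqrt(sum of g_i^2) + eps).
    """
    accum = 0.0  # running sum of squared gradients; it never decreases
    sizes = []
    for g in grads:
        accum += g * g
        sizes.append(lr / (math.sqrt(accum) + eps))
    return sizes


# Synthetic gradients (an assumption; any sequence of reals would do).
random.seed(0)
grads = [random.gauss(0.0, 1.0) for _ in range(50)]

step_sizes = adagrad_step_sizes(grads)

# Because the accumulated sum only grows, each step size is no larger
# than the one before it.
is_non_increasing = all(b <= a for a, b in zip(step_sizes, step_sizes[1:]))
print(is_non_increasing)
```

Because the denominator never shrinks, a poorly chosen `lr` cannot be corrected upward later, which is why the abstract notes that this scaling hyper-parameter needs careful tuning.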

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Adaptive Filtering Techniques

Methods: AdaGrad