Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method
Aaron Defazio, Baoyu Zhou, Lin Xiao

TL;DR
GradaGrad is a non-monotone adaptive stochastic gradient method whose learning rate can both increase and decrease, overcoming AdaGrad's limitation that step sizes can only shrink over time.
Contribution
The paper proposes GradaGrad, an adaptive gradient method in the AdaGrad family whose denominator accumulation can both grow and shrink, so the learning rate adapts in either direction while keeping a convergence rate similar to AdaGrad's.
Findings
Achieves a convergence rate similar to AdaGrad's
Demonstrates non-monotone adaptation in experiments
Shows improved flexibility in learning rate adjustment
Abstract
The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
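The monotonicity that the abstract attributes to classical AdaGrad can be illustrated with a minimal sketch. It covers only a scalar AdaGrad step-size rule, not GradaGrad's own accumulation, which this summary does not specify. The function name `adagrad_step_sizes`, the `lr` and `eps` parameters, and the sample gradients are assumptions chosen for illustration.

```python
import math

def adagrad_step_sizes(grads, lr=1.0, eps=1e-8):
    """Return the effective step size of scalar AdaGrad at each step.

    grads: sequence of scalar gradients (illustrative values, not from the paper).
    lr:    learning rate scaling hyper-parameter (assumed default).
    eps:   small constant to avoid division by zero (assumed default).
    """
    accum = 0.0  # running sum of squared gradients
    step_sizes = []
    for g in grads:
        accum += g * g  # g*g >= 0, so the sum never decreases
        step_sizes.append(lr / (math.sqrt(accum) + eps))
    return step_sizes

# Even when later gradients are small, the step size never recovers.
sizes = adagrad_step_sizes([1.0, 0.5, 0.1, 2.0])
print(sizes)
```

Because the accumulated sum never decreases, each step size is at most the previous one, which is why the scaling hyper-parameter `lr` has to be tuned carefully up front.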
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Adaptive Filtering Techniques
Methods: AdaGrad
