Formal guarantees for heuristic optimization algorithms used in machine learning
Xiaoyu Li

TL;DR
This work provides formal guarantees for several heuristic optimization methods widely used in machine learning, including adaptive step sizes, exponential and cosine step sizes, and momentum. The guarantees cover convergence conditions and automatic adaptation to the noise level of the stochastic gradients.
Contribution
It introduces new convergence analyses for heuristic methods such as Delayed AdaGrad, exponential and cosine step sizes, and momentum, filling gaps in their theoretical understanding.
Findings
Delayed AdaGrad achieves almost sure convergence under certain conditions.
Exponential and cosine step sizes have proven convergence guarantees in non-convex settings.
The last iterate of momentum methods can be optimal in convex stochastic optimization.
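To make the step-size findings concrete, the following is a minimal sketch of the two schedules in their commonly used forms: an exponential step size, eta_t = eta0 * alpha^t, and a cosine step size, eta_t = (eta0 / 2) * (1 + cos(pi * t / T)). The function names and the parameters `eta0`, `alpha`, and `total_steps` are illustrative assumptions, and the exact parameterization analyzed in the paper may differ.

```python
import math


def exponential_step_size(t, eta0, alpha):
    """Exponential schedule: eta_t = eta0 * alpha**t, with 0 < alpha < 1 (assumed form)."""
    return eta0 * alpha ** t


def cosine_step_size(t, total_steps, eta0):
    """Cosine schedule: eta_t = (eta0 / 2) * (1 + cos(pi * t / T)) (assumed form)."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / total_steps))


if __name__ == "__main__":
    T = 10
    for t in range(T + 1):
        exp_eta = exponential_step_size(t, eta0=0.1, alpha=0.8)
        cos_eta = cosine_step_size(t, total_steps=T, eta0=0.1)
        print(t, exp_eta, cos_eta)
```

Both schedules start at `eta0` and decay over the run without requiring knowledge of the gradient noise level in advance; the cosine schedule reaches zero exactly at `t = total_steps`, while the exponential schedule only approaches zero.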
Abstract
Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization of machine learning (ML) problems. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes to heuristic methods to change the step size in each iteration. Also, momentum has been widely employed in ML tasks to accelerate the training process. Yet, there is a gap in our theoretical understanding of them. In this work, we start to close this gap by providing formal guarantees to a few heuristic optimization methods and proposing improved algorithms. First, we analyze a generalized version of the AdaGrad (Delayed AdaGrad) step sizes in both convex and non-convex settings, showing that these step sizes allow the algorithms to automatically adapt to the level of noise of the stochastic gradients. We show for the first…
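As an illustration of the adaptive step sizes mentioned in the abstract, here is a hedged sketch of SGD with a "delayed" AdaGrad-style global step size. It assumes the form eta_t = alpha / (beta + sum of squared norms of past gradients)^(1/2 + eps), where "delayed" means the current gradient is excluded from the sum, so the step size does not depend on the gradient it multiplies. The function name and the parameters `alpha`, `beta`, and `eps` are assumptions for this example, not taken from the text above.

```python
import random


def delayed_adagrad_sgd(grad_fn, x0, steps, alpha=1.0, beta=1.0, eps=0.0):
    """SGD with a delayed AdaGrad-style global step size (illustrative sketch).

    grad_fn(x) returns a (possibly stochastic) gradient as a list of floats.
    """
    x = list(x0)
    sum_sq = 0.0  # squared gradient norms from PREVIOUS iterations only
    for _ in range(steps):
        g = grad_fn(x)
        # The step size uses only past gradients (the "delay").
        eta = alpha / (beta + sum_sq) ** (0.5 + eps)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        # Add the current gradient after the update, for use in later steps.
        sum_sq += sum(gi * gi for gi in g)
    return x


def make_noisy_quadratic_grad(noise_std, seed=0):
    """Gradient of f(x) = 0.5 * ||x||^2 plus Gaussian noise (toy problem)."""
    rng = random.Random(seed)

    def grad(x):
        return [xi + rng.gauss(0.0, noise_std) for xi in x]

    return grad


if __name__ == "__main__":
    grad = make_noisy_quadratic_grad(noise_std=0.1)
    print(delayed_adagrad_sgd(grad, [5.0, -3.0], steps=200, alpha=0.5))
```

Because the accumulated sum grows faster when the gradients are noisier, the step size shrinks faster under high noise and stays larger when the noise is low, which is the sense in which such step sizes adapt to the noise level.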
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent · AdaGrad
