Formal guarantees for heuristic optimization algorithms used in machine learning

Xiaoyu Li

arXiv:2208.00502·cs.LG·August 2, 2022

TL;DR

This paper provides formal guarantees for several heuristic optimization methods used in machine learning, namely adaptive step sizes, step-size schedules, and momentum, covering convergence conditions and adaptivity to the noise level of the stochastic gradients.

Contribution

It introduces new convergence analyses for heuristic methods such as Delayed AdaGrad, exponential and cosine step sizes, and momentum, filling gaps in their theoretical understanding.

Findings

01

Delayed AdaGrad achieves almost sure convergence under certain conditions.

02

Exponential and cosine step sizes have proven convergence guarantees in non-convex settings.

03

The last iterate of momentum methods can be optimal in convex stochastic optimization.
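
To make the step-size schedules named in finding 02 concrete, here is a minimal sketch of their commonly used forms. The parametrization is an assumption, not taken from this page: the exponential schedule is written as eta_t = eta0 * alpha**t with alpha = (beta / T) ** (1 / T), and the cosine schedule as eta_t = (eta0 / 2) * (1 + cos(pi * t / T)). The function names and default values are hypothetical.

```python
import math


def exponential_step_size(t, total_steps, eta0=0.1, beta=1.0):
    """Exponentially decaying step size: eta0 * alpha**t.

    Assumed parametrization: alpha = (beta / total_steps) ** (1 / total_steps),
    so the step size shrinks from eta0 at t = 0 to eta0 * beta / total_steps
    at t = total_steps.
    """
    alpha = (beta / total_steps) ** (1.0 / total_steps)
    return eta0 * alpha ** t


def cosine_step_size(t, total_steps, eta0=0.1):
    """Cosine step size: decays from eta0 at t = 0 to 0 at t = total_steps."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / total_steps))


if __name__ == "__main__":
    T = 100
    for t in (0, 25, 50, 75, 100):
        print(t, exponential_step_size(t, T), cosine_step_size(t, T))
```

Both schedules need the total number of iterations T in advance; neither looks at the observed gradients, which is what separates them from AdaGrad-style adaptive step sizes.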

Abstract

Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization of machine learning (ML) problems. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes to heuristic methods to change the step size in each iteration. Also, momentum has been widely employed in ML tasks to accelerate the training process. Yet, there is a gap in our theoretical understanding of them. In this work, we start to close this gap by providing formal guarantees to a few heuristic optimization methods and proposing improved algorithms. First, we analyze a generalized version of the AdaGrad (Delayed AdaGrad) step sizes in both convex and non-convex settings, showing that these step sizes allow the algorithms to automatically adapt to the level of noise of the stochastic gradients. We show for the first…
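
The Delayed AdaGrad step size mentioned in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's exact algorithm: it assumes a single global step size of the form eta_t = alpha / (beta + sum of squared norms of past gradients) ** (1/2 + eps), where "delayed" means the current gradient is excluded from the sum. The names `delayed_adagrad_sgd`, `grad_fn`, `alpha`, `beta`, and `eps` are hypothetical.

```python
import random


def delayed_adagrad_sgd(grad_fn, x0, steps, alpha=1.0, beta=1.0, eps=0.0):
    """SGD with an assumed 'delayed' AdaGrad-style global step size.

    grad_fn(x) returns a (possibly stochastic) gradient as a list of floats.
    The step size at each iteration uses only gradients from earlier
    iterations, so it does not depend on the current stochastic gradient.
    """
    x = list(x0)
    sq_sum = 0.0  # running sum of squared norms of *past* gradients
    for _ in range(steps):
        g = grad_fn(x)
        eta = alpha / (beta + sq_sum) ** (0.5 + eps)  # computed before adding g
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        sq_sum += sum(gi * gi for gi in g)  # g only affects later step sizes
    return x


if __name__ == "__main__":
    random.seed(0)

    def noisy_quadratic_grad(x, sigma=0.1):
        # Gradient of f(x) = 0.5 * ||x||^2 plus Gaussian noise (illustrative).
        return [xi + random.gauss(0.0, sigma) for xi in x]

    print(delayed_adagrad_sgd(noisy_quadratic_grad, [1.0, -2.0], steps=1000))
```

In this sketch, noisier gradients make `sq_sum` grow faster and the step size shrink faster, while noiseless gradients that vanish near a minimizer leave the step size roughly constant. That is the intuition behind adapting to the noise level; the formal statement and conditions are in the paper.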

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent · AdaGrad