Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM
Guillaume Garrigos, Robert M. Gower, Fabian Schaipp

TL;DR
This paper introduces adaptive SGD variants using Polyak stepsize and function splitting, with a focus on empirical risk minimization, but finds limited practical advantages over standard SGD.
Contribution
Develops the $\texttt{SPS}_+$ and $\texttt{FUVAL}$ methods, extending Polyak stepsize adaptive techniques to ERM with new analysis approaches.
Findings
$\texttt{SPS}_+$ achieves the best known convergence rates for non-smooth Lipschitz problems.
$\texttt{FUVAL}$ can be viewed as a projection, prox-linear, and online SGD method.
Full-batch $\texttt{FUVAL}$ shows minor advantages over GD; the stochastic version does not outperform SGD.
Abstract
Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. This is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that $\texttt{SPS}_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of $\texttt{FUVAL}$, as a projection based…
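The idealized step described in the abstract (a Polyak step size computed from the sampled loss, with the step forced to be non-negative) can be illustrated with a minimal sketch. This is not the authors' code. It assumes the update $x_{t+1} = x_t - \frac{(f_i(x_t) - f_i(x^*))_+}{\|g_i\|^2} g_i$, where $g_i$ is a (sub)gradient of the sampled loss $f_i$ at $x_t$. The function names `sps_plus_step` and `run_sps_plus`, the absolute-error loss, and the consistent linear system (so that every $f_i(x^*) = 0$ is known) are illustrative assumptions chosen to give a Lipschitz non-smooth test problem.

```python
import numpy as np


def sps_plus_step(x, loss_i, grad_i, loss_i_star, eps=1e-12):
    """One assumed SPS_+ step: Polyak step size with the loss gap clipped at zero."""
    gap = max(loss_i - loss_i_star, 0.0)  # positive part keeps the step size >= 0
    g2 = float(np.dot(grad_i, grad_i))
    if g2 <= eps:  # zero (sub)gradient: nothing to do
        return x
    return x - (gap / g2) * grad_i


def run_sps_plus(A, b, n_steps=2000, seed=0):
    """Minimize the average of f_i(x) = |a_i^T x - b_i| by sampling one term per step.

    Hypothetical setup: the system A x = b is consistent, so f_i(x*) = 0 for all i.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)            # sample one loss term uniformly
        r = A[i] @ x - b[i]            # residual of the sampled term
        loss = abs(r)                  # non-smooth, Lipschitz sampled loss
        g = np.sign(r) * A[i]          # a subgradient of |a_i^T x - b_i|
        x = sps_plus_step(x, loss, g, loss_i_star=0.0)
    return x


# Small synthetic problem with a known solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_star = rng.standard_normal(5)
b = A @ x_star

x_hat = run_sps_plus(A, b)
final_error = float(np.linalg.norm(x_hat - x_star))
print("distance to solution:", final_error)
```

On this particular toy problem the step reduces to a randomized Kaczmarz projection onto the sampled equation, which is one way to see why knowing the loss at optimality is such a strong assumption and why learning those values, as $\texttt{FUVAL}$ sets out to do, is the harder case.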
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
Methods: Focus · Semi-Pseudo-Label · Stochastic Gradient Descent
