The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Rong Ge; Sham M. Kakade; Rahul Kidambi; Praneeth Netrapalli

arXiv:1904.12838 · cs.LG · October 30, 2019 · 66 cites

TL;DR

This paper investigates the final iterate behavior of SGD in streaming least squares regression, showing that geometrically decaying step sizes significantly improve convergence rates over polynomial decay, approaching minimax optimality.

Contribution

It introduces the step decay schedule for SGD, demonstrating its near-optimal convergence for the final iterate in streaming least squares problems, outperforming polynomial decay schemes.
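As a rough illustration of the two families being compared, the sketch below defines a step decay schedule next to a polynomial decay schedule. The halving factor, the choice of roughly log2(T) equal-length phases, and the function names `step_decay_lr` and `poly_decay_lr` are assumptions made for this example; they are not taken from this page.

```python
import math


def step_decay_lr(t, total_steps, lr0, factor=0.5):
    """Learning rate at step t (0-indexed) under a step decay schedule.

    Assumption: the horizon is split into about log2(total_steps) phases of
    equal length, and the rate is multiplied by `factor` at each boundary,
    so it decays geometrically across phases.
    """
    num_phases = max(1, int(math.log2(total_steps)))
    phase_len = math.ceil(total_steps / num_phases)
    phase = t // phase_len
    return lr0 * factor ** phase


def poly_decay_lr(t, lr0, power=1.0):
    """Polynomially decaying rate lr0 / (1 + t)^power, for comparison."""
    return lr0 / (1 + t) ** power
```

Under these assumptions the step decay rate stays constant within a phase and is cut only about log2(T) times over the whole run, whereas the polynomial rate shrinks at every step.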

Findings

01. Step decay schedules achieve near minimax optimal rates.

02. Polynomial decay step sizes are sub-optimal for final iterate convergence.

03. Anytime behavior of SGD's final iterate is poor regardless of step size.

Abstract

Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e., the number of iterations SGD is run for) is known in advance, SGD's final iterate behavior with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly…
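The setting in the abstract can be made concrete with a small simulation: single-sample SGD on a synthetic streaming least squares problem, reporting the excess risk of the final iterate rather than an average. This is only a sketch under stated assumptions. The two-dimensional Gaussian data model, the covariance spectrum, the noise level, the initial rate of 0.1, and the helper names `sgd_final_iterate` and `excess_risk` are all hypothetical and are not the paper's experimental setup.

```python
import math
import random


def sgd_final_iterate(total_steps, lr_at, seed=0):
    """Run one-sample-per-step SGD on synthetic least squares data.

    Returns the final iterate (no averaging). `lr_at(t)` gives the step size.
    """
    rng = random.Random(seed)
    eigs = [1.0, 0.1]        # hypothetical covariance eigenvalues
    w_star = [1.0, 1.0]      # hypothetical ground-truth weights
    noise_std = 0.1          # hypothetical label noise level
    w = [0.0, 0.0]
    for t in range(total_steps):
        # Draw a fresh sample (x, y): this is the streaming setting.
        x = [math.sqrt(lam) * rng.gauss(0.0, 1.0) for lam in eigs]
        y = sum(a * b for a, b in zip(w_star, x)) + noise_std * rng.gauss(0.0, 1.0)
        # Stochastic gradient of 0.5 * (<w, x> - y)^2 is (<w, x> - y) * x.
        resid = sum(a * b for a, b in zip(w, x)) - y
        lr = lr_at(t)
        w = [wi - lr * resid * xi for wi, xi in zip(w, x)]
    return w


def excess_risk(w, eigs=(1.0, 0.1), w_star=(1.0, 1.0)):
    """Excess risk 0.5 * (w - w*)^T H (w - w*) for diagonal covariance H."""
    return 0.5 * sum(lam * (wi - ws) ** 2 for lam, wi, ws in zip(eigs, w, w_star))


T = 2000
phases = max(1, int(math.log2(T)))
phase_len = math.ceil(T / phases)

# Step decay: halve the rate at each of ~log2(T) phase boundaries (assumed form).
w_step = sgd_final_iterate(T, lambda t: 0.1 * 0.5 ** (t // phase_len))
# Polynomial decay: rate proportional to 1 / (1 + t).
w_poly = sgd_final_iterate(T, lambda t: 0.1 / (1 + t))

print("step decay excess risk:", excess_risk(w_step))
print("poly decay excess risk:", excess_risk(w_poly))
```

A single seed on a toy problem does not demonstrate the paper's rates. The example only shows how the final iterate, the schedule, and the excess risk fit together.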

Code & Models

Repositories

D-X-Y/ResNeXt-DenseNet · PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Methods: Step Decay · Stochastic Gradient Descent