Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize

Mert Gürbüzbalaban; Yuanhan Hu; Umut Şimşekli; Lingjiong Zhu

arXiv:2302.05516·stat.ML·August 30, 2023

TL;DR

This paper studies how cyclic and randomized stepsizes in SGD lead to heavier tails in the distribution of the iterates than a constant stepsize does. It offers theoretical results and empirical evidence linking this tail behavior to the generalization benefits of such stepsizes.

Contribution

It introduces a general class of Markovian stepsizes, analyzes their impact on the tail-index, and shows that cyclic and randomized stepsizes can produce heavier tails than a constant stepsize, which helps explain their benefits.

Findings

01

Cyclic and randomized stepsizes can produce heavier tails in SGD.

02

Heavier tails are correlated with improved generalization.

03

Markovian stepsizes can outperform constant stepsize in experiments.

Abstract

Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contains i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases. Motivated by the literature which shows that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study the tail-index and provide a number of theoretical results that demonstrate how the tail-index varies with the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize…
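The statement that Markovian stepsizes contain i.i.d., cyclic, and constant stepsizes as special cases can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the paper: the finite stepsize grid, the function names, the transition matrices, and the streaming least-squares problem with Gaussian data. The idea is only that a stepsize driven by a finite-state Markov chain reduces to a cycle when the transition matrix is a cyclic permutation, to i.i.d. draws when all rows are equal, and to a constant when there is a single state.

```python
import numpy as np

def cyclic_steps(etas, n):
    """Deterministic cycle through a fixed grid of stepsizes."""
    etas = np.asarray(etas, dtype=float)
    return etas[np.arange(n) % len(etas)]

def iid_steps(etas, n, rng):
    """Stepsizes drawn i.i.d. uniformly from the grid."""
    return rng.choice(np.asarray(etas, dtype=float), size=n)

def markov_steps(etas, P, n, rng, start=0):
    """Stepsizes driven by a finite-state Markov chain.

    P[i, j] is the probability of moving from stepsize i to stepsize j
    (an illustrative parameterization, assumed here).
    """
    etas = np.asarray(etas, dtype=float)
    P = np.asarray(P, dtype=float)
    states = np.empty(n, dtype=int)
    state = start
    for k in range(n):
        states[k] = state
        state = rng.choice(len(etas), p=P[state])
    return etas[states]

def sgd_least_squares(steps, rng, d=5):
    """One-sample SGD on 0.5 * (a @ x - y)**2 with fresh Gaussian data.

    The data model (standard normal features and labels) is a hypothetical
    choice for this sketch.
    """
    x = np.zeros(d)
    for eta in steps:
        a = rng.standard_normal(d)
        y = rng.standard_normal()
        x = x - eta * (a @ x - y) * a  # gradient of the one-sample loss
    return x

rng = np.random.default_rng(0)
etas = [0.01, 0.05, 0.1]

# Special cases expressed through the transition matrix:
P_cyclic = np.roll(np.eye(3), 1, axis=1)  # state 0 -> 1 -> 2 -> 0
P_iid = np.full((3, 3), 1 / 3)            # next state ignores the current one
P_const = np.eye(1)                       # a single state

final_iterate = sgd_least_squares(markov_steps(etas, P_cyclic, 1000, rng), rng)
```

To study tails empirically under these assumptions, one would repeat `sgd_least_squares` over many independent runs per schedule and apply a tail-index estimator to the norms of the final iterates; that step is left out here.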

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Advanced Multi-Objective Optimization Algorithms

Methods: Stochastic Gradient Descent · Linear Regression