Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize
Mert Gürbüzbalaban, Yuanhan Hu, Umut Şimşekli, Lingjiong Zhu

TL;DR
This paper investigates how cyclic and randomized stepsizes in SGD lead to heavier tails in the distribution of iterates, providing theoretical insights and empirical evidence that these stepsizes can improve generalization by influencing tail behavior.
Contribution
The paper introduces a general class of Markovian stepsizes, analyzes their impact on the tail-index of the SGD iterates, and shows how cyclic and randomized stepsizes can produce heavier tails than a constant stepsize, which helps explain their benefits.
Findings
Cyclic and randomized stepsizes can produce heavier tails in SGD.
Heavier tails are correlated with improved generalization.
Markovian stepsizes can outperform constant stepsize in experiments.
Abstract
Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contains i.i.d. random stepsize, cyclic stepsize, and constant stepsize as special cases. Motivated by the literature showing that the heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study the tail-index and provide a number of theoretical results that demonstrate how the tail-index varies with the stepsize schedule. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize…
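The abstract's claim that constant, cyclic, and i.i.d. stepsizes are all special cases of a Markovian stepsize can be illustrated with a minimal sketch. It assumes a finite-state Markov chain over a list of stepsize values. All function names here are hypothetical and not taken from the paper, and the one-dimensional streaming least-squares loop is an assumed toy setting chosen only because the page lists linear regression among the methods.

```python
import random


def markov_stepsizes(values, transition, n_steps, rng, start=0):
    """Yield n_steps stepsizes from a finite-state Markov chain.

    values[i] is the stepsize used in state i; transition[i][j] is the
    probability of moving from state i to state j. If start is None,
    the initial state is drawn uniformly at random.
    """
    states = range(len(values))
    state = rng.randrange(len(values)) if start is None else start
    for _ in range(n_steps):
        yield values[state]
        state = rng.choices(states, weights=transition[state])[0]


def constant_chain(eta):
    # One state that always returns to itself.
    return [eta], [[1.0]]


def cyclic_chain(values):
    # Deterministic chain: state i always moves to state (i + 1) mod m.
    m = len(values)
    transition = [[1.0 if j == (i + 1) % m else 0.0 for j in range(m)]
                  for i in range(m)]
    return list(values), transition


def iid_chain(values):
    # Identical uniform rows: the next state ignores the current one.
    m = len(values)
    return list(values), [[1.0 / m] * m for _ in range(m)]


def sgd_least_squares(stepsizes, rng, x0=0.0):
    """Toy 1-D SGD on 0.5 * (a * x - y)**2 with a fresh Gaussian (a, y) per step.

    The Gaussian data model is an assumption made for illustration.
    """
    x = x0
    for eta in stepsizes:
        a = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        x -= eta * a * (a * x - y)
    return x


rng = random.Random(0)
values, transition = cyclic_chain([0.01, 0.05, 0.1])
etas = list(markov_stepsizes(values, transition, 6, rng))
x_final = sgd_least_squares(etas, rng)
```

Writing all three schedules as transition matrices keeps a single sampling routine, so comparing them only requires swapping the `(values, transition)` pair. For the i.i.d. case, pass `start=None` so that the first stepsize is also drawn at random.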
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Advanced Multi-Objective Optimization Algorithms
Methods: Stochastic Gradient Descent · Linear Regression
