Convergence of Stochastic Gradient Descent with mini-batching and infinite variance
Bartosz Glowacki, Rafal Kulik, Philippe Soulier

TL;DR
This paper analyzes how mini-batched stochastic gradient descent behaves under heavy-tailed gradient noise, establishing convergence rates and distributional limits when noise follows an alpha-stable law.
Contribution
It provides new theoretical insights into SGD with increasing batch sizes under heavy-tailed noise, including convergence bounds and limit distributions.
Findings
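Increasing batch sizes yield faster convergence rates, quantified by moment bounds on the SGD error.
With batching, SGD converges in probability even with a constant stepsize.
Properly normalized SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by an alpha-stable Lévy process.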
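The first two findings can be illustrated with a small simulation. The sketch below is not the paper's experiment: the quadratic objective f(x) = x**2 / 2, the symmetric Pareto noise (tail index alpha = 1.5, so mean zero but infinite variance), the batch schedule b_k = k, and the helper names `heavy_tailed_noise`, `minibatch_sgd`, and `median_abs_tail` are all assumptions chosen for illustration. It compares a constant stepsize with batch size 1 against the same stepsize with growing batches.

```python
import random


def heavy_tailed_noise(rng, alpha):
    """Symmetric Pareto draw: P(|Z| > t) = t**(-alpha) for t >= 1."""
    u = 1.0 - rng.random()  # lies in (0, 1], so the power below is finite
    magnitude = u ** (-1.0 / alpha)
    return magnitude if rng.random() < 0.5 else -magnitude


def minibatch_sgd(steps, stepsize, alpha, batch_size, x0=5.0, seed=0):
    """Run SGD on f(x) = x**2 / 2, whose exact gradient is x.

    batch_size is a function k -> int giving the batch size at step k
    (an illustrative schedule, not one taken from the paper).
    Returns the list of iterates x_0, ..., x_steps.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for k in range(1, steps + 1):
        b = batch_size(k)
        # Average b independent noise draws to form the mini-batch noise.
        noise = sum(heavy_tailed_noise(rng, alpha) for _ in range(b)) / b
        x = x - stepsize * (x + noise)
        path.append(x)
    return path


def median_abs_tail(path, last=200):
    """Median of |x_k| over the final `last` iterates (robust to rare jumps)."""
    tail = sorted(abs(x) for x in path[-last:])
    return tail[len(tail) // 2]


if __name__ == "__main__":
    alpha, stepsize, steps = 1.5, 0.1, 1000
    fixed = minibatch_sgd(steps, stepsize, alpha, lambda k: 1)
    growing = minibatch_sgd(steps, stepsize, alpha, lambda k: k)
    print("batch size 1:", median_abs_tail(fixed))
    print("batch size k:", median_abs_tail(growing))
```

The median is used instead of a mean squared error because the second moment of the noise does not exist in this regime, so an empirical mean square would be dominated by rare large jumps.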
Abstract
Stochastic gradient descent (SGD) with mini-batching is a standard tool in large-scale optimization, yet its theoretical properties under heavy-tailed gradient noise remain largely unexplored. In this paper we study SGD with increasing batch sizes when the gradient noise belongs to the domain of attraction of an alpha-stable law with infinite variance. Building on existing results for the finite-variance regime and for heavy-tailed SGD without batching, we establish three main results. First, we derive moment bounds for the SGD error and show that increasing batch sizes lead to faster convergence rates. In particular, batching enables convergence in probability even for a constant stepsize. Second, we prove that the properly normalized SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by an alpha-stable Lévy process. Third,…
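For readers unfamiliar with the limit object in the second result, the following is a generic scalar sketch of an Ornstein-Uhlenbeck process driven by a symmetric alpha-stable Lévy process. The symbols \theta, \sigma, Y, and L are illustrative assumptions and are not the paper's notation; the paper's setting may be multivariate and need not be symmetric.

```latex
% Scalar OU process driven by a symmetric alpha-stable Levy process L,
% normalized so that E[exp(i u L_1)] = exp(-sigma^alpha |u|^alpha).
\[
  dY_t = -\theta\, Y_t\, dt + dL_t, \qquad \theta > 0 .
\]
% Its stationary law is that of the stochastic integral
\[
  Y_\infty \overset{d}{=} \int_0^\infty e^{-\theta s}\, dL_s ,
\]
% which is again symmetric alpha-stable, with characteristic function
\[
  \mathbb{E}\, e^{\,i u Y_\infty}
  = \exp\!\Big( -\sigma^\alpha |u|^\alpha \int_0^\infty e^{-\alpha\theta s}\, ds \Big)
  = \exp\!\Big( -\frac{\sigma^\alpha}{\alpha\theta}\, |u|^\alpha \Big).
\]
```

In this scalar case the stationary scale parameter is \sigma (\alpha\theta)^{-1/\alpha}, so a stronger mean reversion \theta gives a tighter stationary law.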
