Convergence of Markov Chains for Constant Step-size Stochastic Gradient Descent with Separable Functions

David Shirokoff; Philip Zaleski

arXiv:2409.12243·math.OC·March 26, 2025

TL;DR

This paper analyzes the long-time behavior of constant step-size stochastic gradient descent (SGD) on separable (non-convex) objective functions by treating the iterates as a Markov chain. It characterizes the invariant measures, shows geometric convergence to them, and gives examples, including bifurcations, in which the diffusion approximation fails to describe the long-time dynamics.

Contribution

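The paper establishes a Doeblin-type decomposition for the SGD Markov chain on separable functions: the state space splits into a uniformly transient set and a disjoint union of absorbing sets, each carrying a unique invariant measure. It shows convergence to the invariant measures and highlights complex dynamics such as bifurcations.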

Findings

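01. The set of all invariant measures is the convex hull of the unique invariant measures on the absorbing sets, and it is a global attractor for the Markov chain with a geometric convergence rate.

02. SGD iterates can leave the global minimum, and the diffusion approximation fails to characterize the long-time dynamics.

03. Bifurcations can enable transitions between local minima.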

Abstract

Stochastic gradient descent (SGD) is a popular algorithm for minimizing objective functions that arise in machine learning. For constant step-size SGD, the iterates form a Markov chain on a general state space. Focusing on a class of separable (non-convex) objective functions, we establish a "Doeblin-type decomposition," in that the state space decomposes into a uniformly transient set and a disjoint union of absorbing sets. Each of the absorbing sets contains a unique invariant measure, with the set of all invariant measures being the convex hull. Moreover, the set of invariant measures is shown to be a global attractor of the Markov chain with a geometric convergence rate. The theory is highlighted with examples that show: (1) the failure of the diffusion approximation to characterize the long-time dynamics of SGD; (2) the global minimum of an objective function may lie outside the…
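The Markov chain viewpoint in the abstract can be illustrated with a minimal sketch. The toy objective below is an assumption made for illustration only: it averages two convex quadratics, f_0(x) = 0.5(x - 1)^2 and f_1(x) = 0.5(x + 1)^2, and is not taken from the paper or from its class of separable non-convex functions. The names `grad_f` and `sgd_chain` are hypothetical. Each step picks one term uniformly at random and takes a gradient step with a fixed step size, so the next iterate depends only on the current iterate and fresh randomness.

```python
import random


def grad_f(i, x):
    # Gradient of the hypothetical toy term f_i(x) = 0.5 * (x - c_i)^2,
    # with c_0 = +1 and c_1 = -1 (an illustrative choice, not from the paper).
    center = 1.0 if i == 0 else -1.0
    return x - center


def sgd_chain(x0, step, n_steps, seed=0):
    # Constant step-size SGD: x_{k+1} = x_k - step * f_i'(x_k), i uniform in {0, 1}.
    # The update uses only the current state x_k, so the path is a Markov chain.
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        i = rng.randrange(2)
        x = x - step * grad_f(i, x)
        path.append(x)
    return path


# Two chains driven by the same random draws but started at different points.
path_a = sgd_chain(x0=-1.0, step=0.3, n_steps=20000, seed=0)
path_b = sgd_chain(x0=1.0, step=0.3, n_steps=20000, seed=0)

# Discard an initial transient and summarize the long-run iterates.
tail = path_a[1000:]
tail_mean = sum(tail) / len(tail)
gap = abs(path_a[-1] - path_b[-1])
print("long-run range:", min(tail), max(tail))
print("long-run mean:", tail_mean)
print("gap between coupled chains:", gap)
```

In this toy case each update is x ↦ (1 - step)·x ± step, so the iterates never settle at the minimizer x = 0 of the averaged objective; they keep moving inside [-1, 1]. The gap between the two coupled chains shrinks by a factor (1 - step) per step. That is a simple instance of geometric convergence, and it relies on the quadratic structure assumed here rather than on the paper's argument.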

Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Stochastic Gradient Optimization Techniques