Convergence Analysis of the Last Iterate in Distributed Stochastic Gradient Descent with Momentum

Difei Cheng; Ruinan Jin; Hong Qiao; Bo Zhang

arXiv:2505.10889 · math.OC · May 19, 2025 · Neurocomputing

Open Access

TL;DR

This paper analyzes the last-iterate convergence of distributed momentum stochastic gradient descent in non-convex settings, providing theoretical guarantees and showing momentum's acceleration effect.

Contribution

It offers the first theoretical analysis of last-iterate convergence for distributed mSGD in non-convex scenarios, including convergence rates and acceleration insights.

Findings

01 Proves almost sure and $L_2$ convergence of the last iterate.

02 Shows momentum accelerates early-stage convergence.

03 Provides experimental validation of theoretical results.

Abstract

Distributed stochastic gradient methods are widely used to preserve data privacy and ensure scalability in large-scale learning tasks. While existing theory on distributed momentum Stochastic Gradient Descent (mSGD) mainly focuses on time-averaged convergence, the more practical last-iterate convergence remains underexplored. In this work, we analyze the last-iterate convergence behavior of distributed mSGD in non-convex settings under the classical Robbins-Monro step-size schedule. We prove both almost sure convergence and $L_{2}$ convergence of the last iterate, and derive convergence rates. We further show that momentum can accelerate early-stage convergence, and provide experiments to support our theory.
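
To make the setting concrete, here is a minimal single-process sketch of distributed momentum SGD that returns the last iterate under a Robbins-Monro step size. It is an illustration under stated assumptions, not the authors' algorithm or experimental setup. The function names, the heavy-ball form of the update, the schedule $\epsilon_n = c/(n+1)^p$, and the toy quadratic local losses are all hypothetical. The quadratic is used only so the result is easy to check, whereas the paper treats non-convex objectives.

```python
import random


def robbins_monro_step(n, c=0.1, p=0.75):
    """Step size eps_n = c / (n + 1)**p (an assumed schedule).

    For 1/2 < p <= 1 the sum of eps_n diverges while the sum of eps_n**2
    converges, which is the classical Robbins-Monro requirement.
    """
    return c / (n + 1) ** p


def local_stochastic_gradient(x, target, rng, noise_std=1.0):
    """Noisy gradient of the toy local loss 0.5 * (x - target)**2 (assumed)."""
    return (x - target) + rng.gauss(0.0, noise_std)


def distributed_msgd(targets, x0=0.0, momentum=0.9, steps=5000, seed=0):
    """Run momentum SGD with gradients averaged over len(targets) workers.

    Returns the last iterate, not a time average of the iterates.
    """
    rng = random.Random(seed)
    x, v = x0, 0.0
    for n in range(steps):
        # Each simulated worker computes a stochastic gradient on its own data.
        grads = [local_stochastic_gradient(x, t, rng) for t in targets]
        # The server averages the workers' gradients.
        avg_grad = sum(grads) / len(grads)
        # Heavy-ball update (one common form; the paper's may differ).
        v = momentum * v + robbins_monro_step(n) * avg_grad
        x = x - v
    return x


if __name__ == "__main__":
    worker_targets = [1.0, 2.0, 3.0, 6.0]  # global minimizer is their mean, 3.0
    print(distributed_msgd(worker_targets, x0=5.0))
```

Because the step size decays, the noise injected at each step shrinks over time, which is why the final iterate itself (rather than an average over iterates) can settle near a stationary point in this toy run.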

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Age of Information Optimization