Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

Sachin Garg; Michał Dereziński

arXiv:2605.18609·cs.LG·May 19, 2026

TL;DR

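This paper develops a theoretical framework for classical momentum (heavy ball and Nesterov-style) in mini-batch stochastic gradient descent on quadratic objectives in the interpolation regime, an abstraction commonly used to study deep learning dynamics. It shows that mini-batch computations can be perfectly parallelized, with acceleration that scales with the mini-batch size up to a saturation point.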

Contribution

The paper introduces a general theory of stochastic momentum acceleration for quadratic optimization in the interpolation regime. The theory covers both heavy ball and Nesterov-style momentum, allows arbitrary mini-batch sizes, and makes minimal noise assumptions.
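
For orientation, the two momentum schemes can be written in their standard textbook form for a mini-batch least-squares objective. The notation here ($A$, $b$, step size $\eta$, momentum parameter $\beta$, mini-batch index set $B_k$) is assumed for illustration and may differ from the paper's own parametrization.

```latex
\begin{align*}
% Interpolation regime: the linear system is consistent, so one x^* fits every row.
f(x) &= \frac{1}{2n}\sum_{i=1}^{n}\bigl(a_i^\top x - b_i\bigr)^2,
  \qquad A x^\ast = b, \\
% Mini-batch gradient over a random index set B_k.
g_k(x) &= \frac{1}{|B_k|}\sum_{i \in B_k} a_i\bigl(a_i^\top x - b_i\bigr), \\
% Heavy ball (Polyak) momentum.
x_{k+1} &= x_k - \eta\, g_k(x_k) + \beta\,(x_k - x_{k-1}), \\
% Nesterov-style momentum: extrapolate first, then take the gradient step.
y_k &= x_k + \beta\,(x_k - x_{k-1}),
  \qquad x_{k+1} = y_k - \eta\, g_k(y_k).
\end{align*}
```

With $|B_k| = 1$, $\beta = 0$, and unit-norm rows $a_i$, the heavy ball update with $\eta = 1$ reduces to the randomized Kaczmarz projection step.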

Findings

01. Acceleration scales with mini-batch size up to saturation

02. Provides a simple, effective momentum parameter choice

03. Enables perfect parallelization of mini-batch computations

Abstract

Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal…
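
The setting described in the abstract can be tried numerically with a minimal sketch: mini-batch SGD with heavy ball momentum on a consistent least-squares problem, which is a quadratic in the interpolation regime. The function names, the problem sizes, and the `step` and `beta` values are hypothetical choices made for this illustration. They are not the paper's algorithm parameters or its recommended momentum choice.

```python
import numpy as np


def make_interpolation_problem(n=200, d=20, seed=0):
    """Build a consistent system A x = b with unit-norm rows (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)  # normalize each row
    x_star = rng.standard_normal(d)
    b = A @ x_star  # exact fit exists, so this is the interpolation regime
    return A, b, x_star


def minibatch_heavy_ball(A, b, batch_size, step, beta, iters, seed=0):
    """Mini-batch SGD with heavy ball momentum on f(x) = mean((A x - b)^2) / 2.

    beta = 0 gives plain mini-batch SGD; batch_size = 1 with step = 1 and
    unit-norm rows gives randomized Kaczmarz.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    for _ in range(iters):
        # Sample a mini-batch of rows uniformly without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        A_batch = A[idx]
        # Average gradient over the mini-batch.
        grad = A_batch.T @ (A_batch @ x - b[idx]) / batch_size
        # Gradient step plus momentum term beta * (x_k - x_{k-1}).
        x_next = x - step * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


if __name__ == "__main__":
    A, b, x_star = make_interpolation_problem()
    # Assumed, untuned settings: compare a few batch sizes with and without momentum.
    for batch_size in (1, 16, 64):
        for beta in (0.0, 0.3):
            x = minibatch_heavy_ball(A, b, batch_size, step=1.0, beta=beta, iters=300)
            err = np.linalg.norm(x - x_star) / np.linalg.norm(x_star)
            print(f"batch={batch_size:3d} beta={beta:.1f} relative error={err:.3e}")
```

Each row's contribution to the mini-batch gradient is independent of the others, so the product `A_batch @ x` and the accumulation `A_batch.T @ (...)` can be split across workers. That is the computation the paper's parallelization claim concerns. This sketch only exposes the knobs (`batch_size`, `beta`) and does not reproduce the paper's rates.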
