Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration
Sachin Garg, Michał Dereziński

TL;DR
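This paper develops a theoretical framework showing that classical momentum methods, such as Polyak's heavy ball, accelerate mini-batch stochastic gradient descent on quadratics in the interpolation regime. The acceleration scales with the mini-batch size up to a saturation point, so mini-batch computations parallelize perfectly. The quadratic setting is a popular abstraction for studying deep learning dynamics.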
Contribution
It introduces a general theory of stochastic momentum acceleration for quadratic optimization in the interpolation regime, covering both heavy ball and Nesterov-style momentum, arbitrary mini-batch sizes, and minimal noise assumptions.
Findings
Acceleration scales with mini-batch size up to saturation
Provides a simple, effective momentum parameter choice
Enables perfect parallelization of mini-batch computations
Abstract
Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal…
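For orientation, the setting described in the abstract can be sketched in a few lines of NumPy: mini-batch SGD with heavy ball momentum on a consistent least-squares problem, where an exact solution exists and the problem is therefore in the interpolation regime. This is a minimal illustration, not the authors' algorithm or their recommended parameter choice. The function names and the values of `step`, `beta`, and `batch_size` are assumptions picked only so that the example converges on this small random instance.

```python
import numpy as np


def minibatch_heavy_ball(A, b, batch_size, step, beta, iters, seed=0):
    """Mini-batch SGD with heavy ball momentum on f(x) = (1/2n) * ||Ax - b||^2.

    Illustrative sketch only: step and beta are hand-picked by the caller,
    not derived from any theory.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    for _ in range(iters):
        # Sample a mini-batch of rows without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        # Stochastic gradient of the least-squares loss on the mini-batch.
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        # Heavy ball update: gradient step plus momentum term beta * (x - x_prev).
        x_next = x - step * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def demo():
    """Return the relative error after running on a random consistent system."""
    rng = np.random.default_rng(1)
    n, d = 200, 20
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star  # consistent system, so x_star interpolates every row
    x = minibatch_heavy_ball(A, b, batch_size=32, step=0.1, beta=0.3, iters=500)
    return np.linalg.norm(x - x_star) / np.linalg.norm(x_star)


if __name__ == "__main__":
    print("relative error:", demo())
```

Because `b = A @ x_star` exactly, every mini-batch gradient vanishes at `x_star`, so the stochastic noise shrinks as the iterate approaches the solution. That is the property that makes linear convergence possible in this regime.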
