Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification
Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford

TL;DR
This paper provides a detailed analysis of mini-batching and tail-averaging in stochastic gradient descent for least squares regression, demonstrating near-linear parallelization speedups and insights into noise effects on convergence.
Contribution
It offers non-asymptotic excess risk bounds for averaging schemes, characterizes parallelization speedups, and analyzes the impact of noise on stepsize choices in SGD.
Findings
Mini-batching reduces variance and enables near-linear parallel speedups.
Tail-averaging decreases variance in the final iterate of SGD.
The analysis shows how noise properties influence optimal stepsize choices.
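For reference, the two averaging schemes named above are usually written as follows for least squares. This is the standard textbook form, not a quotation from the paper, and the symbols are assumptions introduced here: \gamma is the stepsize, b the batch size, n the number of mini-batch steps, and s the number of initial iterates discarded before averaging.

```latex
% Mini-batch SGD step on the squared loss (1/2)(\langle w, x\rangle - y)^2,
% averaging b stochastic gradients per update:
w_{t+1} = w_t - \frac{\gamma}{b} \sum_{i=1}^{b}
          \bigl( \langle w_t, x_{t,i} \rangle - y_{t,i} \bigr)\, x_{t,i}

% Tail-averaging: average only the iterates after the first s steps:
\bar{w}_{s+1:n} = \frac{1}{n - s} \sum_{t=s+1}^{n} w_t
```

Setting b = 1 recovers plain SGD, and s = 0 recovers averaging over all iterates.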
Abstract
This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate. This work presents non-asymptotic excess risk bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batch SGD yields provable near-linear parallelization speedups over SGD with batch size one. This allows for understanding learning rate versus batch size tradeoffs for the final iterate of…
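The two schemes described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `minibatch_tail_averaged_sgd`, its parameters, the synthetic data, and the particular stepsize, batch size, and tail length are all hypothetical choices made for illustration.

```python
import numpy as np

def minibatch_tail_averaged_sgd(X, y, step_size, batch_size, tail_start, seed=0):
    """One pass of mini-batch SGD on the squared loss (hypothetical helper).

    Returns the final iterate and the average of the iterates produced
    from step index `tail_start` onward (the tail average).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    n_avg = 0
    perm = rng.permutation(n)          # visit each sample at most once
    n_steps = n // batch_size
    for t in range(n_steps):
        idx = perm[t * batch_size:(t + 1) * batch_size]
        Xb, yb = X[idx], y[idx]
        # Mini-batching: average the per-sample gradients of 0.5 * (x @ w - y)**2.
        grad = Xb.T @ (Xb @ w - yb) / batch_size
        w = w - step_size * grad
        # Tail-averaging: accumulate only the later iterates.
        if t >= tail_start:
            w_sum += w
            n_avg += 1
    w_tail = w_sum / n_avg if n_avg > 0 else w.copy()
    return w, w_tail

# Synthetic, well-specified linear model (assumed for illustration only).
rng = np.random.default_rng(1)
n, d = 4000, 5
w_star = np.arange(1.0, d + 1.0)
X = rng.standard_normal((n, d))
y = X @ w_star + 0.5 * rng.standard_normal(n)

# 4000 samples with batch size 8 gives 500 steps; average the last 250.
w_last, w_tail = minibatch_tail_averaged_sgd(
    X, y, step_size=0.1, batch_size=8, tail_start=250
)
print("final-iterate error:", np.linalg.norm(w_last - w_star))
print("tail-averaged error:", np.linalg.norm(w_tail - w_star))
```

Varying `batch_size` and `step_size` together in this sketch is one way to explore the learning rate versus batch size tradeoff the abstract refers to, though the paper's bounds, not this toy experiment, are what characterize it.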
Code & Models
- MindCode-4/code-11/tree/main/AccSGD-Parallelizing-Stochastic-Gradient-Descent (mindspore)
- rahulkidambi/AccSGD (pytorch)
- MindCode-4/code-6/tree/main/AccSGD-Parallelizing-Stochastic-Gradient-Descent (mindspore)
- mindspore-ai/contrib/blob/master/application/AccSGD-Parallelizing-Stochastic-Gradient-Descent/AccSGD.py (mindspore)
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Statistical Methods and Inference
Methods: Stochastic Gradient Descent
