The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
Siyuan Ma, Raef Bassily, Mikhail Belkin

TL;DR
This paper explains why stochastic gradient descent (SGD) converges quickly in modern over-parametrized models, showing the role of data interpolation and mini-batch size regimes, with theoretical bounds and experimental validation.
Contribution
The paper gives a formal analysis of SGD convergence in the over-parametrized (interpolation) setting. It identifies a critical mini-batch size that separates two regimes, derives explicit formulas for the quadratic loss, and supports the analysis with experiments.
Findings
Exponential convergence bounds for mini-batch SGD in convex settings.
Identification of a critical mini-batch size separating two regimes.
O(n) acceleration over full gradient descent per unit of computation, where n is the data size.
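The first finding can be illustrated with a minimal sketch, which is not the authors' code. It runs mini-batch SGD on an over-parametrized least-squares problem (more parameters than samples, so an interpolating solution exists) and tracks the training loss. The function name `minibatch_sgd`, the synthetic Gaussian data, and the conservative step size 1/β (with β the largest squared row norm) are assumptions made for this illustration, not values taken from the paper.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size, step_size, num_steps, seed=0):
    """Mini-batch SGD on the loss (1/2n) * ||Xw - y||^2, starting from w = 0.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(num_steps):
        # Record the full training loss before each update.
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
        # Sample a mini-batch without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / batch_size
        w = w - step_size * grad
    losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return w, np.array(losses)

# Synthetic over-parametrized problem (assumption): n = 20 samples, d = 100 parameters.
rng = np.random.default_rng(42)
n, d = 20, 100
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

# Conservative constant step size (assumption): 1 / max_i ||x_i||^2.
beta = np.max(np.sum(X ** 2, axis=1))
w, losses = minibatch_sgd(X, y, batch_size=1, step_size=1.0 / beta, num_steps=3000)
print(f"initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

Because the data can be fit exactly, every per-sample gradient vanishes at the solution, so a constant step size suffices and no step-size decay is needed. Plotting `np.log(losses)` against the step index should show a roughly linear decline, which is the signature of exponential convergence.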
Abstract
In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions we obtain an exponential convergence bound for mini-batch SGD parallel to that for full gradient descent. We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m \leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (linear scaling regime). (b) SGD iteration with…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
