The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Siyuan Ma; Raef Bassily; Mikhail Belkin

arXiv:1712.06559 · cs.LG · June 18, 2018 · 38 cites

TL;DR

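This paper explains why stochastic gradient descent (SGD) converges quickly in modern over-parametrized models: when the model is trained to interpolate the data, mini-batch SGD admits an exponential convergence bound for convex losses, and a critical mini-batch size separates two convergence regimes. The analysis is backed by theoretical bounds and experimental validation.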

Contribution

It provides a formal analysis of SGD convergence in the over-parametrized (interpolation) regime, identifies a critical mini-batch size that separates two regimes, and gives explicit formulas for the quadratic loss, with experimental support.

Findings

1. Exponential convergence bounds for mini-batch SGD in convex settings.
2. Identification of a critical mini-batch size separating two regimes.
3. O(n) acceleration over full gradient descent per unit of computation.

Abstract

In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions we obtain an exponential convergence bound for mini-batch SGD parallel to that for full gradient descent. We show that there is a critical batch size $m^{*}$ such that: (a) SGD iteration with mini-batch size $m \leq m^{*}$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (linear scaling regime). (b) SGD iteration with…
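
The setting described in the abstract can be illustrated with a small numerical sketch. The code below is not taken from the paper: it assumes a synthetic over-parametrized least-squares problem (more parameters than samples, so an interpolating solution exists), and the names `minibatch_sgd`, `step_size`, and the problem sizes are hypothetical choices made for illustration. The step size is set to the inverse of the largest squared row norm, a conservative choice that is an assumption here, not the optimal value derived in the paper. The sketch does not attempt to locate the critical batch size $m^{*}$; it only shows the quadratic loss being driven toward zero by constant-step mini-batch SGD.

```python
import numpy as np


def minibatch_sgd(X, y, batch_size, step_size, num_steps, seed=0):
    """Constant-step mini-batch SGD on the loss (1/2n) * ||Xw - y||^2.

    Returns the final weights and the full-data loss after every step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(num_steps):
        # Sample a mini-batch of distinct indices.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / batch_size
        w -= step_size * grad
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return w, np.array(losses)


# Synthetic over-parametrized problem (assumed sizes): d > n, labels are
# noiseless, so some w fits every sample exactly (interpolation).
rng = np.random.default_rng(42)
n, d = 50, 200
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ rng.standard_normal(d)

# Conservative step size: 1 / max_i ||x_i||^2 (an assumption of this sketch).
step_size = 1.0 / np.max(np.sum(X ** 2, axis=1))
initial_loss = 0.5 * np.mean(y ** 2)

w1, losses_m1 = minibatch_sgd(X, y, batch_size=1, step_size=step_size, num_steps=5000)
w8, losses_m8 = minibatch_sgd(X, y, batch_size=8, step_size=step_size, num_steps=5000)

print("initial loss:", initial_loss)
print("final loss, batch size 1:", losses_m1[-1])
print("final loss, batch size 8:", losses_m8[-1])
```

Plotting `losses_m1` and `losses_m8` on a logarithmic axis should show a roughly straight downward trend, which is what an exponential convergence bound looks like in practice. Because the labels are noiseless, the stochastic gradients vanish at the interpolating solution, so no step-size decay is needed in this sketch.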

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent