Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
Sharan Vaswani, Francis Bach, Mark Schmidt

TL;DR
This paper shows that, for over-parameterized models that interpolate the training data, constant step-size stochastic gradient descent with Nesterov acceleration matches the convergence rate of the deterministic accelerated method, and it proves an O(1/k^2) mistake bound for a stochastic perceptron.
Contribution
It establishes convergence rates for constant step-size SGD, with and without Nesterov acceleration, under growth conditions that hold for over-parameterized models, and it proves an O(1/k^2) mistake bound for a stochastic perceptron.
Findings
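Under a strong growth condition, constant step-size SGD with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
Under the same condition, SGD finds a first-order stationary point as efficiently as full gradient descent in non-convex settings.
Under interpolation, smooth finite-sum losses satisfy a weaker growth condition, under which constant step-size SGD attains the deterministic convergence rate.
An O(1/k^2) mistake bound is proved for k iterations of a stochastic perceptron using the squared-hinge loss.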
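To make the perceptron finding concrete, here is a minimal sketch of a stochastic perceptron-style learner on linearly separable data, taking the squared-hinge loss as the surrogate. It uses plain constant step-size SGD rather than the accelerated method analysed in the paper, so it does not reproduce the O(1/k^2) bound. The data generator, the margin value, the step size 1/L_max, and all function names are illustrative assumptions, not taken from the paper.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def squared_hinge_loss(X, y, w):
    # Average of max(0, 1 - y_i * <w, x_i>)^2 over the training set.
    return sum(max(0.0, 1.0 - t * dot(x, w)) ** 2 for x, t in zip(X, y)) / len(X)


def make_separable_data(n, d, margin, seed):
    # Hypothetical generator: label points with a random hyperplane and
    # reject points that fall too close to it, so the data are separable.
    rng = random.Random(seed)
    w_star = [rng.gauss(0, 1) for _ in range(d)]
    X, y = [], []
    while len(X) < n:
        x = [rng.gauss(0, 1) for _ in range(d)]
        score = dot(w_star, x)
        if abs(score) >= margin:
            X.append(x)
            y.append(1.0 if score > 0 else -1.0)
    return X, y


def sgd_squared_hinge(X, y, num_steps, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Each per-example loss is 2 * ||x_i||^2 smooth; use step = 1 / L_max.
    step = 1.0 / (2.0 * max(dot(x, x) for x in X))
    w = [0.0] * d
    mistakes = 0
    for _ in range(num_steps):
        i = rng.randrange(n)
        signed_score = y[i] * dot(X[i], w)
        if signed_score <= 0.0:
            mistakes += 1  # the sampled example is currently misclassified
        slack = max(0.0, 1.0 - signed_score)
        # Gradient of slack^2 with respect to w is -2 * slack * y_i * x_i.
        w = [wj + step * 2.0 * slack * y[i] * xj for wj, xj in zip(w, X[i])]
    return w, mistakes


NUM_STEPS = 20000
X, y = make_separable_data(n=40, d=5, margin=0.5, seed=1)
w, mistakes = sgd_squared_hinge(X, y, NUM_STEPS)
final_loss = squared_hinge_loss(X, y, w)
print("online mistakes:", mistakes, "final squared-hinge loss:", final_loss)
```

Because the data are separable, some weight vector achieves zero squared-hinge loss on every example, which is the interpolation setting the summary refers to.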
Abstract
Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions. We also show that this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent in non-convex settings. Under interpolation, we further show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate…
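The interpolation setting in the abstract can be illustrated with a small experiment: constant step-size SGD on an over-parameterized least-squares problem, where more parameters than examples guarantee that a zero-loss solution exists. This is a minimal sketch under assumed settings (Gaussian data, step size 1/L_max with L_max the largest per-example smoothness constant, hypothetical function names). It is not the paper's experimental setup and it omits Nesterov acceleration.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def train_loss(X, y, w):
    # Average of 0.5 * (<w, x_i> - y_i)^2 over the training set.
    return 0.5 * sum((dot(x, w) - t) ** 2 for x, t in zip(X, y)) / len(X)


def constant_step_sgd(X, y, num_steps, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Each per-example loss is ||x_i||^2 smooth; use step = 1 / L_max.
    step = 1.0 / max(dot(x, x) for x in X)
    w = [0.0] * d
    for _ in range(num_steps):
        i = rng.randrange(n)  # sample one example uniformly
        residual = dot(X[i], w) - y[i]
        # Stochastic gradient of 0.5 * residual^2 is residual * x_i.
        w = [wj - step * residual * xj for wj, xj in zip(w, X[i])]
    return w


# Over-parameterized problem: d = 50 parameters, n = 20 examples.
data_rng = random.Random(1)
n, d = 20, 50
X = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w_true = [data_rng.gauss(0, 1) for _ in range(d)]
y = [dot(x, w_true) for x in X]  # labels are exactly realizable

initial_loss = train_loss(X, y, [0.0] * d)
final_loss = train_loss(X, y, constant_step_sgd(X, y, num_steps=20000))
print("initial loss:", initial_loss, "final loss:", final_loss)
```

With a fixed step size and no decay schedule, the training loss keeps shrinking toward zero instead of stalling at a noise floor, because every per-example gradient vanishes at the interpolating solution.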
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
Methods: Stochastic Gradient Descent
