Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry

TL;DR
This paper proves that stochastic gradient descent (SGD) with a fixed learning rate converges to zero loss for homogeneous linear classifiers trained on linearly separable data, and that for exponentially-tailed losses such as the logistic loss the weight vector converges in direction to the max-margin solution.
Contribution
It establishes the first convergence proof of SGD with a fixed learning rate for monotone loss functions on separable data, including logistic loss, and analyzes the effect of minibatch size.
Findings
SGD converges to zero loss with a fixed learning rate on separable data.
The weight vector aligns with the max-margin direction at a rate of O(1/log(t)).
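With a fixed learning rate proportional to the minibatch size, the asymptotic convergence rate of SGD with replacement, measured in epochs, does not depend on the minibatch size, provided the support vectors span the data.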
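The findings above can be illustrated with a minimal sketch, which is not the authors' code. It runs single-sample SGD without replacement and with a fixed learning rate on the logistic loss, using a hypothetical four-point separable dataset chosen so that the max-margin direction is (1, 1)/sqrt(2) by symmetry. The function names, the learning rate of 0.1, and the epoch count are assumptions made for illustration only.

```python
import numpy as np

def logistic_loss(w, X, y):
    # Mean of log(1 + exp(-y * <w, x>)), computed stably via logaddexp.
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)))

def sgd_fixed_lr(X, y, lr=0.1, epochs=2000, seed=0):
    # Plain SGD on a homogeneous linear classifier (no bias term),
    # one sample per step, with a learning rate that never decays.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(epochs):
        for i in rng.permutation(n):  # sampling without replacement
            margin = y[i] * (X[i] @ w)
            # sigmoid(-margin) = 1 / (1 + exp(margin)), written with tanh
            # so that large margins cannot overflow.
            weight = 0.5 * (1.0 - np.tanh(0.5 * margin))
            w += lr * weight * y[i] * X[i]  # step along the negative gradient
        losses.append(logistic_loss(w, X, y))
    return w, losses

# Hypothetical toy data: separable through the origin, and symmetric,
# so the max-margin direction is (1, 1) / sqrt(2).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
max_margin_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)

w, losses = sgd_fixed_lr(X, y)
cosine = float(w @ max_margin_dir / np.linalg.norm(w))
print("loss after first epoch:", losses[0])
print("loss after last epoch: ", losses[-1])
print("cosine to max-margin direction:", cosine)
```

In this sketch the loss keeps shrinking even though the step size stays fixed, while the norm of w grows and its direction approaches the max-margin direction. The run only illustrates the qualitative behavior on one toy problem; it does not verify the stated rates.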
Abstract
Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, for sampling both with and without replacement. Furthermore, for logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the max margin vector as O(1/log(t)) for almost all separable datasets, and the loss converges as O(1/t), similarly to gradient descent. Lastly, we examine the case of a…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent
