Gradient Methods Never Overfit On Separable Data
Ohad Shamir

TL;DR
This paper proves that standard gradient methods for linear models on separable data do not overfit after any finite number of iterations: the risks decrease at an essentially optimal rate until the number of iterations reaches the dataset size, after which the generalization error stays fixed.
Contribution
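It provides the first non-asymptotic analysis showing that gradient flow, gradient descent, and stochastic gradient descent never overfit on separable data, with essentially optimal risk bounds until the number of iterations reaches the dataset size.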
Findings
Empirical risk and generalization error decrease at near-optimal rates.
Generalization error stays fixed at an essentially optimal level once the number of iterations T reaches roughly the dataset size m.
Non-asymptotic bounds on margin violations are established and shown to be tight.
Abstract
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for T iterations on a dataset of size m, both the empirical risk and the generalization error decrease at an essentially optimal rate up till T ≈ m, at which point the generalization error remains fixed at an…
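The setting in the abstract can be sketched numerically. The following is a minimal illustration, not the paper's experiments or proof technique: it runs full-batch gradient descent with the logistic loss (one exponentially-tailed loss) on a synthetic 2D dataset that is separable with margin at least `GAMMA`, and records the empirical risk, training error, a margin-violation count, and held-out error at a few checkpoints. The data generator, the threshold `GAMMA / 2` used to count margin violations, the step size, and all function names are assumptions made for this example.

```python
import math
import random


def make_separable(n, gamma, rng):
    """Sample n points in the unit disc with label sign(x[0]), rejecting points
    closer than gamma to the separator, so w* = (1, 0) has margin >= gamma."""
    data = []
    while len(data) < n:
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if x[0] ** 2 + x[1] ** 2 <= 1.0 and abs(x[0]) >= gamma:
            data.append((x, 1.0 if x[0] > 0 else -1.0))
    return data


def logistic_loss(z):
    """log(1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def neg_loss_slope(z):
    """1 / (1 + exp(z)), i.e. minus the derivative of logistic_loss at z."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def evaluate(w, train, test, gamma):
    """Metrics for the linear predictor x -> sign(w . x)."""
    norm = math.hypot(w[0], w[1])
    margins = [y * (w[0] * x[0] + w[1] * x[1]) for x, y in train]
    test_mistakes = sum(y * (w[0] * x[0] + w[1] * x[1]) <= 0 for x, y in test)
    return {
        "risk": sum(logistic_loss(z) for z in margins) / len(train),
        "train_error": sum(z <= 0 for z in margins) / len(train),
        # Illustrative definition (an assumption of this sketch): a point
        # violates the margin if its normalized margin is below gamma / 2.
        "margin_violations": sum(z < 0.5 * gamma * norm for z in margins),
        "test_error": test_mistakes / len(test),
    }


def run_gradient_descent(train, test, gamma, checkpoints, lr=1.0):
    """Full-batch gradient descent on the average logistic loss, from w = 0.
    With ||x|| <= 1 the objective is 1/4-smooth, so lr = 1.0 is a safe step."""
    w = [0.0, 0.0]
    history = {}
    for t in range(1, max(checkpoints) + 1):
        g0 = g1 = 0.0
        for x, y in train:
            s = y * neg_loss_slope(y * (w[0] * x[0] + w[1] * x[1]))
            g0 -= s * x[0]
            g1 -= s * x[1]
        w = [w[0] - lr * g0 / len(train), w[1] - lr * g1 / len(train)]
        if t in checkpoints:
            history[t] = evaluate(w, train, test, gamma)
    return history


rng = random.Random(0)
GAMMA = 0.3
train = make_separable(20, GAMMA, rng)    # dataset of size m = 20
test = make_separable(2000, GAMMA, rng)   # fresh sample to estimate generalization
CHECKPOINTS = [1, 10, 100, 1000, 10000]   # iteration counts T, well past m
history = run_gradient_descent(train, test, GAMMA, CHECKPOINTS)

for t in CHECKPOINTS:
    h = history[t]
    print(t, round(h["risk"], 4), h["train_error"], h["margin_violations"], h["test_error"])
```

Comparing the printed rows for iteration counts below and far above the training-set size gives a rough feel for the claim: the empirical risk keeps shrinking, and the held-out error can be checked for whether it degrades when training continues long after the data is fit. A toy 2D run like this is only a sanity check and says nothing about the rates proved in the paper.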
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques
