Benign Underfitting of Stochastic Gradient Descent
Tomer Koren, Roi Livni, Yishay Mansour, Uri Sherman

TL;DR
This paper shows that one-pass, without-replacement stochastic gradient descent (SGD) can output solutions whose empirical risk and generalization gap are both Ω(1), challenging the conventional view that SGD generalizes by fitting the training data well.
Contribution
It demonstrates that without-replacement SGD can have large generalization gaps, unlike with-replacement SGD, and provides new bounds for multi-epoch regimes in convex optimization.
Findings
One-pass SGD can have both empirical risk and generalization gap of Ω(1).
With-replacement SGD converges at the optimal rate.
New bounds for multi-epoch regimes improve previous results.
Abstract
We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate O(1/√n), and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of Ω(1). Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show…
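The quantities the abstract refers to (empirical risk, population risk, and the generalization gap between them) can be made concrete with a small sketch. It runs one-pass SGD on a toy one-dimensional convex problem and reports all three. Everything in it is an illustrative assumption: the loss |w - z|, the uniform data distribution, the step size 1/√n, the averaged-iterate output, and the function names. It is not the construction used in the paper, whose Ω(1) result concerns specially built problem instances, so this toy problem should not be expected to show the same effect.

```python
import random


def loss(w, z):
    # Hypothetical toy loss: convex and 1-Lipschitz in w.
    return abs(w - z)


def subgradient(w, z):
    # A subgradient of |w - z| with respect to w.
    if w > z:
        return 1.0
    if w < z:
        return -1.0
    return 0.0


def one_pass_sgd(sample, step_size, radius=1.0):
    # One pass over the sample: each example is used exactly once.
    # Returns the average of the iterates held before each update.
    w = 0.0
    total = 0.0
    for z in sample:
        total += w
        w -= step_size * subgradient(w, z)
        w = max(-radius, min(radius, w))  # project back onto [-radius, radius]
    return total / len(sample)


def risks(n=400, n_fresh=20000, seed=0):
    rng = random.Random(seed)
    train = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # A large fresh sample stands in for the population (a Monte Carlo estimate).
    fresh = [rng.uniform(-1.0, 1.0) for _ in range(n_fresh)]
    w_hat = one_pass_sgd(train, step_size=1.0 / n ** 0.5)
    empirical = sum(loss(w_hat, z) for z in train) / n
    population = sum(loss(w_hat, z) for z in fresh) / n_fresh
    return {
        "w": w_hat,
        "empirical": empirical,
        "population": population,
        "gap": population - empirical,  # generalization gap
    }


if __name__ == "__main__":
    print(risks())
```

The generalization gap here is the estimated population risk minus the empirical risk of the same output point, which is the quantity the abstract says can be Ω(1) on suitable instances even when the population risk itself is small.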
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods
Methods: Stochastic Gradient Descent
