Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

Shira Vansover-Hager; Tomer Koren; Roi Livni

arXiv:2505.08306 · cs.LG · May 16, 2025

TL;DR

This paper shows that multi-pass stochastic gradient descent (SGD) can overfit in stochastic convex optimization: out-of-sample behavior undergoes a sharp phase transition after the first epoch, and the paper bounds the population loss and generalization gap over the subsequent epochs.

Contribution

The paper provides the first detailed analysis of overfitting in multi-pass SGD for non-smooth stochastic convex optimization, including bounds on the population loss and a phase transition after the first pass.

Findings

01. Multiple passes of SGD can significantly worsen out-of-sample performance.

02. A sharp phase transition occurs after the first epoch in SGD's out-of-sample behavior.

03. Theoretical bounds on the generalization gap for multi-pass SGD are established.
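To make the object of study concrete, the following is a minimal sketch of multi-pass SGD on a non-smooth convex loss. It is not the paper's lower-bound construction and is not expected to reproduce the overfitting result on its own. The absolute loss, the unit-ball projection, the fresh shuffle per epoch, the averaged output, and the names `multipass_sgd` and `mean_abs_loss` are all illustrative assumptions.

```python
import numpy as np

def multipass_sgd(X, y, eta, epochs, seed=0):
    """Projected SGD on the absolute loss |<w, x> - y| over the unit ball.

    Illustrative assumptions: absolute loss, unit-ball domain, a fresh
    shuffle per epoch, and the running average of iterates as output.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    steps = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            residual = X[i] @ w - y[i]
            g = np.sign(residual) * X[i]  # a subgradient of the absolute loss
            w = w - eta * g
            norm = np.linalg.norm(w)
            if norm > 1.0:                # project back onto the unit ball
                w = w / norm
            steps += 1
            avg += (w - avg) / steps      # running average of the iterates
    return avg

def mean_abs_loss(w, X, y):
    """Average absolute loss of w on the sample (X, y)."""
    return float(np.mean(np.abs(X @ w - y)))
```

A typical use would be to draw a training sample and a larger held-out sample from the same distribution, run `multipass_sgd` with `epochs=1` and then with more epochs at the same `eta`, and compare `mean_abs_loss` on both samples. On benign synthetic data the gap may stay small; the paper's result concerns worst-case problem instances.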

Abstract

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is…
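The arithmetic behind the $\Omega(1)$ claim can be checked numerically. The sketch below assumes the second-pass rate has the form $1/(\eta T) + \eta\sqrt{T}$ with all constants set to 1, and treats $T$ as the step count after a given number of full passes over $n$ samples. The function name `second_pass_rate` and the choice $n = 10{,}000$ are illustrative. The balancing step size $\eta = T^{-3/4}$ comes from elementary calculus on this expression and is not a statement quoted from the paper.

```python
import math

def second_pass_rate(eta, T):
    """Order-level expression 1/(eta*T) + eta*sqrt(T), constants ignored."""
    return 1.0 / (eta * T) + eta * math.sqrt(T)

n = 10_000                        # illustrative sample size
eta = 1.0 / math.sqrt(n)          # step size tuned for a single pass
one_pass_rate = 1.0 / math.sqrt(n)

for epochs in (2, 3, 5):
    T = epochs * n                # total steps after `epochs` full passes
    print(f"epochs={epochs}  one-pass rate={one_pass_rate:.4f}  "
          f"second-pass expression={second_pass_rate(eta, T):.4f}")

# Balancing the two terms: setting the derivative in eta to zero gives
# eta = T**(-3/4), at which the expression equals 2 * T**(-1/4).
T = 2 * n
balanced = second_pass_rate(T ** -0.75, T)
print(f"balanced value at T={T}: {balanced:.4f}")
```

With $\eta = 1/\sqrt{n}$ and $T = 2n$, the term $\eta\sqrt{T}$ equals $\sqrt{2}$ regardless of $n$, which is why the expression stays bounded away from zero while the one-pass rate $1/\sqrt{n}$ vanishes.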

Taxonomy

Methods: Stochastic Gradient Descent