Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham M. Kakade

TL;DR
This paper provides a detailed, instance-dependent analysis of multi-pass SGD's generalization performance for least squares in the interpolation regime, highlighting its advantages and limitations compared to GD.
Contribution
It develops a sharp, instance-dependent excess risk bound for multi-pass SGD, connecting it to GD and analyzing its efficiency and generalization behavior.
Findings
SGD's excess risk decomposes exactly into GD's excess risk plus a positive fluctuation error.
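Because the fluctuation error is positive, SGD's excess risk is larger than GD's on every problem instance, given the same stepsize and number of iterations.
To reach the same excess risk, SGD needs more iterations than GD but fewer stochastic gradient evaluations, which saves computation time.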
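The comparison in these findings can be illustrated with a small simulation. The sketch below is not the paper's experiment: the problem sizes, the polynomially decaying spectrum, the noise level, the stepsize, the with-replacement sampling rule, the 1/2 factor in the risk, and all function names are assumptions chosen for illustration. It runs full-batch GD and multi-pass SGD on an overparameterized least-squares problem (d > n, so the training data can be interpolated) and reports the excess risk of each along with the number of per-example gradient evaluations used.

```python
import numpy as np


def make_problem(n=50, d=200, noise_std=0.1, seed=0):
    """Synthetic least-squares instance with d > n (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Assumed spectrum of the data covariance H = diag(eigvals).
    eigvals = 1.0 / np.arange(1, d + 1) ** 2
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)
    w_star = rng.standard_normal(d)
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y, w_star, eigvals


def excess_risk(w, w_star, eigvals):
    """0.5 * (w - w*)^T H (w - w*) for diagonal H (the 1/2 is a convention)."""
    diff = w - w_star
    return 0.5 * float(np.sum(eigvals * diff ** 2))


def run_gd(X, y, eta, T):
    """Full-batch gradient descent on the empirical squared loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        w -= eta * X.T @ (X @ w - y) / n
    return w


def run_multipass_sgd(X, y, eta, T, seed=0):
    """SGD that revisits the training set, one uniformly sampled example per step.

    Sampling with replacement is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)
        w -= eta * (X[i] @ w - y[i]) * X[i]
    return w


if __name__ == "__main__":
    X, y, w_star, eigvals = make_problem()
    n = X.shape[0]
    eta, T = 0.1, 2000
    w_gd = run_gd(X, y, eta, T)
    # Average over several SGD runs, since a single run is noisy.
    sgd_risks = [
        excess_risk(run_multipass_sgd(X, y, eta, T, seed=s), w_star, eigvals)
        for s in range(20)
    ]
    print("GD excess risk: ", excess_risk(w_gd, w_star, eigvals))
    print("SGD excess risk (mean over runs):", np.mean(sgd_risks))
    # GD touches all n examples per iteration, SGD only one.
    print("gradient evaluations: GD =", n * T, " SGD =", T)
```

With the same stepsize and iteration count, GD uses n times as many per-example gradient evaluations as SGD here, which is the cost side of the trade-off described above. The risk comparison should be read in expectation over the SGD sampling, so individual runs can vary.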
Abstract
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Neural Networks and Applications
Methods: Stochastic Gradient Descent
