Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds

Xufeng Cai; Cheuk Yin Lin; Jelena Diakonikolas

arXiv:2306.12498 · math.OC · February 8, 2024 · 1 citation

TL;DR

This paper refines the theory of shuffled SGD for empirical risk minimization. Analyzed through a primal-dual lens, the method admits faster convergence guarantees than earlier bounds suggested, which brings the theory closer to what is observed in practice.

Contribution

The authors introduce a primal-dual perspective for analyzing shuffled SGD, deriving sharper convergence bounds that better match empirical results and extend to nonsmooth and general finite-sum problems.

Findings

01. Bounds predict faster convergence, often by a factor of √n.

02. Empirical results confirm the tighter bounds on real datasets.

03. Analysis extends to nonsmooth convex and broader finite-sum problems.

Abstract

Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning. Contrary to the empirical practice of sampling from the datasets without replacement and with (possible) reshuffling at each epoch, the theoretical counterpart of SGD usually relies on the assumption of sampling with replacement. It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed. For convex finite sum problems with $n$ components and under the $L$-smoothness assumption for each component function, there are matching upper and lower bounds, under sufficiently small -- $O\left(\frac{1}{nL}\right)$ -- step sizes. Yet those bounds appear too pessimistic -- in fact, the predicted performance is generally no better than for full gradient descent -- and do not agree with the empirical observations. In this work, to narrow the gap…
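To make the distinction in the abstract concrete, the following is a minimal sketch contrasting shuffled SGD (a fresh permutation each epoch, every component used exactly once) with SGD that samples with replacement. It is not the authors' code or experimental setup. The least-squares objective, the synthetic data, and the names `shuffled_sgd`, `with_replacement_sgd`, `component_grad`, `objective`, and `make_problem` are all illustrative assumptions. The step size follows the $O\left(\frac{1}{nL}\right)$ scale mentioned in the abstract, with the constant set to 1 as a further assumption.

```python
import numpy as np


def component_grad(A, b, x, i):
    # Gradient of the i-th component f_i(x) = 0.5 * (a_i @ x - b_i)**2
    # (a least-squares component, chosen here only for illustration).
    return (A[i] @ x - b[i]) * A[i]


def shuffled_sgd(A, b, step, epochs, seed=0):
    # Sampling without replacement: reshuffle the indices at every epoch
    # and take one step per component, in the shuffled order.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = x - step * component_grad(A, b, x, i)
    return x


def with_replacement_sgd(A, b, step, epochs, seed=0):
    # Sampling with replacement: each step draws an index uniformly at
    # random, so a component may repeat or be skipped within an "epoch".
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs * n):
        i = rng.integers(n)
        x = x - step * component_grad(A, b, x, i)
    return x


def objective(A, b, x):
    # Finite-sum objective f(x) = (1/n) * sum_i f_i(x).
    return 0.5 * np.mean((A @ x - b) ** 2)


def make_problem(n=50, d=5, seed=0):
    # Hypothetical synthetic, noiseless regression problem.
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d)
    return A, b


A, b = make_problem()
n = A.shape[0]
# Each f_i is L_i-smooth with L_i = ||a_i||^2; take L as the largest one.
L = np.max(np.sum(A ** 2, axis=1))
step = 1.0 / (n * L)  # assumed constant 1 in the O(1/(nL)) step size

x_shuffled = shuffled_sgd(A, b, step, epochs=50)
x_replace = with_replacement_sgd(A, b, step, epochs=50)
print("shuffled SGD objective:        ", objective(A, b, x_shuffled))
print("with-replacement SGD objective:", objective(A, b, x_replace))
```

Both routines perform the same number of component-gradient evaluations, so any difference in the final objective comes only from the sampling scheme. A single run on a toy problem like this one illustrates the mechanics and should not be read as evidence for or against the paper's bounds.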

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: Stochastic Gradient Descent · Focus