Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds
Xufeng Cai, Cheuk Yin Lin, Jelena Diakonikolas

TL;DR
This paper refines the theoretical understanding of shuffled SGD for empirical risk minimization. Analyzed through a primal-dual lens, shuffled SGD converges faster than previous bounds suggested, which brings the theory closer to practice.
Contribution
The authors introduce a primal-dual perspective for analyzing shuffled SGD, deriving sharper convergence bounds that better match empirical results and extend to nonsmooth and general finite-sum problems.
Findings
The new bounds predict faster convergence than prior ones, often by a factor of √n.
Empirical results confirm the tighter bounds on real datasets.
Analysis extends to nonsmooth convex and broader finite-sum problems.
Abstract
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning. Contrary to the empirical practice of sampling from the datasets without replacement and with (possible) reshuffling at each epoch, the theoretical counterpart of SGD usually relies on the assumption of sampling with replacement. It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed. For convex finite sum problems with n components and under the L-smoothness assumption for each component function, there are matching upper and lower bounds, under sufficiently small -- O(1/(nL)) -- step sizes. Yet those bounds appear too pessimistic -- in fact, the predicted performance is generally no better than for full gradient descent -- and do not agree with the empirical observations. In this work, to narrow the gap…
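To make the sampling scheme concrete, here is a minimal sketch of shuffled SGD (a fresh permutation each epoch, every component used exactly once per epoch). The function name `shuffled_sgd`, the toy quadratic components, and the constant step size 1/(nL) are illustrative assumptions, not the paper's code or experimental setup; n denotes the number of components and L the smoothness constant of each component.

```python
import random

def shuffled_sgd(grads, x0, step, epochs, seed=0):
    """SGD sampling without replacement, reshuffling at every epoch.

    grads: list of callables, grads[i](x) returns the gradient of component f_i at x.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    x = x0
    n = len(grads)
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)            # fresh random permutation for this epoch
        for i in order:               # each component is visited exactly once
            x -= step * grads[i](x)
    return x

# Toy finite sum (an assumption): f_i(x) = 0.5 * (x - a_i)^2.
# Each f_i is 1-smooth (L = 1); the average is minimized at mean(a) = 2.5.
a = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, ai=ai: x - ai for ai in a]
n, L = len(a), 1.0

x_final = shuffled_sgd(grads, x0=0.0, step=1.0 / (n * L), epochs=200)
print(x_final)
```

With a constant step size the iterate does not settle exactly at the minimizer; it ends in a neighborhood of it whose size shrinks with the step size. Replacing `rng.shuffle(order)` with n independent draws of `rng.randrange(n)` would give the with-replacement variant that classical analyses assume.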
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
Methods: Stochastic Gradient Descent · Focus
