Random Reshuffling with Variance Reduction: New Analysis and Better Rates
Grigory Malinovsky, Alibek Sailanbayev, Peter Richtárik

TL;DR
This paper provides new theoretical analysis and improved convergence rates for variance-reduced stochastic gradient methods under random reshuffling, including SVRG variants, in both strongly-convex and convex settings.
Contribution
It introduces the first analysis of RR-SVRG with improved convergence rates and extends results to cyclic and shuffle-once variants, along with a generalized variance reduction scheme.
Findings
RR-SVRG converges linearly with rate O(κ^{3/2}) in the strongly-convex case, where κ is the condition number
The rate improves to O(κ) in the big data regime (n > O(κ))
First sublinear rate for RR-SVRG established for general convex problems
Abstract
Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. One of the tricks that works so well in practice that it is used as default in virtually all widely used machine learning software is random reshuffling (RR). However, the practical benefits of RR have until very recently been eluding attempts at being satisfactorily explained using theory. Motivated by recent development due to Mishchenko, Khaled and Richtárik (2020), in this work we provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems. First, we show that RR-SVRG converges linearly with the rate O(κ^{3/2}) in the strongly-convex case, and can be improved further to O(κ) in the big data…
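The abstract names RR-SVRG without spelling out its update rule. As a rough illustration only, the sketch below assumes the standard SVRG gradient estimator combined with one pass over a fresh random permutation per epoch, with the reference point refreshed at the start of each epoch. The function name `rr_svrg`, the ridge-regression objective, and all parameter names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def rr_svrg(A, b, lam, step, epochs, seed=0):
    """Illustrative SVRG with random reshuffling on ridge regression.

    Assumed objective (not from the paper):
        f(x) = (1/n) * sum_i f_i(x),
        f_i(x) = 0.5 * (a_i @ x - b_i)**2 + 0.5 * lam * ||x||^2
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        # Refresh the reference point and its full gradient once per epoch.
        y = x.copy()
        full_grad = A.T @ (A @ y - b) / n + lam * y
        # Random reshuffling: visit every index exactly once, in random order.
        for i in rng.permutation(n):
            grad_x = A[i] * (A[i] @ x - b[i]) + lam * x
            grad_y = A[i] * (A[i] @ y - b[i]) + lam * y
            # Variance-reduced step: the correction term vanishes as x, y -> x*.
            x = x - step * (grad_x - grad_y + full_grad)
    return x
```

Replacing `rng.permutation(n)` with a single permutation drawn once before the outer loop, or with `range(n)`, would give sketches in the spirit of the shuffle-once and cyclic variants mentioned in the Contribution section, under the same assumptions.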
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data
Methods: Stochastic Gradient Descent
