Variance Reduction in Deep Learning: More Momentum is All You Need
Lionel Tondji, Sergii Kashubin, Moustapha Cisse

TL;DR
This paper introduces a multi-momentum variance reduction technique for deep learning that converges faster than vanilla optimizers on CIFAR and ImageNet, is robust to label noise, and scales to distributed training.
Contribution
The paper proposes a family of scalable variance-reduced optimization procedures that combine existing optimizers (e.g., SGD+Momentum) with a multi-momentum strategy, exploiting the clustering structure of datasets used in deep learning.
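To make the idea of keeping several momentum terms tied to clusters more concrete, here is a minimal NumPy sketch. It is not the paper's algorithm, whose exact update rule is not given on this page. It assumes that cluster labels are known in advance, that clusters are equally sized, and that one exponential moving average of gradients is kept per cluster and the buffers are averaged to form the step. The names `multi_momentum_sgd`, `grad_fn`, and `cluster_ids`, and the synthetic least-squares problem, are all hypothetical.

```python
import numpy as np


def multi_momentum_sgd(grad_fn, w0, cluster_ids, lr=0.05, beta=0.9,
                       steps=2000, seed=0):
    """Toy SGD variant with one momentum buffer per data cluster.

    Illustrative assumption only, not the update rule from the paper.
    """
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    n_clusters = int(cluster_ids.max()) + 1
    w = np.asarray(w0, dtype=float).copy()
    # One moving average of past gradients per cluster.
    buffers = np.zeros((n_clusters, w.size))
    for _ in range(steps):
        i = rng.integers(len(cluster_ids))   # sample one example
        c = cluster_ids[i]                   # cluster of that example
        # Refresh only the buffer of the sampled example's cluster.
        buffers[c] = beta * buffers[c] + (1.0 - beta) * grad_fn(w, i)
        # Step along the average of all buffers; the unweighted mean
        # assumes equally sized clusters.
        w -= lr * buffers.mean(axis=0)
    return w


# Synthetic, noiseless least-squares problem with two clusters (assumed).
rng = np.random.default_rng(1)
n, d = 200, 5
cluster_ids = np.arange(n) % 2
centers = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                    [-1.0, 0.0, 0.0, 0.0, 0.0]])
X = centers[cluster_ids] + rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true


def grad_fn(w, i):
    # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
    return (X[i] @ w - y[i]) * X[i]


def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)


w0 = np.zeros(d)
w_hat = multi_momentum_sgd(grad_fn, w0, cluster_ids)
initial_loss, final_loss = loss(w0), loss(w_hat)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The design choice in this sketch is that each step uses gradient information from every cluster, including those not sampled at the current step, which is one plausible way a clustering structure could reduce the variance of the update direction.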
Findings
Faster convergence than vanilla methods on CIFAR and ImageNet
Robust to label noise in training
Suitable for distributed optimization environments
Abstract
Variance reduction (VR) techniques have contributed significantly to accelerating learning with massive datasets in the smooth and strongly convex setting (Schmidt et al., 2017; Johnson & Zhang, 2013; Roux et al., 2012). However, such techniques have not yet met the same success in the realm of large-scale deep learning due to various factors such as the use of data augmentation or regularization methods like dropout (Defazio & Bottou, 2019). This challenge has recently motivated the design of novel variance reduction techniques tailored explicitly for deep learning (Arnold et al., 2019; Ma & Yarats, 2018). This work is an additional step in this direction. In particular, we exploit the ubiquitous clustering structure of rich datasets used in deep learning to design a family of scalable variance reduced optimization procedures by combining existing optimizers (e.g., SGD+Momentum, Quasi…
Taxonomy
Topics: Advanced Neural Network Applications · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Dropout
