Gradient Diversity: a Key Ingredient for Scalable Distributed Learning
Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

TL;DR
This paper introduces the concept of gradient diversity, shows that it determines how well distributed mini-batch SGD scales without sacrificing generalization, and provides theoretical and experimental evidence for its impact.
Contribution
It defines gradient diversity, proves its influence on SGD speedup and generalization, and analyzes how techniques like dropout can enhance it.
Findings
High gradient diversity enables better speedups in distributed SGD.
Lack of gradient diversity slows convergence once the batch size grows beyond a certain point.
Heuristics like dropout can improve gradient diversity.
Abstract
It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity that measures the dissimilarity between concurrent gradient updates, and show its key role in the performance of mini-batch SGD. We prove that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one sample) SGD. We further establish lower bounds on convergence where mini-batch SGD slows down beyond a particular batch-size, solely due to the lack of gradient diversity. We…
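The abstract describes gradient diversity only in words. As a rough illustration, the sketch below computes the quantity as it is defined in the paper: the sum of squared norms of the per-example gradients divided by the squared norm of their sum, all evaluated at the same parameter vector. The function name `gradient_diversity`, the `(n, d)` array layout, and the toy inputs are assumptions made for this example and do not come from the authors' code.

```python
import numpy as np

def gradient_diversity(grads):
    """Ratio sum_i ||g_i||^2 / ||sum_i g_i||^2 for per-example gradients.

    grads: array-like of shape (n, d), one gradient per row, all taken
    at the same parameter vector (layout is an assumption of this sketch).
    """
    grads = np.asarray(grads, dtype=float)
    sum_sq_norms = np.sum(grads ** 2)              # sum of squared norms
    sq_norm_of_sum = np.sum(grads.sum(axis=0) ** 2)  # squared norm of the sum
    if sq_norm_of_sum == 0.0:
        return float("inf")  # gradients cancel exactly
    return sum_sq_norms / sq_norm_of_sum

# Toy inputs (hypothetical): identical vs. mutually orthogonal gradients.
identical = np.tile([1.0, 2.0], (4, 1))  # 4 copies of the same gradient
orthogonal = np.eye(4)                   # 4 orthogonal unit gradients

div_identical = gradient_diversity(identical)    # 0.25, i.e. 1/n
div_orthogonal = gradient_diversity(orthogonal)  # 1.0

# n times the diversity is the batch-size scale used in the paper's analysis.
bound_identical = len(identical) * div_identical     # 1.0
bound_orthogonal = len(orthogonal) * div_orthogonal  # 4.0
```

With identical gradients the ratio falls to 1/n, so n times the ratio is 1 and batching offers nothing beyond a single sample. With mutually orthogonal gradients the ratio is 1 and the same product equals n, the regime in which larger batches remain useful.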
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Privacy-Preserving Technologies in Data
Methods: Stochastic Gradient Descent
