Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

Dong Yin; Ashwin Pananjady; Max Lam; Dimitris Papailiopoulos; Kannan Ramchandran; Peter Bartlett

arXiv:1706.05699 · cs.LG · January 9, 2018 · 20 cites

Open Access

TL;DR

This paper introduces gradient diversity, a measure of the dissimilarity between concurrently processed gradients, and gives theoretical and experimental evidence that it governs how far distributed mini-batch SGD can scale without sacrificing generalization.

Contribution

It defines gradient diversity, proves its influence on SGD speedup and generalization, and analyzes how techniques like dropout can enhance it.

Findings

01. High gradient diversity enables better speedups in distributed SGD.
02. Lack of gradient diversity causes convergence slowdown beyond certain batch sizes.
03. Heuristics like dropout can improve gradient diversity.

Abstract

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity that measures the dissimilarity between concurrent gradient updates, and show its key role in the performance of mini-batch SGD. We prove that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one sample) SGD. We further establish lower bounds on convergence where mini-batch SGD slows down beyond a particular batch-size, solely due to the lack of gradient diversity. We…
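The abstract describes gradient diversity only informally, as a measure of dissimilarity between concurrent gradient updates. The sketch below is one way to make that concrete. It assumes the measure is the ratio of the sum of squared per-example gradient norms to the squared norm of their sum, a formula that is not stated in the text above. The function name `gradient_diversity` and the toy gradients are hypothetical and exist only for illustration.

```python
from typing import Sequence


def gradient_diversity(grads: Sequence[Sequence[float]]) -> float:
    """Assumed formalization: sum_i ||g_i||^2 / ||sum_i g_i||^2.

    `grads` holds one gradient vector per example, all of equal length.
    """
    if not grads:
        raise ValueError("need at least one gradient")
    dim = len(grads[0])
    # Numerator: total squared length of the individual gradients.
    sum_sq_norms = sum(sum(x * x for x in g) for g in grads)
    # Denominator: squared length of the summed gradient.
    total = [sum(g[k] for g in grads) for k in range(dim)]
    sq_norm_of_sum = sum(t * t for t in total)
    if sq_norm_of_sum == 0.0:
        # Gradients cancel exactly; the ratio is unbounded.
        return float("inf")
    return sum_sq_norms / sq_norm_of_sum


# Toy case 1: four identical gradients (no diversity).
identical = [[1.0, 0.0]] * 4

# Toy case 2: four mutually orthogonal unit gradients (high diversity).
orthogonal = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

low = gradient_diversity(identical)
high = gradient_diversity(orthogonal)
print(low, high)
```

Under this assumed definition, identical gradients give the smallest possible ratio, 1/n for n examples, since averaging them adds no information beyond a single sample. Orthogonal gradients of equal norm give a ratio of 1, which matches the intuition that a batch of dissimilar gradients carries more useful signal per step.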

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Privacy-Preserving Technologies in Data

Methods: Stochastic Gradient Descent