O(1) Communication for Distributed SGD through Two-Level Gradient Averaging

Subhadeep Bhattacharya; Weikuan Yu; Fahim Tahmid Chowdhury

arXiv:2006.07405·cs.LG·June 17, 2020

TL;DR

This paper introduces A2SGD, a novel distributed SGD method that reduces communication complexity to O(1) per worker by using two-level gradient averaging, significantly speeding up training.

Contribution

A2SGD is the first method to achieve O(1) communication complexity per worker in distributed SGD, combining gradient consolidation with error retention for fast convergence.

Findings

01 A2SGD converges similarly to standard distributed SGD.

02 A2SGD reduces communication traffic significantly.

03 Training time improves by up to 23.2x compared to existing methods.

Abstract

Large neural network models present a hefty communication challenge to distributed Stochastic Gradient Descent (SGD), with a communication complexity of O(n) per worker for a model of n parameters. Many sparsification and quantization techniques have been proposed to compress the gradients, some reducing the communication complexity to O(k), where k << n. In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) that consolidates all gradients down to merely two local averages per worker before computing two global averages for an updated model. A2SGD also retains local errors to maintain the variance for fast convergence. Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. Our evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces the communication traffic per…
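
The idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes that the "two local averages" are the mean of the non-negative gradient entries and the mean of the negative entries, and that the retained "local error" is each entry's distance from its group's local average; the abstract does not spell out either detail. The function names `local_two_averages` and `a2sgd_step` are hypothetical, and a plain Python average stands in for the allreduce that a real cluster would use.

```python
from typing import List, Tuple


def local_two_averages(grad: List[float]) -> Tuple[float, float, List[float]]:
    """Reduce one worker's gradient to two scalars plus a local error vector.

    Assumption: the two averages are taken over the non-negative and the
    negative entries, respectively.
    """
    pos = [g for g in grad if g >= 0]
    neg = [g for g in grad if g < 0]
    mu_pos = sum(pos) / len(pos) if pos else 0.0
    mu_neg = sum(neg) / len(neg) if neg else 0.0
    # Local error: how far each entry is from its own group's local average.
    err = [g - (mu_pos if g >= 0 else mu_neg) for g in grad]
    return mu_pos, mu_neg, err


def a2sgd_step(worker_grads: List[List[float]]):
    """Simulate one round: two scalars per worker are 'communicated'."""
    stats = [local_two_averages(g) for g in worker_grads]
    # Stand-in for an allreduce: average the two scalars across workers.
    g_pos = sum(s[0] for s in stats) / len(stats)
    g_neg = sum(s[1] for s in stats) / len(stats)
    # Each worker rebuilds a full gradient from the global averages
    # plus the error it kept locally (never sent over the network).
    rebuilt = []
    for grad, (_, _, err) in zip(worker_grads, stats):
        rebuilt.append(
            [(g_pos if g >= 0 else g_neg) + e for g, e in zip(grad, err)]
        )
    return g_pos, g_neg, rebuilt


if __name__ == "__main__":
    grads = [[1.0, 3.0, -2.0], [2.0, -4.0, -6.0]]
    print(a2sgd_step(grads))
```

In this sketch each worker exchanges exactly two numbers regardless of the gradient length n, which is where the O(1) per-worker cost comes from. With a single worker the rebuilt gradient equals the original, since the global and local averages coincide.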

Taxonomy

Methods: Stochastic Gradient Descent