O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
Subhadeep Bhattacharya, Weikuan Yu, Fahim Tahmid Chowdhury

TL;DR
This paper introduces A2SGD, a novel distributed SGD method that reduces communication complexity to O(1) per worker by using two-level gradient averaging, significantly speeding up training.
Contribution
A2SGD is the first method to achieve O(1) communication complexity per worker in distributed SGD, combining gradient consolidation with error retention for fast convergence.
Findings
A2SGD converges similarly to standard distributed SGD.
A2SGD reduces communication traffic significantly.
Training time improves by up to 23.2x compared to existing methods.
Abstract
Large neural network models present a hefty communication challenge to distributed Stochastic Gradient Descent (SGD), with a communication complexity of O(n) per worker for a model of n parameters. Many sparsification and quantization techniques have been proposed to compress the gradients, some reducing the communication complexity to O(k), where k << n. In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) to consolidate all gradients down to merely two local averages per worker before the computation of two global averages for an updated model. A2SGD also retains local errors to maintain the variance for fast convergence. Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. Our evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces the communication traffic per…
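The idea in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the summary does not say how the two local averages are defined, so the code assumes they are the mean of the non-negative gradient entries and the mean of the negative entries. It also assumes the retained local error is added back in the same step; the paper may carry it differently. The function names `local_two_level_average` and `a2sgd_step` are hypothetical.

```python
import numpy as np

def local_two_level_average(grad):
    """Reduce one worker's gradient to two scalars plus a local error vector.

    Assumption: the two averages are taken over non-negative and negative entries.
    """
    pos = grad >= 0
    neg = ~pos
    mu_pos = grad[pos].mean() if pos.any() else 0.0
    mu_neg = grad[neg].mean() if neg.any() else 0.0
    # Replace every entry by the average of its sign group.
    approx = np.where(pos, mu_pos, mu_neg)
    # The part lost by that replacement stays on the worker.
    error = grad - approx
    return mu_pos, mu_neg, error, pos

def a2sgd_step(grads):
    """Simulate one round over a list of per-worker gradient vectors."""
    stats = [local_two_level_average(g) for g in grads]
    # Only two scalars per worker are averaged globally (O(1) traffic per worker);
    # a real system would use an all-reduce here instead of a Python list.
    g_pos = np.mean([s[0] for s in stats])
    g_neg = np.mean([s[1] for s in stats])
    # Each worker rebuilds its update from the two global averages
    # plus its own retained error (assumed to be applied immediately).
    return [np.where(pos, g_pos, g_neg) + err for (_, _, err, pos) in stats]
```

With a single worker the global averages equal the local ones, so the reconstructed update matches the original gradient exactly. With several workers, each update keeps the worker's local variation around the two shared averages.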
Taxonomy
Methods: Stochastic Gradient Descent
