Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

Saeed Rashidi; Matthew Denton; Srinivas Sridharan; Sudarshan Srinivasan; Amoghavarsha Suresh; Jade Ni; Tushar Krishna

arXiv:2007.00156·cs.AR·May 5, 2022

TL;DR

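This paper introduces ACE (Accelerator Collectives Engine), a collective communication accelerator that sits at the accelerator endpoint. By freeing the endpoint's compute and memory resources for DL compute, ACE lets compute and communication overlap while lowering memory bandwidth demand and shortening training iterations.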

Contribution

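Using real system measurements and detailed modeling, the work characterizes the compute and memory bandwidth demands of DL compute and communication. It then proposes ACE, a collective communication engine placed alongside the compute and networking engines at the accelerator endpoint, which improves network bandwidth utilization and accelerates training.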

Findings

01. ACE reduces memory bandwidth requirements by 3.5X.

02. ACE improves network bandwidth utilization by up to 2.67X.

03. ACE accelerates training iteration times by up to 1.51X.

Abstract

Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the…
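
The tension the abstract describes can be made concrete with a toy timing model. The sketch below is an illustration under assumed numbers, not the paper's analytical model or its measurements: the function `iteration_time`, the parameter `compute_slowdown`, and all timing values are hypothetical. It compares three cases: communication serialized after compute, communication overlapped while sharing the accelerator's compute and memory (so compute slows down), and communication overlapped after being offloaded to a separate engine (so compute runs at full speed).

```python
def iteration_time(compute_ms, comm_ms, overlap, compute_slowdown=1.0):
    """Toy per-iteration time model (illustrative only, not from the paper).

    compute_ms:       DL compute time when it has the accelerator to itself
    comm_ms:          collective communication time at full network bandwidth
    overlap:          True if communication runs concurrently with compute
    compute_slowdown: assumed factor (>= 1.0) by which compute stretches when
                      communication shares cores and memory bandwidth
    """
    if compute_slowdown < 1.0:
        raise ValueError("compute_slowdown must be >= 1.0")
    if not overlap:
        # No overlap: communication starts only after compute finishes.
        return compute_ms + comm_ms
    # Overlap: the iteration ends when the slower of the two finishes.
    return max(compute_ms * compute_slowdown, comm_ms)


# Hypothetical numbers chosen only to show the shape of the trade-off.
serialized = iteration_time(10.0, 6.0, overlap=False)
shared = iteration_time(10.0, 6.0, overlap=True, compute_slowdown=1.25)
offloaded = iteration_time(10.0, 6.0, overlap=True, compute_slowdown=1.0)

print(f"serialized: {serialized} ms")
print(f"overlap, shared resources: {shared} ms")
print(f"overlap, offloaded comms: {offloaded} ms")
```

In this simplified picture, overlap alone helps, but contention for the endpoint's resources claws back part of the gain; removing that contention is the role the abstract assigns to ACE.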
