CGX: Adaptive System Support for Communication-Efficient Deep Learning

Ilia Markov; Hamidreza Ramezanikebrya; Dan Alistarh

arXiv:2111.08617 · cs.DC · January 2, 2023

TL;DR

CGX introduces a system that enables efficient compressed communication for deep learning training, reducing hardware costs and improving scalability without requiring major code changes.

Contribution

The paper presents a framework that combines a redesigned system-level communication stack with adaptive compression techniques for scalable, cost-effective deep learning training.

Findings

01 Up to 3X speedup on multi-GPU nodes with commodity hardware

02 Order-of-magnitude improvement in multi-node training

03 Negligible impact on model accuracy

Abstract

The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order of magnitude price difference between "cloud-grade" servers with such support, relative to their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU…
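As an illustration of what "compressed communication" means in data-parallel training, here is a minimal sketch of stochastic uniform quantization of a gradient before it is exchanged between workers. It is a generic textbook-style scheme, not CGX's implementation: the function names (quantize, dequantize, compressed_allreduce_mean), the min-max scaling, and the in-process "all-reduce" are all assumptions made for this example.

```python
import random

def quantize(grad, bits=4, rng=None):
    """Map each float to one of 2**bits uniform levels (hypothetical helper).

    Rounding is stochastic: a value is rounded up with probability equal
    to its fractional position between two levels, so the result is
    unbiased in expectation.
    """
    if rng is None:
        rng = random.Random(0)
    lo, hi = min(grad), max(grad)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = []
    for g in grad:
        x = (g - lo) / scale              # position in [0, levels]
        floor = int(x)
        code = floor + (1 if rng.random() < x - floor else 0)
        codes.append(min(code, levels))   # guard against float overshoot
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]

def compressed_allreduce_mean(worker_grads, bits=4):
    """Simulate gradient averaging where every worker sends quantized data.

    A real system would exchange the integer codes over the network;
    here the exchange is simulated inside one process.
    """
    rng = random.Random(0)
    n = len(worker_grads)
    decoded = [dequantize(*quantize(g, bits, rng)) for g in worker_grads]
    return [sum(col) / n for col in zip(*decoded)]
```

With bits=4, each 32-bit float becomes a 4-bit code, so the payload shrinks by roughly 8x (ignoring the two scalars lo and scale sent alongside), at the cost of a per-element error of at most one quantization step.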

Code & Models

Repositories

ist-daslab/torch_cgx (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques