Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

Abhinav Jangda; Jun Huang; Guodong Liu; Amir Hossein Nodehi Sabet; Saeed Maleki; Youshan Miao; Madanlal Musuvathi; Todd Mytkowicz; Olli Sarikivi

arXiv:2105.05720·cs.DC·March 29, 2022

TL;DR

This paper introduces CoCoNeT, a high-level framework that unifies computation and communication in distributed machine learning, enabling advanced optimizations and significantly improving performance in training large models.

Contribution

CoCoNeT provides a novel DSL and compiler that treat computation and communication as first-class citizens, facilitating cross-layer optimizations in distributed ML workloads.

Findings

01. CoCoNeT outperforms existing distributed ML implementations.

02. Enables optimization of data, model, and pipeline parallelism.

03. Achieves significant performance improvements with minimal code changes.

Abstract

The recent trend toward increasingly large machine learning models requires both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in deep learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction with a holistic consideration can enable many optimizations that improve performance in distributed workloads. Manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is time-consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices

Methods: Attentive Walk-Aggregating Graph Neural Network