TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Aashaka Shah; Vijay Chidambaram; Meghan Cowan; Saeed Maleki; Madan Musuvathi; Todd Mytkowicz; Jacob Nelson; Olli Saarikivi; Rachee Singh

arXiv:2111.04867 · cs.DC · October 6, 2022 · 6 cites

TL;DR

TACCL is a tool that lets algorithm designers guide the automatic synthesis of efficient collective communication algorithms for multi-GPU training. Its synthesized algorithms outperform NCCL by up to 6.7x and speed up training of large models such as Transformer-XL and BERT.

Contribution

TACCL introduces a communication sketch abstraction, through which the designer supplies information that shrinks the search space, and a problem encoding that lets the synthesizer target a given hardware configuration and communication collective.

Findings

01 Synthesized algorithms outperform NCCL by up to 6.7x.

02 TACCL speeds up training of Transformer-XL and BERT models by 11% to 2.3x.

03 TACCL's problem encoding lets synthesis scale beyond single-node topologies.

Abstract

Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as AlltoAll and AllReduce, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two…
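As a rough illustration of what a collective such as AllReduce computes, and why the choice of algorithm matters, the sketch below simulates the classic ring AllReduce in plain Python. It is not an algorithm synthesized by TACCL and does not use TACCL's API; the function name `ring_allreduce` and the layout (each rank holds one number per chunk, with as many chunks as ranks) are assumptions made only for this example.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum AllReduce over n ranks arranged in a ring.

    buffers[r] is rank r's input, split into n chunks (one number per
    chunk in this simplified, hypothetical layout). Returns each rank's
    buffer after the collective; every rank ends with the element-wise sum.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each rank must hold exactly n chunks")
    data = [list(b) for b in buffers]  # do not mutate the caller's input

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all sends first so every rank sends "simultaneously".
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Now rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) mod n
    # to its right neighbour, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value

    return data


if __name__ == "__main__":
    inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for rank, result in enumerate(ring_allreduce(inputs)):
        print(f"rank {rank}: {result}")
```

The ring schedule takes 2(n - 1) communication steps and only ever uses links between neighbouring ranks. On a topology with richer or uneven connectivity, a different schedule can do better, which is the kind of search a synthesizer automates.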

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Attention Dropout · Adaptive Softmax · Layer Normalization · Variational Dropout · Linear Warmup With Linear Decay · Cosine Annealing · Linear Warmup With Cosine Annealing