TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, Rachee Singh

TL;DR
TACCL is a tool that guides the automatic synthesis of efficient collective communication algorithms for multi-GPU training, significantly outperforming existing libraries and accelerating training of large models.
Contribution
TACCL introduces a communication sketch abstraction, through which the algorithm designer supplies information that shrinks the synthesizer's search space, and a problem encoding that lets synthesis target a specific hardware configuration and communication collective.
Findings
Synthesized algorithms outperform NCCL by up to 6.7x.
TACCL's algorithms speed up training of Transformer-XL and BERT models by between 11% and 2.3x.
TACCL's problem encoding lets synthesis scale beyond single-node topologies.
Abstract
Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as AlltoAll and AllReduce, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two…
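To make the two collectives named in the abstract concrete, here is a minimal single-process sketch of their semantics in Python. It is not one of TACCL's synthesized algorithms: `ring_allreduce` simulates the textbook ring algorithm, and `alltoall` is only a reference definition of the result. Both function names, and the use of plain lists to stand in for per-GPU buffers, are assumptions made for illustration.

```python
def alltoall(buffers):
    """Reference semantics of AlltoAll: rank j ends up with element j of every rank's buffer."""
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


def ring_allreduce(buffers):
    """Simulate a textbook ring AllReduce (sum) over equal-length per-rank buffers."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must be divisible by the number of ranks"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies, one list per simulated rank

    def span(c):
        # Indices covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n to
    # rank r + 1, which adds it to its own copy. Payloads are snapshotted first
    # so that all sends in a step behave as if they happened simultaneously.
    for s in range(n - 1):
        sends = [((r - s) % n, r) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for c, r in sends]
        for (c, r), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2, allgather: in step s, rank r forwards chunk (r + 1 - s) mod n to
    # rank r + 1, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, r) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for c, r in sends]
        for (c, r), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(ranks))
    print(alltoall(ranks))
```

A ring uses only neighbor-to-neighbor links and takes 2(n - 1) steps for n ranks. Which links to use, and in what order, is the kind of choice that depends on the hardware topology, which is why the abstract frames algorithm selection as a per-configuration synthesis problem.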
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Attention Dropout · Adaptive Softmax · Layer Normalization · Variational Dropout · Linear Warmup With Linear Decay · Cosine Annealing · Linear Warmup With Cosine Annealing
