Synthesizing Optimal Collective Algorithms
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi

TL;DR
This paper presents SCCL, a systematic approach to synthesizing collective communication algorithms tailored to a specific hardware topology, targeting the communication bottleneck of data-parallel deep learning training.
Contribution
SCCL introduces a formal synthesis method for collective algorithms that generates latency-optimal and bandwidth-optimal solutions tailored to a hardware topology, with performance competitive with hand-optimized libraries.
Findings
Synthesized novel latency-optimal and bandwidth-optimal algorithms.
Successfully applied to NVIDIA and AMD architectures.
Achieved competitive performance with hand-optimized libraries.
Abstract
Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep-learning, collective communication is the Amdahl's bottleneck of data-parallel training. This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesize collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto-frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode SCCL's synthesis as a quantifier-free SMT formula which can be discharged to a theorem prover. We further demonstrate how to scale our synthesis by exploiting symmetries in topologies and collectives. We synthesize and introduce novel latency and bandwidth optimal algorithms not seen in the…
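To make the idea of topology-specific synthesis concrete, here is a minimal sketch that searches for the fewest-step Allgather schedule on a small topology. It is not SCCL's encoding: the abstract describes a quantifier-free SMT formula handed to a solver, whereas this sketch swaps in a plain breadth-first exhaustive search and only minimizes the number of steps. The function name `min_steps_allgather`, the one-chunk-per-node setup, and the rule that each directed link carries at most one chunk per step are all assumptions made for illustration.

```python
from itertools import product


def min_steps_allgather(num_nodes, links):
    """Exhaustively search for a fewest-step Allgather schedule.

    Assumptions (illustrative, not taken from the paper):
      - node i starts with chunk i only;
      - each directed link (src, dst) carries at most one chunk per step.
    Returns (steps, schedule), or None if the topology is not connected
    enough for every node to receive every chunk.
    """
    start = tuple(frozenset([i]) for i in range(num_nodes))
    goal = frozenset(range(num_nodes))
    frontier = {start: []}  # state -> list of per-step send lists
    seen = {start}
    steps = 0
    while frontier:
        # A state is final once every node holds every chunk.
        for state, schedule in frontier.items():
            if all(held == goal for held in state):
                return steps, schedule
        next_frontier = {}
        for state, schedule in frontier.items():
            # Per link: chunks the source holds that the destination lacks.
            options = [sorted(state[u] - state[v]) or [None] for u, v in links]
            for choice in product(*options):
                held = [set(h) for h in state]
                sends = []
                for (u, v), chunk in zip(links, choice):
                    if chunk is not None:
                        held[v].add(chunk)
                        sends.append((u, v, chunk))
                new_state = tuple(frozenset(h) for h in held)
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier[new_state] = schedule + [sends]
        frontier = next_frontier
        steps += 1
    return None


# Two hypothetical 4-node topologies: a bidirectional and a one-way ring.
one_way = [(i, (i + 1) % 4) for i in range(4)]
two_way = one_way + [(dst, src) for src, dst in one_way]

steps_two_way, schedule_two_way = min_steps_allgather(4, two_way)
steps_one_way, schedule_one_way = min_steps_allgather(4, one_way)
```

The same collective needs a different number of steps on each ring, which is the reason for tailoring the algorithm to the topology. The search grows exponentially with the number of links, so it is only practical for toy inputs; a solver-based encoding and the symmetry reductions mentioned in the abstract are what make larger topologies tractable.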
