Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
Behnaz Arzani, Siva Kesava Reddy Kakarla, Miguel Castro, Srikanth Kandula, Saeed Maleki, Luke Marshall

TL;DR
This paper introduces TECCL, a new method for ML collective communication scheduling that models the problem as a multi-commodity flow, leading to faster and more efficient schedules on large GPU topologies.
Contribution
It redefines ML collective communication scheduling as a multi-commodity flow problem and proposes TECCL, which outperforms existing methods in speed and efficiency on large-scale GPU topologies.
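For readers unfamiliar with the term, a textbook multi-commodity flow program is sketched below. It is a generic formulation under assumed notation, not the exact model used in TECCL, which may differ (for example, in how it treats time or data copies). The symbols are illustrative assumptions: a directed graph $G=(V,E)$ of GPUs and links, link capacities $c(u,v)$, and commodities $k \in K$, each with a source $s_k$, a sink $t_k$, and a demand $d_k$. The objective shown, minimizing the total flow carried over all links, is one possible choice.

```latex
\begin{align*}
\min_{f \ge 0} \quad
  & \sum_{(u,v) \in E} \sum_{k \in K} f_k(u,v)
  && \text{(total traffic placed on links)} \\
\text{s.t.} \quad
  & \sum_{k \in K} f_k(u,v) \le c(u,v)
  && \forall (u,v) \in E
  && \text{(link capacity)} \\
  & \sum_{w : (u,w) \in E} f_k(u,w) - \sum_{w : (w,u) \in E} f_k(w,u) =
    \begin{cases}
      d_k  & \text{if } u = s_k \\
      -d_k & \text{if } u = t_k \\
      0    & \text{otherwise}
    \end{cases}
  && \forall u \in V,\ \forall k \in K
  && \text{(flow conservation)}
\end{align*}
```

In this reading, each chunk of data that one GPU must deliver to another plays the role of a commodity, and the links between GPUs supply the shared capacity that the commodities compete for.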
Findings
TECCL finds schedules that finish ML collectives faster than prior schedulers.
TECCL's schedules can send fewer bytes than those of previous methods.
TECCL computes schedules more quickly on larger GPU topologies, where prior schedulers do not scale.
Abstract
We show that the communication schedulers proposed in recent work for ML collectives do not scale to the increasing problem sizes that arise from training larger models. These schedulers also often produce suboptimal schedules. We make a connection with similar problems in traffic engineering and propose a new method, TECCL, that finds better-quality schedules (e.g., ones that finish collectives faster and/or send fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state of the art.
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques · Neural Networks and Applications
