Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

Behnaz Arzani; Siva Kesava Reddy Kakarla; Miguel Castro; Srikanth Kandula; Saeed Maleki; Luke Marshall

arXiv:2305.13479 · cs.NI · May 24, 2023 · 2 cites

TL;DR

This paper introduces TECCL, a new method for ML collective communication scheduling that models the problem as a multi-commodity flow, leading to faster and more efficient schedules on large GPU topologies.

Contribution

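The paper recasts ML collective communication scheduling as a multi-commodity flow problem, drawing on similar problems in traffic engineering, and proposes TECCL. TECCL finds schedules that finish collectives faster and/or send fewer bytes than prior schedulers, and it computes them more quickly on larger GPU topologies.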

Findings

01 TECCL produces schedules that finish ML collectives faster than prior schedulers.

02 TECCL's schedules can send fewer bytes than those of previous methods.

03 TECCL computes schedules more quickly on larger GPU topologies.

Abstract

We show that communication schedulers recently proposed for ML collectives do not scale to the increasing problem sizes that arise from training larger models. These schedulers also often produce suboptimal schedules. We make a connection with similar problems in traffic engineering and propose a new method, TECCL, that finds better-quality schedules (e.g., ones that finish collectives faster and/or send fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state of the art.
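
For readers unfamiliar with the term, the textbook multi-commodity flow problem can be written as the linear program below. This is only the generic formulation, not TECCL's actual model, which the abstract does not spell out. All symbols are assumptions introduced for illustration: a directed graph (V, E) with link capacities c(u,v) and per-unit costs a(u,v), and commodities k in K, each with a source s_k, a sink t_k, a demand d_k, and a flow variable f_k(u,v).

```latex
\begin{align*}
\text{minimize}\quad
  & \sum_{k \in K} \sum_{(u,v) \in E} a(u,v)\, f_k(u,v) \\
\text{subject to}\quad
  & \sum_{k \in K} f_k(u,v) \le c(u,v)
    && \forall (u,v) \in E \\
  & \sum_{v:(u,v) \in E} f_k(u,v) - \sum_{v:(v,u) \in E} f_k(v,u) = 0
    && \forall k \in K,\ \forall u \in V \setminus \{s_k, t_k\} \\
  & \sum_{v:(s_k,v) \in E} f_k(s_k,v) - \sum_{v:(v,s_k) \in E} f_k(v,s_k) = d_k
    && \forall k \in K \\
  & f_k(u,v) \ge 0
    && \forall k \in K,\ \forall (u,v) \in E
\end{align*}
```

The first constraint caps the total flow on each link, the second conserves each commodity's flow at intermediate nodes, and the third requires each source to emit its full demand. As a rough intuition only, one can picture the GPUs and switches of a topology as nodes, the links between them as capacitated edges, and the data each GPU must deliver as commodities; how TECCL actually encodes collectives is described in the paper itself.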

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques · Neural Networks and Applications