Towards a Standardized Representation for Deep Learning Collective Algorithms

Jinsun Yoo; William Won; Meghan Cowan; Nan Jiang; Benjamin Klenk; Srinivas Sridharan; Tushar Krishna

arXiv:2408.11008·cs.DC·October 21, 2025

TL;DR

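This position paper proposes a standardized, graph-based representation for collective algorithms in distributed machine learning. The goal is to improve interoperability between tools, enable co-optimization of collective communication with the rest of the workload, and reduce the engineering effort of converting between tool-specific formats.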

Contribution

It introduces a common collective algorithm representation based on the Chakra Execution Trace, so that algorithms produced by one tool can be consumed and simulated by another without tool-specific conversion.
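To make the idea of a graph-based representation concrete, the sketch below expresses a ring all-gather as a dependency graph of send and receive nodes. It is only an illustration under stated assumptions: the `Node` class, its field names, and `ring_all_gather` are hypothetical and are not the Chakra Execution Trace schema or the API of the linked repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical graph node; NOT the actual Chakra ET schema."""
    node_id: int
    rank: int            # device that executes this node
    kind: str            # "SEND" or "RECV"
    peer: int            # device on the other end of the transfer
    chunk: int           # index of the data chunk being moved
    deps: List[int] = field(default_factory=list)  # ids that must finish first


def ring_all_gather(num_ranks: int) -> List[Node]:
    """Build a ring all-gather as a list of nodes with data dependencies."""
    nodes: List[Node] = []
    prev_recv = {}  # rank -> id of the RECV node from the previous step
    for step in range(num_ranks - 1):
        cur_recv = {}
        for rank in range(num_ranks):
            nxt = (rank + 1) % num_ranks
            prv = (rank - 1) % num_ranks
            send_chunk = (rank - step) % num_ranks      # own chunk at step 0
            recv_chunk = (rank - step - 1) % num_ranks  # what prv sends now
            # A chunk can only be forwarded after it has been received.
            send_deps = [prev_recv[rank]] if step > 0 else []
            nodes.append(Node(len(nodes), rank, "SEND", nxt, send_chunk, send_deps))
            recv = Node(len(nodes), rank, "RECV", prv, recv_chunk)
            nodes.append(recv)
            cur_recv[rank] = recv.node_id
        prev_recv = cur_recv
    return nodes


if __name__ == "__main__":
    for node in ring_all_gather(4):
        print(node)
```

With four ranks this yields 24 nodes (three steps, one send and one receive per rank per step). Because the algorithm is plain data, a consumer such as a simulator only needs to walk the nodes and respect `deps`, which is the kind of producer/consumer decoupling the paper argues for.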

Findings

01 Demonstrated the feasibility of the standardized workflow with a proof-of-concept.

02 Enabled simulation of collective algorithms across various network configurations.

03 Improved interoperability between collective algorithm producers and consumers.

Abstract

The explosion of machine learning model size has led to execution on distributed clusters at a very large scale. Many works have tried to optimize the process of producing collective algorithms and running collective communications, which act as a bottleneck to distributed machine learning. However, each work uses its own collective algorithm representation, which discourages co-optimizing collective communication with the rest of the workload. The lack of a standardized collective algorithm representation has also hindered interoperability between collective algorithm producers and consumers. Additionally, tool-specific conversions and modifications have to be made for each pair of tools producing and consuming collective algorithms, which adds engineering effort. In this position paper, we propose a standardized workflow leveraging a common collective algorithm…

Code & Models

Repositories

astra-sim/collectiveapi (official)
