TL;DR
This position paper proposes a standardized, graph-based representation for collective algorithms in distributed machine learning. The goals are to improve interoperability between tools, enable co-optimization of collective communication with the rest of the workload, and reduce per-tool engineering effort.
Contribution
It introduces a common collective algorithm representation based on Chakra Execution Trace, enabling better integration and simulation of collective algorithms across different tools.
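To make the idea of a graph-based collective algorithm representation concrete, here is a minimal sketch that expresses a ring all-gather as per-rank graphs of send and receive nodes with explicit dependencies. The `CommNode` class, its fields, and the helper functions are hypothetical names chosen for illustration. They are not the Chakra Execution Trace schema or the paper's implementation, and the choice of ring all-gather is an assumption made only to have a small example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class CommNode:
    """One communication step in a rank's graph (hypothetical schema)."""
    node_id: int
    kind: str            # "SEND" or "RECV"
    rank: int            # rank that executes this node
    peer: int            # rank on the other end of the transfer
    chunk: int           # index of the data chunk being moved
    deps: List[int] = field(default_factory=list)  # node_ids that must finish first


def ring_all_gather(num_ranks: int) -> Dict[int, List[CommNode]]:
    """Build one dependency graph per rank for a ring all-gather."""
    graphs: Dict[int, List[CommNode]] = {r: [] for r in range(num_ranks)}
    next_id = 0
    for r in range(num_ranks):
        prev_recv: Optional[int] = None
        for step in range(num_ranks - 1):
            # After step 0, a rank forwards the chunk it received in the
            # previous step, so both nodes depend on that earlier receive.
            deps = [] if prev_recv is None else [prev_recv]
            send = CommNode(next_id, "SEND", r, (r + 1) % num_ranks,
                            (r - step) % num_ranks, list(deps))
            recv = CommNode(next_id + 1, "RECV", r, (r - 1) % num_ranks,
                            (r - step - 1) % num_ranks, list(deps))
            graphs[r] += [send, recv]
            prev_recv = recv.node_id
            next_id += 2
    return graphs


def chunks_after(graphs: Dict[int, List[CommNode]], rank: int) -> Set[int]:
    """Chunks a rank holds at the end: its own plus everything it received."""
    return {rank} | {n.chunk for n in graphs[rank] if n.kind == "RECV"}


def sends_match_recvs(graphs: Dict[int, List[CommNode]]) -> bool:
    """Check that every SEND has a matching RECV on the peer rank."""
    nodes = [n for g in graphs.values() for n in g]
    sends = sorted((n.rank, n.peer, n.chunk) for n in nodes if n.kind == "SEND")
    recvs = sorted((n.peer, n.rank, n.chunk) for n in nodes if n.kind == "RECV")
    return sends == recvs


graphs = ring_all_gather(4)
print(len(graphs[0]), chunks_after(graphs, 0), sends_match_recvs(graphs))
```

Because the algorithm is plain graph data, a consumer such as a simulator or a runtime could walk the same nodes without knowing which tool produced them. That separation between producer and consumer is the kind of decoupling a shared representation is meant to provide.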
Findings
Demonstrated the feasibility of the standardized workflow with a proof-of-concept.
Enabled simulation of collective algorithms across various network configurations.
Improved interoperability between collective algorithm producers and consumers.
Abstract
The explosion in machine learning model size has led to models being executed on distributed clusters at very large scale. Many works have tried to optimize the production of collective algorithms and the execution of collective communications, which are a bottleneck in distributed machine learning. However, each work uses its own collective algorithm representation, which discourages co-optimizing collective communication with the rest of the workload. The lack of a standardized collective algorithm representation has also hindered interoperability between collective algorithm producers and consumers. Additionally, tool-specific conversions and modifications have to be made for each pair of tools producing and consuming collective algorithms, which adds engineering effort. In this position paper, we propose a standardized workflow leveraging a common collective algorithm…
