MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Srinivas Sridharan, Theodor-Adrian Badea, Andy Balogh, Bradford M. Beckmann, Brian Coutinho, Louis Feng, Sheng Fu, Sanshan Gao, Mehryar Garakani, Taekyung Heo, David Kanter, Josh Ladd, Ziwei Li, Winston Liu, Changhai Man, Dan Mihailescu, Spandan More, Joongun Park

TL;DR
Chakra is an open ecosystem that uses standardized execution traces to improve performance benchmarking and co-design in AI systems, enabling better observation, reproduction, and optimization of distributed ML workloads.
Contribution
The paper introduces Chakra, a portable framework with a graph-based execution trace format for performance analysis and co-design of AI/ML workloads across diverse tools and platforms.
Findings
Chakra ETs effectively represent key operations and dependencies in distributed AI workloads.
Real-world case studies demonstrate Chakra's utility in optimizing AI system performance.
Industry adopters include NVIDIA, AMD, and Meta, among others.
Abstract
The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning (ML) workload behavior in production AI systems and enables efficient software-hardware (SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace (ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We…
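To make the idea of a graph-based execution trace concrete, here is a minimal Python sketch. It is not the Chakra ET schema: the names `OpKind`, `TraceNode`, `duration_us`, `data_deps`, `ctrl_deps`, and `critical_path_us` are hypothetical, chosen only to mirror the concepts the abstract lists (compute, memory, and communication operations, data and control dependencies, and timing). It assumes each node carries a measured duration and that a node can start once all of its dependencies have finished.

```python
from dataclasses import dataclass, field
from enum import Enum
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class OpKind(Enum):
    """Hypothetical operation categories, mirroring the abstract's list."""
    COMPUTE = "compute"
    MEMORY = "memory"
    COMMUNICATION = "communication"


@dataclass
class TraceNode:
    """One operation in an illustrative trace graph (not the real ET format)."""
    node_id: int
    name: str
    kind: OpKind
    duration_us: float
    data_deps: list[int] = field(default_factory=list)  # producers of inputs
    ctrl_deps: list[int] = field(default_factory=list)  # ordering-only edges


def critical_path_us(nodes: list[TraceNode]) -> float:
    """Return the longest dependency chain, in microseconds.

    Assumes unlimited parallelism: a node starts as soon as every data
    and control dependency has finished.
    """
    by_id = {n.node_id: n for n in nodes}
    # Map each node to the set of nodes that must finish before it.
    preds = {n.node_id: set(n.data_deps) | set(n.ctrl_deps) for n in nodes}
    finish: dict[int, float] = {}
    # static_order() yields every node after all of its predecessors.
    for nid in TopologicalSorter(preds).static_order():
        start = max((finish[d] for d in preds[nid]), default=0.0)
        finish[nid] = start + by_id[nid].duration_us
    return max(finish.values(), default=0.0)


# A made-up four-node training step, for illustration only.
example = [
    TraceNode(0, "load_weights", OpKind.MEMORY, 5.0),
    TraceNode(1, "matmul_fwd", OpKind.COMPUTE, 20.0, data_deps=[0]),
    TraceNode(2, "all_reduce_grad", OpKind.COMMUNICATION, 30.0, data_deps=[1]),
    TraceNode(3, "optimizer_step", OpKind.COMPUTE, 10.0,
              data_deps=[2], ctrl_deps=[1]),
]

print(critical_path_us(example))
```

Keeping data and control dependencies as separate edge lists lets a consumer such as a simulator or replay tool distinguish "needs this tensor" from "must run after this", even though both constrain the schedule in the same way in this sketch.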
