Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

Jonas Svedas; Nathan Laubeuf; Ryan Harvey; Arjun Singh; Changhai Man; Abubakr Nada; Tushar Krishna; James Myers; Debjyoti Bhattacharjee

arXiv:2604.12090 · cs.DC · April 15, 2026

TL;DR

This paper explores using MLIR's StableHLO dialect as a unified representation for cross-architecture performance modeling of distributed ML workloads, enabling portable and comparative analysis across GPUs and TPUs.

Contribution

It introduces a StableHLO-based simulation methodology that maps a single workload onto multiple performance models, facilitating cross-platform and fidelity comparisons without physical hardware.

Findings

01. StableHLO preserves relative performance trends across architectures.

02. Prediction errors are within practical bounds for early-stage design.

03. Fidelity-dependent limitations are exposed in existing GPU simulators.
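
As a rough illustration of the first two findings, the sketch below shows one common way to check a predictor: compute the relative error per platform and test whether the predicted ordering of platforms matches the measured one. The platform names and all timing values are made-up placeholders, not results from the paper, and the paper's own error metric may differ.

```python
from typing import Dict, List


def relative_error(predicted: float, measured: float) -> float:
    """Absolute prediction error as a fraction of the measured value."""
    return abs(predicted - measured) / measured


def ranking(times: Dict[str, float]) -> List[str]:
    """Platform names ordered from fastest to slowest."""
    return sorted(times, key=lambda name: times[name])


# Placeholder step times in seconds (hypothetical, NOT taken from the paper).
measured = {"gpu_a": 1.00, "tpu_b": 0.80}
predicted = {"gpu_a": 1.10, "tpu_b": 0.92}

# Per-platform relative error of the prediction.
errors = {name: relative_error(predicted[name], measured[name]) for name in measured}

# A relative trend is preserved when both orderings agree,
# even if the absolute predictions are off.
trend_preserved = ranking(predicted) == ranking(measured)

if __name__ == "__main__":
    for name, err in errors.items():
        print(f"{name}: relative error {err:.1%}")
    print("trend preserved:", trend_preserved)
```

The point of separating the two checks is that a model can be useful for early-stage comparison when it orders platforms correctly, even if its absolute error is not small.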

Abstract

Predicting the performance of large-scale distributed machine learning (ML) workloads across multiple accelerator architectures remains a central challenge in ML system design. Existing GPU- and TPU-focused simulators are typically architecture-specific, while distributed training simulators rely on workload-specific analytical models or costly post-execution traces, limiting portability and cross-platform comparison. This work evaluates whether MLIR's StableHLO dialect can serve as a unified workload representation for cross-architecture and cross-fidelity performance modeling of distributed ML workloads. The study establishes a StableHLO-based simulation methodology that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. Using this methodology, workloads are evaluated across GPUs and TPUs…
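
The idea of mapping one workload representation onto several performance models can be sketched as follows. This is a minimal illustration and not the paper's implementation: the `Op` record, the `Device` parameters, the roofline-style analytical model, the lookup-table "profiling" model, and all numeric values are assumptions made for the example. A real pipeline would extract the operations from a StableHLO module rather than list them by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Op:
    """Hypothetical record for one operation taken from a StableHLO module."""
    name: str           # operation name, e.g. "stablehlo.dot_general"
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read plus bytes written


@dataclass(frozen=True)
class Device:
    """Hypothetical accelerator description used by the analytical model."""
    peak_flops: float     # FLOP/s
    mem_bandwidth: float  # bytes/s


Predictor = Callable[[Op], float]


def analytical_predictor(device: Device) -> Predictor:
    """Roofline-style estimate: an op is bound by compute or by memory traffic."""
    def predict(op: Op) -> float:
        return max(op.flops / device.peak_flops,
                   op.bytes_moved / device.mem_bandwidth)
    return predict


def profiling_predictor(table: Dict[str, float], fallback: Predictor) -> Predictor:
    """Use a measured per-op latency when available, else defer to a fallback."""
    def predict(op: Op) -> float:
        if op.name in table:
            return table[op.name]
        return fallback(op)
    return predict


def predict_workload(ops: List[Op], predictor: Predictor) -> float:
    """Total predicted time, assuming ops run back to back with no overlap."""
    return sum(predictor(op) for op in ops)


# One workload description, shared by every predictor (placeholder values).
WORKLOAD = [
    Op("stablehlo.dot_general", flops=2e12, bytes_moved=1e9),
    Op("stablehlo.add", flops=1e6, bytes_moved=4e9),
]

DEVICE = Device(peak_flops=100e12, mem_bandwidth=1e12)  # placeholder device
analytical = analytical_predictor(DEVICE)
profiled = profiling_predictor({"stablehlo.dot_general": 0.03}, fallback=analytical)

if __name__ == "__main__":
    print("analytical:", predict_workload(WORKLOAD, analytical))
    print("profiled:  ", predict_workload(WORKLOAD, profiled))
```

Because every predictor consumes the same list of operations, swapping the device description or the predictor changes the estimate without touching the workload, which is the portability property the abstract describes. The sketch ignores communication and overlap between operations, both of which matter for distributed workloads.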
