# COMET: A Framework for Modeling Compound Operation Dataflows with Explicit Collectives

**Authors:** Shubham Negi, Manik Singhal, Aayush Ankit, Sudeep Bhoja, Kaushik Roy

arXiv: 2509.00599 · 2025-09-03

## TL;DR

COMET is a framework that models and optimizes dataflows involving compound operations with explicit collective communication, improving performance and energy efficiency for modern machine learning accelerators.

## Contribution

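COMET introduces an explicit representation of collective communication across spatial clusters, together with latency and energy cost models that capture dependencies between GEMM and non-GEMM operations inside a compound operation. Modeling collectives explicitly opens up a broader mapping space for dataflow optimization on ML accelerators.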

## Key findings

- Achieves up to 1.42× speedup for GEMM-Softmax over unfused baselines
- Achieves up to 3.46× speedup for GEMM-LayerNorm over unfused baselines
- Achieves up to 1.82× speedup for self-attention over unfused baselines

## Abstract

Modern machine learning accelerators are designed to efficiently execute deep neural networks (DNNs) by optimizing data movement, memory hierarchy, and compute throughput. However, emerging DNN models such as large language models and state space models increasingly rely on compound operations (structured compositions of multiple basic operations), which introduce new challenges for dataflow optimization and for minimizing off-chip memory traffic. Moreover, as model size continues to grow, deployment across spatially distributed compute clusters becomes essential, requiring frequent and complex collective communication. Existing dataflow optimization frameworks and performance models either focus on single operations or lack explicit modeling of collective communication cost, limiting their applicability to modern workloads. To address these limitations, we propose COMET, a framework for modeling and optimizing dataflow for compound operations on machine learning accelerators. COMET introduces a novel representation that explicitly models collective communication across spatial clusters, along with latency and energy cost models that account for both GEMM and non-GEMM operation-level dependencies within compound operations. We demonstrate COMET's capabilities to analyze and optimize dataflows for compound operations such as GEMM-Softmax, GEMM-LayerNorm, and self-attention, across both edge and cloud accelerator configurations. Our collective-aware modeling enables exploration of a broader mapping space, leading to improved performance and energy efficiency. Specifically, our optimized dataflows achieve up to 1.42× speedup for GEMM-Softmax, 3.46× for GEMM-LayerNorm, and 1.82× for self-attention compared to unfused baselines.
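
To make the abstract's terms concrete, the sketch below shows why a compound operation such as GEMM-Softmax needs collective communication once it is spread over several clusters. It is an illustration only, not COMET's representation or cost model. The column-wise partitioning, the toy `all_reduce` helper, and every function name are assumptions introduced here. Each simulated cluster computes one column tile of the GEMM output, and the row-wise softmax then needs two all-reduces across clusters: one for the row maxima and one for the row sums of exponentials.

```python
import math


def matmul(a, b):
    """Plain GEMM on lists of rows: a is M x K, b is K x N."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(s):
    """Reference (single-cluster) numerically stable row-wise softmax."""
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out


def all_reduce(per_cluster, op):
    """Toy all-reduce (hypothetical helper): per_cluster[c][i] is cluster c's
    value for row i; every cluster receives op applied across clusters."""
    reduced = [op(vals) for vals in zip(*per_cluster)]
    return [list(reduced) for _ in per_cluster]


def gemm_softmax_split(a, b, num_clusters):
    """GEMM-Softmax with the N dimension split across simulated clusters."""
    n = len(b[0])
    assert n % num_clusters == 0, "assumes N divides evenly across clusters"
    width = n // num_clusters

    # Each cluster owns a contiguous slice of B's columns and computes its tile.
    tiles = []
    for c in range(num_clusters):
        b_slice = [row[c * width:(c + 1) * width] for row in b]
        tiles.append(matmul(a, b_slice))

    # Collective 1: all-reduce(max) over the per-row local maxima.
    local_max = [[max(row) for row in tile] for tile in tiles]
    row_max = all_reduce(local_max, max)

    # Local non-GEMM step: exponentiate against the global row maximum.
    exps = [[[math.exp(v - row_max[c][i]) for v in row]
             for i, row in enumerate(tile)]
            for c, tile in enumerate(tiles)]

    # Collective 2: all-reduce(sum) over the per-row local sums.
    local_sum = [[sum(row) for row in e] for e in exps]
    row_sum = all_reduce(local_sum, sum)

    # Normalize locally, then concatenate the column tiles for inspection.
    out = []
    for i in range(len(a)):
        row = []
        for c in range(num_clusters):
            row.extend(v / row_sum[c][i] for v in exps[c][i])
        out.append(row)
    return out
```

Calling `gemm_softmax_split(a, b, 2)` should agree with `softmax_rows(matmul(a, b))` up to floating-point rounding. In this sketch the two `all_reduce` calls are the communication steps that a collective-aware cost model would have to account for; a different partitioning (for example, splitting rows instead of columns) would need no collective for the softmax at all, which is the kind of mapping trade-off the abstract refers to.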

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00599/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00599/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/2509.00599/full.md

---
Source: https://tomesphere.com/paper/2509.00599