A Survey of End-to-End Modeling for Distributed DNN Training: Workloads, Simulators, and TCO
Jonas Svedas, Hannah Watson, Nathan Laubeuf, Diksha Moolchandani, Abubakr Nada, Arjun Singh, Dwaipayan Biswas, James Myers, Debjyoti Bhattacharjee

TL;DR
This survey reviews the landscape of distributed DNN training simulators, focusing on workload representation, simulation infrastructure, and TCO models, highlighting trends, limitations, and future research directions.
Contribution
It provides a comprehensive overview of existing simulators and models for distributed DNN training, emphasizing workload abstraction, simulation frameworks, and TCO/emissions analysis.
Findings
Comparison of simulation frameworks and TCO models
Identification of common limitations in current tools
Overview of emerging trends and open challenges
Abstract
Distributed deep neural networks (DNNs) have become a cornerstone for scaling machine learning to meet the demands of increasingly complex applications. However, the rapid growth in model complexity far outpaces CMOS technology scaling, making sustainable and efficient system design a critical challenge. Addressing this requires coordinated co-design across software, hardware, and technology layers. Due to the prohibitive cost and complexity of deploying full-scale training systems, simulators play a pivotal role in enabling this design exploration. This survey reviews the landscape of distributed DNN training simulators, focusing on three major dimensions: workload representation, simulation infrastructure, and models for total cost of ownership (TCO) including carbon emissions. It covers how workloads are abstracted and used in simulation, outlines common workload representation…
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning in Materials Science · Parallel Computing and Optimization Techniques
