Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
Banruo Liu, Mubarak Adetunji Ojewale, Yuhan Ding, Marco Canini

TL;DR
NeuronaBox is a novel emulation approach for distributed DNN training that accurately replicates real system behavior using a subset of nodes and network emulation, enabling high-fidelity performance analysis.
Contribution
This paper introduces NeuronaBox, a flexible and high-fidelity emulation framework for distributed DNN training workloads, combining real node execution with network and communication emulation.
Findings
Replicates real system behavior with less than 1% error
Accurately models collective communication operations
Provides a flexible platform for performance analysis
Abstract
We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment along with the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
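The abstract does not say how the error margin is computed. One common reading is the relative difference between an emulated measurement and the corresponding real-system measurement. The sketch below assumes that reading. The function name and the timing values are hypothetical placeholders, not results from the paper.

```python
def relative_error_pct(emulated: float, real: float) -> float:
    """Percentage difference of an emulated measurement from the real one."""
    if real == 0:
        raise ValueError("real measurement must be non-zero")
    return abs(emulated - real) / abs(real) * 100.0


# Placeholder per-iteration times in milliseconds (illustrative only,
# NOT measurements reported by the authors).
real_ms = [100.0, 250.0, 400.0]
emulated_ms = [100.5, 248.0, 403.0]

# Compare each emulated measurement against its real counterpart.
errors = [relative_error_pct(e, r) for e, r in zip(emulated_ms, real_ms)]
worst = max(errors)

print(f"worst-case error: {worst:.2f}%")
print("within 1% margin:", worst < 1.0)
```

Taking the worst case over all measurements is a conservative choice. An average would also be a plausible reading of the "less than 1%" claim.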
Taxonomy
Topics: Advanced Data Processing Techniques · Context-Aware Activity Recognition Systems · Robotics and Automated Systems
