Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

Banruo Liu; Mubarak Adetunji Ojewale; Yuhan Ding; Marco Canini

arXiv:2405.02969·cs.LG·May 7, 2024

TL;DR

NeuronaBox is a novel emulation approach for distributed DNN training that accurately replicates real system behavior using a subset of nodes and network emulation, enabling high-fidelity performance analysis.

Contribution

This paper introduces NeuronaBox, a flexible and high-fidelity emulation framework for distributed DNN training workloads, combining real node execution with network and communication emulation.

Findings

01 Replicates real system behavior with less than 1% error

02 Accurately models collective communication operations

03 Provides a flexible platform for performance analysis

Abstract

We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment along with the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
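
To make the idea in the abstract concrete, here is a minimal sketch of what "run one real node, emulate the rest" could look like at the level of timing arithmetic. It is not NeuronaBox's implementation: the `EmulatedNetwork` class, its parameters, and the ring all-reduce latency/bandwidth cost model are all assumptions introduced for illustration, and the sketch ignores overlap between computation and communication. The `relative_error` helper shows how an emulated measurement could be compared with a real one against the 1% figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class EmulatedNetwork:
    """Hypothetical stand-in for the emulated peers and network (not from the paper)."""

    num_nodes: int                # total nodes in the job, real plus emulated
    bandwidth_bytes_per_s: float  # assumed per-link bandwidth
    latency_s: float              # assumed per-step latency

    def all_reduce_time(self, message_bytes: int) -> float:
        """Estimate ring all-reduce time with a latency/bandwidth cost model.

        A ring all-reduce over n nodes takes 2 * (n - 1) steps, each moving
        message_bytes / n bytes. This model is an assumption of the sketch.
        """
        n = self.num_nodes
        if n < 2:
            return 0.0
        steps = 2 * (n - 1)
        per_step = self.latency_s + (message_bytes / n) / self.bandwidth_bytes_per_s
        return steps * per_step


def emulated_iteration_time(
    compute_s: float, gradient_bytes: int, net: EmulatedNetwork
) -> float:
    """Iteration time seen by the real node, assuming no compute/communication overlap."""
    return compute_s + net.all_reduce_time(gradient_bytes)


def relative_error(emulated: float, real: float) -> float:
    """Relative difference between an emulated and a real measurement."""
    return abs(emulated - real) / real


if __name__ == "__main__":
    # Illustrative numbers only; none of them come from the paper.
    net = EmulatedNetwork(num_nodes=4, bandwidth_bytes_per_s=1e9, latency_s=0.0)
    t = emulated_iteration_time(compute_s=1.0, gradient_bytes=4_000_000_000, net=net)
    print(f"emulated iteration time: {t:.3f} s")
    print(f"relative error vs. a 7.05 s real run: {relative_error(t, 7.05):.4f}")
```

In a real emulator the compute time would be measured on the real node rather than passed in, and the emulated peers would answer collective calls instead of being reduced to a formula; the sketch only captures the accounting.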

Taxonomy

Topics: Advanced Data Processing Techniques · Context-Aware Activity Recognition Systems · Robotics and Automated Systems