COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, and Alexandros Daglis

arXiv:2211.16648·cs.DC·March 15, 2024·5 cites


TL;DR

COMET is a comprehensive methodology that helps design and optimize large distributed deep learning clusters by exploring the impact of resource provisioning and parallelization strategies on training performance.

Contribution

It introduces a holistic, step-by-step workflow for cluster design space exploration tailored for distributed deep learning training systems.

Findings

01. Performance differences of up to 7.7x between configurations

02. Memory expansion can improve performance by up to 1.4x

03. COMET effectively guides system design and optimization

Abstract

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable…

Taxonomy

Topics: Advanced Data Processing Techniques