COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, and Alexandros Daglis

TL;DR
COMET is a comprehensive methodology that helps design and optimize large distributed deep learning clusters by exploring the impact of resource provisioning and parallelization strategies on training performance.
Contribution
It introduces a holistic, step-by-step workflow for cluster design space exploration tailored for distributed deep learning training systems.
Findings
Performance differences of up to 7.7x between configurations
Memory expansion can improve performance by up to 1.4x
COMET effectively guides system design and optimization
Abstract
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable…
Taxonomy
Topics: Advanced Data Processing Techniques
