Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

WenZheng Zhang; Yang Hu; Jing Shi; Xiaoying Bai

arXiv:2408.12596·cs.DC·August 23, 2024

TL;DR

Poplar is a system that efficiently scales distributed DNN training on heterogeneous GPU clusters by optimizing resource utilization and automating parallelism, significantly improving training throughput over existing methods.

Contribution

Poplar extends ZeRO with heterogeneity-aware features, introduces a novel batch allocation and search algorithm, and automates parallelism to improve training efficiency on diverse GPU clusters.

Findings

1. Achieves 1.02-3.92x throughput improvement over state-of-the-art systems.

2. Effectively handles various GPU heterogeneity conditions.

3. Demonstrates scalability across multiple heterogeneous GPU clusters.

Abstract

Scaling Deep Neural Networks (DNNs) requires significant computational resources in terms of GPU quantity and compute capacity. In practice, a large number of heterogeneous GPU devices usually exist because of the rapid release cycle of GPU products. Harnessing the power of heterogeneous GPUs efficiently and economically is needed to meet the requirements of DNN research and development. The paper introduces Poplar, a distributed training system that extends the Zero Redundancy Optimizer (ZeRO) with heterogeneity-aware capabilities. We explore a broader spectrum of GPU heterogeneity, including compute capability, memory capacity, quantity, and combinations of them. To achieve high computational efficiency across all heterogeneous conditions, Poplar conducts fine-grained measurements of GPUs in each ZeRO stage. We propose a novel batch allocation method and…
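
To make the idea of heterogeneity-aware batch allocation concrete, here is a minimal sketch in Python. It is not Poplar's algorithm, which the truncated abstract does not describe. It only illustrates one simple way to split a global batch across GPUs that differ in speed and memory: hand out samples one at a time to whichever GPU would finish its share soonest, without exceeding a per-GPU memory cap. The names `GpuProfile`, `samples_per_sec`, `max_batch`, and `allocate_batches` are hypothetical, and the throughput and capacity numbers are assumed to come from some prior measurement step.

```python
from dataclasses import dataclass


@dataclass
class GpuProfile:
    """Hypothetical per-GPU measurements (not taken from the paper)."""
    name: str
    samples_per_sec: float  # assumed measured training throughput
    max_batch: int          # assumed largest local batch that fits in memory


def allocate_batches(gpus, global_batch):
    """Split global_batch across gpus so the slowest finisher is as fast as possible.

    Greedy rule: give the next sample to the GPU whose estimated step time
    after receiving it, (local_batch + 1) / samples_per_sec, is smallest,
    skipping GPUs that have reached their memory cap.
    """
    capacity = sum(g.max_batch for g in gpus)
    if global_batch > capacity:
        raise ValueError(
            f"global batch {global_batch} exceeds total capacity {capacity}"
        )

    alloc = {g.name: 0 for g in gpus}
    for _ in range(global_batch):
        # Only GPUs with memory headroom can take another sample.
        candidates = [g for g in gpus if alloc[g.name] < g.max_batch]
        best = min(
            candidates,
            key=lambda g: (alloc[g.name] + 1) / g.samples_per_sec,
        )
        alloc[best.name] += 1
    return alloc


if __name__ == "__main__":
    cluster = [
        GpuProfile("fast", samples_per_sec=300.0, max_batch=64),
        GpuProfile("slow", samples_per_sec=100.0, max_batch=64),
    ]
    print(allocate_batches(cluster, global_batch=40))
```

In this sketch a GPU that is three times faster ends up with roughly three times the samples, and a GPU with little memory stops receiving work once its cap is reached, with the remainder spilling to the others. A real system would also need to account for communication cost and for how each ZeRO stage changes memory use, which this example ignores.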

Taxonomy

Topics: Brain Tumor Detection and Classification · Advanced Neural Network Applications · Speech Recognition and Synthesis