Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai

TL;DR
Poplar is a system that efficiently scales distributed DNN training on heterogeneous GPU clusters by optimizing resource utilization and automating parallelism, significantly improving training throughput over existing methods.
Contribution
Poplar extends ZeRO with heterogeneity-aware features, introduces a novel batch allocation and search algorithm, and automates parallelism to improve training efficiency on diverse GPU clusters.
Findings
Achieves a 1.02x to 3.92x throughput improvement over state-of-the-art systems.
Effectively handles various GPU heterogeneity conditions.
Demonstrates scalability across multiple heterogeneous GPU clusters.
Abstract
Scaling Deep Neural Networks (DNNs) requires significant computational resources in terms of GPU quantity and compute capacity. In practice, clusters often contain many heterogeneous GPU devices because of the rapid release cycle of GPU products. Harnessing heterogeneous GPUs efficiently and economically is therefore essential to meet the requirements of DNN research and development. The paper introduces Poplar, a distributed training system that extends the Zero Redundancy Optimizer (ZeRO) with heterogeneity-aware capabilities. We explore a broader spectrum of GPU heterogeneity, including compute capability, memory capacity, quantity, and combinations of them. To achieve high computational efficiency across all heterogeneous conditions, Poplar conducts fine-grained measurements of GPUs in each ZeRO stage. We propose a novel batch allocation method and…
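To make the idea of heterogeneity-aware batch allocation concrete, here is a minimal Python sketch. It is not Poplar's algorithm, which this excerpt does not describe. It assumes a simple proportional rule: each GPU receives a share of the global batch proportional to its measured throughput, capped by the largest batch that fits in its memory. The function name `allocate_batches` and its parameters are hypothetical.

```python
from typing import List


def allocate_batches(global_batch: int,
                     throughput: List[float],
                     max_batch: List[int]) -> List[int]:
    """Split a global batch across GPUs (illustrative assumption, not Poplar's method).

    throughput[i]: measured samples/sec of GPU i (assumed positive).
    max_batch[i]:  largest per-step batch that fits in GPU i's memory.
    """
    if global_batch > sum(max_batch):
        raise ValueError("global batch exceeds combined memory capacity")

    alloc = [0] * len(throughput)
    active = {i for i in range(len(throughput)) if max_batch[i] > 0}
    remaining = global_batch

    while remaining > 0:
        total = sum(throughput[i] for i in active)
        progressed = False
        for i in list(active):
            # Proportional share (rounded down), limited by free memory.
            share = int(remaining * throughput[i] / total)
            give = min(share, max_batch[i] - alloc[i])
            if give > 0:
                alloc[i] += give
                progressed = True
            if alloc[i] == max_batch[i]:
                active.discard(i)  # GPU is full; redistribute among the rest
        remaining = global_batch - sum(alloc)

        if not progressed:
            # Only rounding leftovers remain: give one sample at a time
            # to the fastest GPUs that still have room.
            for i in sorted(active, key=lambda j: throughput[j], reverse=True):
                if remaining == 0:
                    break
                alloc[i] += 1
                remaining -= 1
    return alloc


if __name__ == "__main__":
    # A GPU measured at 3x the throughput of another gets 3x the samples.
    print(allocate_batches(64, [3.0, 1.0], [100, 100]))
    # With a tight memory cap on the fast GPU, the overflow moves to the slow one.
    print(allocate_batches(64, [3.0, 1.0], [32, 100]))
```

A proportional rule like this aims to have all GPUs finish a step at roughly the same time, so faster devices are not left idle waiting for slower ones at synchronization points. A real system would also need to account for communication cost and for how throughput varies with batch size.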
Taxonomy
Topics: Brain Tumor Detection and Classification · Advanced Neural Network Applications · Speech Recognition and Synthesis
