HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, Young-ri Choi

TL;DR
HetPipe is a system that combines pipelined model parallelism and data parallelism to enable efficient training of large DNNs on heterogeneous GPU clusters, including low-power GPUs, achieving up to 49% faster convergence.
Contribution
HetPipe introduces a novel integration of pipelined model parallelism (PMP) and data parallelism (DP), together with a new parameter synchronization model, Wave Synchronous Parallel (WSP), and demonstrates improved training speed on heterogeneous GPU clusters.
Findings
Achieves up to 49% faster convergence than a state-of-the-art data-parallel (DP) baseline.
Successfully integrates PMP and DP for heterogeneous GPU training.
Provides convergence proof for the proposed synchronization model.
Abstract
Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short release cycle of new GPU architectures. In this paper, we investigate how to enable training of large DNN models on a heterogeneous GPU cluster that possibly includes whimpy GPUs that, as a standalone, could not be used for training. We present a DNN training system, HetPipe (Heterogeneous Pipeline), that integrates pipelined model parallelism (PMP) with data parallelism (DP). In HetPipe, a group of multiple GPUs, called a virtual worker, processes minibatches in a pipelined manner, and multiple such virtual workers employ data parallelism for higher performance. We also propose a novel parameter synchronization model, which we refer to as Wave…
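The arrangement the abstract describes, with GPUs grouped into virtual workers that each pipeline minibatches across their stages, can be pictured with a toy sketch. It is only an illustration under stated assumptions: the GPU names, the round-robin grouping rule, and the forward-only schedule are all hypothetical, and none of this is HetPipe's actual partitioning or scheduling algorithm, which this page does not describe.

```python
from typing import Dict, List, Optional


def make_virtual_workers(gpus: Dict[str, int], num_workers: int) -> List[List[str]]:
    """Group GPUs (name -> memory in GB) into virtual workers.

    Assumption: GPUs are dealt round-robin, largest memory first, so each
    virtual worker receives a mix of strong and weak devices. This rule is
    hypothetical and chosen only for illustration.
    """
    ordered = sorted(gpus, key=lambda name: (-gpus[name], name))
    workers: List[List[str]] = [[] for _ in range(num_workers)]
    for i, name in enumerate(ordered):
        workers[i % num_workers].append(name)
    return workers


def pipeline_schedule(num_stages: int, num_minibatches: int) -> List[List[Optional[int]]]:
    """Return schedule[t][s]: the minibatch stage s handles at step t, or None if idle.

    Forward passes only: minibatch m enters stage 0 at step m and moves one
    stage per step, so several minibatches are in flight at once.
    """
    steps = num_stages + num_minibatches - 1
    return [
        [t - s if 0 <= t - s < num_minibatches else None for s in range(num_stages)]
        for t in range(steps)
    ]


if __name__ == "__main__":
    # Hypothetical cluster: two large GPUs and two small ("whimpy") ones.
    cluster = {"big0": 24, "big1": 24, "small0": 6, "small1": 6}
    for vw in make_virtual_workers(cluster, num_workers=2):
        # Each virtual worker pipelines its own minibatches; the virtual
        # workers as a set would then train in data-parallel fashion.
        print(vw, pipeline_schedule(len(vw), num_minibatches=3))
```

In this sketch each virtual worker pairs one large GPU with one small one, and the schedule shows both stages busy at the same time once the pipeline has filled. How parameters are synchronized across virtual workers is deliberately left out.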
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Stochastic Gradient Optimization Techniques
