A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training
Xin Chen, Hua Zhou, Yuxiang Gao, Yu Zhu

TL;DR
This paper introduces Manoa, a peta-scale heterogeneous GPU cluster, and MiMatrix, a job server framework with a novel AllReduce algorithm, GDRAA, to enhance deep learning training efficiency and scalability.
Contribution
It presents a co-designed distributed system with a new AllReduce algorithm that reduces communication bottlenecks in large-scale deep learning training.
Findings
Achieved state-of-the-art performance on ResNet50 and ResNet101 benchmarks.
Proposed a bandwidth-efficient AllReduce algorithm, GDRAA.
Demonstrated effective utilization of a heterogeneous GPU cluster for deep learning.
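For readers unfamiliar with the primitive that GDRAA targets, the following is a minimal sketch of what an AllReduce computes, using the classic ring algorithm. It is not GDRAA, whose details this summary does not give. The function name `ring_allreduce` and the in-process simulation of workers as Python lists are assumptions made purely for illustration.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum AllReduce over n workers arranged in a ring.

    grads[r] is the gradient vector held by worker r; all vectors have
    equal length. Returns the per-worker vectors after the operation,
    each equal to the element-wise sum. Illustrative only (not GDRAA).
    """
    n = len(grads)
    size = len(grads[0])
    # Split the vector into n contiguous chunks (some may be empty).
    bounds = [(i * size) // n for i in range(n + 1)]
    data = [list(g) for g in grads]

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all workers "send" simultaneously.
        sends = []
        for rank in range(n):
            c = (rank - step) % n
            sends.append((c, data[rank][bounds[c]:bounds[c + 1]]))
        for rank in range(n):
            c, payload = sends[(rank - 1) % n]  # receive from left neighbour
            lo = bounds[c]
            for k, v in enumerate(payload):
                data[rank][lo + k] += v

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            sends.append((c, data[rank][bounds[c]:bounds[c + 1]]))
        for rank in range(n):
            c, payload = sends[(rank - 1) % n]
            lo = bounds[c]
            data[rank][lo:lo + len(payload)] = payload

    return data
```

In this ring scheme each worker transmits roughly 2(n-1)/n times the vector size regardless of the number of workers, which is why AllReduce design is usually framed in terms of bandwidth efficiency.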
Abstract
Large-scale deep Convolutional Neural Networks (CNNs) demand ever more computing power, so access to a powerful computing platform is key for researchers who want to advance deep learning (DL). On the other hand, each new generation of commodity GPU cards, the most commonly used accelerators, is more expensive than the last. Consequently, it is important both to design an affordable distributed heterogeneous system that provides high computational capacity and to develop well-suited software that uses that capacity efficiently. In this paper, we present our co-designed distributed system, which includes a peta-scale GPU cluster called "Manoa". Based on the properties and topology of Manoa, we first propose and implement a job server framework named "MiMatrix". The central node of MiMatrix, referred to as the job server, undertakes all of controlling, scheduling and…
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Stochastic Gradient Optimization Techniques
Methods: Convolution
