HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. Panda

TL;DR
HyPar-Flow is a scalable, hybrid-parallel training system for deep neural networks that combines MPI, Keras, and TensorFlow to improve training speed and efficiency across large HPC systems.
Contribution
The paper introduces a model-agnostic, user-transparent system for hybrid-parallel DNN training that addresses key challenges in distributed model definition, cross-process communication, and scalability.
Findings
Achieves up to 1.6x speedup over Horovod data-parallel training.
Provides a 110x speedup over single-node training on 128 nodes of Stampede2.
Attains a 481x speedup over single-node training on 512 nodes of Frontera.
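To put the two large-scale figures in context, a speedup can be converted to parallel efficiency by dividing it by the node count. The short Python sketch below does that arithmetic, assuming both speedups are measured against a single node of the same system; the helper name `parallel_efficiency` is illustrative and not taken from the paper.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear speedup achieved on `nodes` nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# (speedup, node count) pairs as listed in the findings above.
# Assumption: each speedup is relative to one node of the same system.
reported = {"Stampede2": (110.0, 128), "Frontera": (481.0, 512)}

for system, (speedup, nodes) in reported.items():
    eff = parallel_efficiency(speedup, nodes)
    print(f"{system}: {eff:.1%} of linear scaling")
# Stampede2: 85.9% of linear scaling
# Frontera: 93.9% of linear scaling
```

Under that assumption, both runs stay above 85% of ideal linear scaling.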
Abstract
To reduce training time of large-scale DNNs, scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed, several problems exist in realizing model-parallelism and hybrid-parallelism efficiently. Four major problems we focus on are: 1) defining a notion of a distributed model across processes, 2) implementing forward/back-propagation across process boundaries that requires explicit communication, 3) obtaining parallel speedup on an inherently sequential task, and 4) achieving scalability without losing out on a model's accuracy. To address these problems, we create HyPar-Flow, a model-size/-type agnostic, scalable, practical, and user-transparent system for hybrid-parallel training by exploiting MPI, Keras, and TensorFlow. HyPar-Flow provides a single…
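The first two problems in the abstract (a model distributed across processes, and communication across process boundaries) can be pictured with a generic hybrid-parallel layout: processes form a grid of replicas and partitions, each partition owns a contiguous slice of layers, and ranks that hold the same partition in different replicas average their gradients together. The Python sketch below illustrates only that bookkeeping. It is not HyPar-Flow's API; the function names and the row-major rank ordering are assumptions made for illustration.

```python
from typing import List, Tuple


def split_layers(num_layers: int, num_partitions: int) -> List[range]:
    """Split layer indices into contiguous, near-equal partitions."""
    if not 1 <= num_partitions <= num_layers:
        raise ValueError("need 1 <= num_partitions <= num_layers")
    base, extra = divmod(num_layers, num_partitions)
    parts, start = [], 0
    for p in range(num_partitions):
        size = base + (1 if p < extra else 0)  # first `extra` parts get one more layer
        parts.append(range(start, start + size))
        start += size
    return parts


def rank_to_coords(rank: int, num_partitions: int) -> Tuple[int, int]:
    """Map a flat process rank to (replica_id, partition_id), row-major (assumed)."""
    return divmod(rank, num_partitions)


def allreduce_group(rank: int, num_replicas: int, num_partitions: int) -> List[int]:
    """Ranks that hold the same model partition in every replica."""
    _, part = rank_to_coords(rank, num_partitions)
    return [r * num_partitions + part for r in range(num_replicas)]


# Example: 8 processes arranged as 2 replicas x 4 partitions, 10-layer model.
parts = split_layers(10, 4)
replica, part = rank_to_coords(5, 4)
print(f"rank 5 -> replica {replica}, partition {part}, layers {list(parts[part])}")
print("gradient-averaging peers:", allreduce_group(5, 2, 4))
```

In this layout, activations and their gradients would move point-to-point between neighboring partitions inside one replica, while weight gradients would be averaged across the peer group returned by `allreduce_group`.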
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
