Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

Ammar Ahmad Awan, Jeroen Bedorf, Ching-Hsiang Chu, Hari Subramoni, and Dhabaleswar K. Panda

arXiv:1810.11112·cs.DC·November 14, 2019

TL;DR

This paper thoroughly analyzes distributed DNN training methods using TensorFlow and CUDA-aware MPI, proposing a new MPI-based Allreduce design that significantly improves performance and scalability on GPU clusters.

Contribution

The paper provides an in-depth performance characterization of existing distributed training approaches for TensorFlow and introduces a CUDA-aware MPI Allreduce design that improves efficiency and scalability.

Findings

01. No-gRPC approaches outperform gRPC-based methods in most configurations.

02. The performance of No-gRPC methods is heavily dependent on gradient aggregation efficiency.

03. The proposed MPI Allreduce achieves 5-17X better performance than NCCL2 for small/medium messages.
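
The gradient aggregation step that the findings refer to is, in data-parallel training, an Allreduce: every worker contributes its local gradient and every worker ends up with the same averaged result. As a rough illustration of what that operation computes, here is a single-process simulation of a generic ring-style Allreduce (reduce-scatter followed by allgather). It is a minimal sketch, not the paper's CUDA-aware MPI design and not how NCCL2 or Horovod are implemented; the function name `ring_allreduce_average` and the toy gradients are hypothetical and chosen only for this example.

```python
def ring_allreduce_average(grads):
    """Simulate a ring Allreduce that averages one gradient vector per worker.

    grads: list of equal-length lists of floats, one list per simulated worker.
    Returns a new list with one averaged vector per worker.
    Illustrative only: real systems exchange these chunks over a network.
    """
    n = len(grads)
    length = len(grads[0])
    # Chunk c covers indices [bounds[c], bounds[c + 1]).
    bounds = [c * length // n for c in range(n + 1)]
    bufs = [list(g) for g in grads]

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        # Snapshot outgoing data first to emulate simultaneous sends.
        sends = []
        for r in range(n):
            c = (r - s) % n
            sends.append((c, bufs[r][bounds[c]:bounds[c + 1]]))
        for r in range(n):
            dst = (r + 1) % n
            c, data = sends[r]
            for i, v in enumerate(data):
                bufs[dst][bounds[c] + i] += v
    # Now worker r holds the fully summed chunk (r + 1) % n.

    # Phase 2, allgather: in step s, worker r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its own copy.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - s) % n
            sends.append((c, bufs[r][bounds[c]:bounds[c + 1]]))
        for r in range(n):
            dst = (r + 1) % n
            c, data = sends[r]
            bufs[dst][bounds[c]:bounds[c + 1]] = data

    # Turn the sums into averages, as gradient averaging would.
    return [[v / n for v in buf] for buf in bufs]


# Toy example: three simulated workers, five gradient values each.
local_grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [100.0, 200.0, 300.0, 400.0, 500.0],
]
averaged = ring_allreduce_average(local_grads)
print(averaged[0])  # every worker ends with the same averaged vector
```

Each simulated worker sends and receives only one chunk per step, which is why ring-style schemes are commonly used for large messages; the small/medium message range named in finding 03 is where per-step latency tends to matter more, though this sketch does not model timing at all.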

Abstract

TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability…
