Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters

Alexey Svyatkovskiy; Julian Kates-Harbeck; William Tang

arXiv:1912.00286·cs.LG·December 3, 2019

TL;DR

This paper demonstrates efficient distributed training of deep recurrent neural networks using mixed precision on GPU clusters, achieving linear scaling and reduced resource usage without sacrificing accuracy.

Contribution

It introduces a distributed, mixed-precision training method with a new learning rate schedule, enabling scalable training of large RNNs across multiple GPUs.
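As a rough illustration of what synchronous data-parallel training means, the sketch below simulates several workers in plain Python. Each worker computes a gradient on its own data shard, the gradients are averaged (the role an MPI allreduce would play), and every replica applies the same update. The 1-D model, the data, the worker count, and the function names are all assumptions made for illustration; none of this is the authors' TensorFlow and MPI code.

```python
# Toy synchronous data-parallel SGD. Every "worker" holds a replica of the
# weight, computes a gradient on its own shard, and the gradients are averaged
# before one identical update is applied to all replicas.

def grad_mse(w, shard):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker would receive this average."""
    return sum(values) / len(values)

def sync_step(w, shards, lr):
    # In a real cluster these gradients are computed in parallel, one per GPU.
    local_grads = [grad_mse(w, shard) for shard in shards]
    return w - lr * allreduce_mean(local_grads)

# Hypothetical data generated from y = 3x, split evenly across 4 workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]

w = 0.0
for _ in range(200):
    w = sync_step(w, shards, lr=0.01)
print(w)  # approaches the true slope of 3
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, which is why synchronous data parallelism behaves like ordinary SGD with a larger batch.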

Findings

01. Strong-scaling tests show linear runtime scaling and logarithmic communication-time scaling.

02. Half-precision training reduces memory and bandwidth requirements.

03. Models with over 70 million parameters train in mixed precision with accuracy comparable to single precision.
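One way logarithmic communication scaling can arise is through a tree-like allreduce. The sketch below simulates a recursive-doubling sum over a power-of-two number of ranks and counts the communication rounds. The summary does not say which allreduce algorithm the paper's MPI setup uses, so this is only a generic illustration, and the function name is hypothetical.

```python
import math

def recursive_doubling_allreduce(values):
    """Sum-allreduce over a power-of-two number of simulated ranks.

    Returns (per-rank results, number of communication rounds).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two worker count")
    buf = list(values)
    rounds = 0
    distance = 1
    while distance < n:
        # Each rank exchanges with the partner whose index differs in one bit,
        # so the partial sums double in coverage every round.
        buf = [buf[rank] + buf[rank ^ distance] for rank in range(n)]
        distance *= 2
        rounds += 1
    return buf, rounds

for n in (2, 8, 64):
    result, rounds = recursive_doubling_allreduce([1.0] * n)
    assert result == [float(n)] * n
    assert rounds == int(math.log2(n))
    print(n, "workers ->", rounds, "rounds")
```

Going from 2 to 64 workers raises the round count from 1 to 6, so the number of exchanges grows with log2 of the worker count, not with the worker count itself.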

Abstract

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and make use of high-speed interconnects. We introduce a learning rate schedule facilitating neural network convergence at up to $O(100)$ workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and on the benchmark Large Movie Review Dataset (IMDB). Half-precision…
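A common companion to half-precision training is loss scaling, which keeps small gradients from underflowing to zero in 16-bit floats. The abstract excerpt above does not say whether the authors use it, so the following is a generic sketch and not their method. The scale constant and the helper name `to_half` are assumptions; the half-precision rounding uses the standard library's `struct` format `e`.

```python
import struct

def to_half(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 1024.0  # hypothetical constant; a power of two keeps scaling exact

tiny_grad = 1e-8  # below the smallest positive half-precision value (~6e-8)

# Stored directly in half precision, the gradient underflows to zero.
naive = to_half(tiny_grad)

# Scale before the half-precision cast, then unscale in higher precision.
scaled = to_half(tiny_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(naive, recovered)
```

The unscaled value is lost entirely, while the scaled path recovers the gradient to within half-precision rounding error. In practice the scale would multiply the loss before backpropagation, and the weights would be updated in single precision.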
