Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters
Alexey Svyatkovskiy, Julian Kates-Harbeck, William Tang

TL;DR
This paper demonstrates efficient distributed training of deep recurrent neural networks using mixed precision on GPU clusters, achieving linear scaling and reduced resource usage without sacrificing accuracy.
Contribution
It introduces a distributed, mixed-precision training method with a new learning rate schedule, enabling scalable training of large RNNs across multiple GPUs.
Findings
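Strong scaling tests on clusters of NVIDIA Pascal P100 GPUs show linear runtime scaling and logarithmic communication-time scaling, in both single and mixed precision modes.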
Half-precision training reduces memory and bandwidth requirements.
Half-precision training handles state-of-the-art models with over 70 million parameters while reaching accuracy comparable to single precision.
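To make the memory and bandwidth claim concrete, here is a minimal back-of-the-envelope sketch. It assumes exactly 70 million parameters (the summary only says "over 70 million", so this is a lower bound) and counts a single copy of the weights, or equivalently one gradient message per worker. Activations, optimizer state, and any single-precision master copy are ignored. The names `PARAMS`, `BYTES_PER_VALUE`, and `footprint_mb` are illustrative and do not come from the paper.

```python
# Rough size of one copy of the model weights (or one gradient message).
# Assumption: 70 million parameters, the lower bound quoted in the summary.
PARAMS = 70_000_000

# Storage size of one value in each floating-point format.
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def footprint_mb(n_params: int, dtype: str) -> float:
    """Return the size in megabytes (10**6 bytes) of n_params values."""
    return n_params * BYTES_PER_VALUE[dtype] / 1e6


fp32_mb = footprint_mb(PARAMS, "float32")
fp16_mb = footprint_mb(PARAMS, "float16")
print(f"float32: {fp32_mb:.0f} MB, float16: {fp16_mb:.0f} MB")
```

Under these assumptions the half-precision copy is half the size of the single-precision one, which is the saving that applies to every gradient exchange between workers.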
Abstract
In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and to make use of high-speed interconnects. We introduce a learning rate schedule facilitating neural network convergence at up to workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and on the benchmark Large Movie Review Dataset (IMDB). Half-precision…
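The data-parallel, synchronous scheme described in the abstract can be illustrated with a small in-process simulation. This is a sketch under stated assumptions and not the authors' implementation: it uses NumPy instead of TensorFlow and CUDA-aware MPI, a linear least-squares model instead of a recurrent network, a constant learning rate instead of the paper's schedule, and a plain Python loop in place of a real allreduce. The helper names `allreduce_mean` and `train_step` are hypothetical.

```python
import numpy as np


def allreduce_mean(worker_grads, comm_dtype=np.float16):
    """Simulate a synchronous allreduce that averages gradients.

    Each worker's gradient is cast to comm_dtype before "sending";
    the average is accumulated in float32.
    """
    wire = [g.astype(comm_dtype) for g in worker_grads]
    bytes_sent = sum(w.nbytes for w in wire)
    mean = np.mean([w.astype(np.float32) for w in wire], axis=0)
    return mean, bytes_sent


def train_step(weights, shards, lr, comm_dtype=np.float16):
    """One synchronous step: local gradients, averaging, shared update."""
    grads = []
    for X, y in shards:  # each (X, y) pair plays the role of one worker
        residual = X @ weights - y
        grads.append((X.T @ residual / len(y)).astype(np.float32))
    mean_grad, bytes_sent = allreduce_mean(grads, comm_dtype)
    # Every worker applies the same averaged gradient to float32 weights.
    return weights - lr * mean_grad, bytes_sent


# Toy problem (assumed for illustration): recover true_w from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8)).astype(np.float32)
true_w = rng.normal(size=8).astype(np.float32)
y = X @ true_w

n_workers = 4
shards = list(zip(np.split(X, n_workers), np.split(y, n_workers)))

w = np.zeros(8, dtype=np.float32)
for _ in range(200):
    w, bytes_fp16 = train_step(w, shards, lr=0.1, comm_dtype=np.float16)

_, bytes_fp32 = train_step(w, shards, lr=0.1, comm_dtype=np.float32)
print("max weight error:", float(np.max(np.abs(w - true_w))))
print("bytes per step, float16 vs float32:", bytes_fp16, bytes_fp32)
```

Because all shards have equal size, the average of the per-worker gradients equals the full-batch gradient, so every simulated worker stays in sync. Keeping the weights in float32 while communicating in float16 is one common way to combine the two precisions; whether the paper keeps a single-precision master copy is not stated in this summary.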
