Optimizing Performance of Recurrent Neural Networks on GPUs
Jeremy Appleyard, Tomas Kocisky, Phil Blunsom

TL;DR
This paper presents a series of optimization techniques for recurrent neural networks on GPUs, achieving an order of magnitude speedup over a naive implementation by exposing parallelism between operations within the network.
Contribution
The paper describes three stages of optimization (a single cell, a single layer, and the entire network) that have been incorporated into the fifth release of NVIDIA's cuDNN to improve RNN performance on GPUs.
Findings
Achieved an order of magnitude speedup over a naive implementation across a range of network sizes.
Optimizations are applied at three levels: a single cell, a single layer, and the entire network.
The optimizations were incorporated into the fifth release of NVIDIA's cuDNN library.
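
One way to picture the network-level finding: in a stacked RNN, the cell at layer l and time step t needs the output of layer l-1 at step t and its own layer's state from step t-1. Cells with the same value of l + t do not depend on each other, so they could in principle run concurrently. The sketch below only computes that grouping; the function name `wavefront_schedule` is hypothetical, and this is an illustration of the dependency structure, not the cuDNN implementation.

```python
def wavefront_schedule(num_layers, num_steps):
    """Group (layer, step) cells into waves of mutually independent cells.

    Assumes each cell depends only on (layer - 1, step) and (layer, step - 1),
    which holds for a plain unidirectional stacked RNN.
    """
    waves = []
    for wave in range(num_layers + num_steps - 1):
        # Every cell in this wave satisfies layer + step == wave.
        cells = [
            (layer, wave - layer)
            for layer in range(num_layers)
            if 0 <= wave - layer < num_steps
        ]
        waves.append(cells)
    return waves


if __name__ == "__main__":
    schedule = wavefront_schedule(num_layers=4, num_steps=8)
    for index, cells in enumerate(schedule):
        print(f"wave {index}: {cells}")
    total_cells = sum(len(cells) for cells in schedule)
    print(f"{total_cells} cells in {len(schedule)} dependent stages")
```

With 4 layers and 8 time steps, the 32 cells collapse into 11 dependent stages, which is where the extra parallelism for the GPU comes from.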
Abstract
As recurrent neural networks become larger and deeper, training times for single networks are rising into weeks or even months. As such there is a significant incentive to improve the performance and scalability of these networks. While GPUs have become the hardware of choice for training and deploying recurrent models, the implementations employed often make use of only basic optimizations for these architectures. In this article we demonstrate that by exposing parallelism between operations within the network, an order of magnitude speedup across a range of network sizes can be achieved over a naive implementation. We describe three stages of optimization that have been incorporated into the fifth release of NVIDIA's cuDNN: firstly optimizing a single cell, secondly a single layer, and thirdly the entire network.
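
As an illustration of exposing parallelism within a single cell, one commonly used approach is to stack the weight matrices of an LSTM's four gates so that four small matrix products become one larger product. The sketch below checks in plain Python that the two formulations give the same gate pre-activations. The helper names (`matvec`, `gates_separate`, `gates_combined`) are hypothetical, and it is an assumption here that this corresponds to the paper's single-cell stage; the code demonstrates equivalence only, not GPU performance.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gates_separate(gate_weights, x):
    """One small product per gate: several independent, small operations."""
    return [matvec(weights, x) for weights in gate_weights]


def gates_combined(gate_weights, x):
    """Stack all gate matrices and do a single larger product."""
    stacked = [row for weights in gate_weights for row in weights]
    flat = matvec(stacked, x)
    hidden = len(gate_weights[0])  # rows per gate
    # Split the long result back into one slice per gate.
    return [flat[i * hidden:(i + 1) * hidden] for i in range(len(gate_weights))]


if __name__ == "__main__":
    # Four 2x2 gate matrices (illustrative values) and one input vector.
    gate_weights = [
        [[1, 2], [3, 4]],
        [[5, 6], [7, 8]],
        [[0, 1], [1, 0]],
        [[2, 0], [0, 2]],
    ]
    x = [1, -1]
    print(gates_separate(gate_weights, x))
    print(gates_combined(gate_weights, x))
```

On a GPU, the point of such a rewrite is that a single larger product gives the hardware more independent work per launch than several small ones.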
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
