Single stream parallelization of generalized LSTM-like RNNs on a GPU
Kyuyeon Hwang, Wonyong Sung

TL;DR
This paper introduces a graph-based parallelization method for generalized RNNs, including LSTMs, that speeds up training on GPUs by exploiting parallelism within a single training stream and, additionally, across multiple streams.
Contribution
It proposes a novel graph-based RNN structure and an automatic parallelization approach that enhances training speed on GPUs, applicable to various RNN architectures.
Findings
Significant speed-up achieved with a single training stream.
Further acceleration when combining multiple parallel training streams.
Effective parallelization of RNN training on GPU demonstrated.
Abstract
Recurrent neural networks (RNNs) have shown outstanding performance on processing sequence data. However, they suffer from long training time, which demands parallel implementations of the training procedure. Parallelizing the training algorithms for RNNs is very challenging because internal recurrent paths form dependencies between two different time frames. In this paper, we first propose a generalized graph-based RNN structure that covers the most popular long short-term memory (LSTM) network. Then, we present a parallelization approach that automatically explores the parallelism of arbitrary RNNs by analyzing the graph structure. The experimental results show that the proposed approach achieves a large speed-up even with a single training stream, and further accelerates training when combined with multiple parallel training streams.
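To make the idea of finding parallelism by analyzing the graph structure more concrete, here is a minimal sketch in Python. It is not the authors' algorithm. It assumes a hypothetical representation in which each edge carries a delay (0 for a connection within the same time frame, 1 or more for a recurrent connection), and it groups nodes into levels with a plain topological sort over the zero-delay edges. Nodes that land in the same level do not depend on each other within a time frame, so they are candidates for parallel evaluation. The function name `parallel_levels` and the LSTM-like node names are illustrative assumptions.

```python
from collections import defaultdict


def parallel_levels(nodes, edges):
    """Group nodes into levels using only zero-delay (same-frame) edges.

    edges: iterable of (src, dst, delay); delay 0 is a same-frame edge,
    delay >= 1 is a recurrent edge that reads a value from an earlier frame.
    """
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst, delay in edges:
        if delay == 0:  # recurrent edges impose no ordering within a frame
            successors[src].append(dst)
            indegree[dst] += 1

    frontier = [n for n in nodes if indegree[n] == 0]
    levels, visited = [], 0
    while frontier:
        levels.append(sorted(frontier))
        visited += len(frontier)
        next_frontier = []
        for n in frontier:
            for m in successors[n]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    next_frontier.append(m)
        frontier = next_frontier

    if visited != len(nodes):
        raise ValueError("graph has a cycle made only of zero-delay edges")
    return levels


# Hypothetical LSTM-like graph: input x, gates i/f/o, candidate g,
# cell state c, and output h.
nodes = ["x", "i", "f", "o", "g", "c", "h"]
edges = (
    [("x", gate, 0) for gate in "ifog"]    # input feeds all gates
    + [("h", gate, 1) for gate in "ifog"]  # recurrent output feeds all gates
    + [("i", "c", 0), ("f", "c", 0), ("g", "c", 0)]
    + [("c", "c", 1)]                      # cell state recurrence
    + [("c", "h", 0), ("o", "h", 0)]
)

levels = parallel_levels(nodes, edges)
for depth, group in enumerate(levels):
    print(depth, group)
```

In this toy graph the four gate nodes fall into a single level, which is the kind of within-frame parallelism a single training stream can exploit. Parallelism across time frames, which the recurrent edges constrain, is not modeled in this sketch.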
