Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation
Yi Luo, Zhuo Chen, Takuya Yoshioka

TL;DR
This paper introduces the dual-path RNN (DPRNN), a novel architecture that efficiently models extremely long sequences in time-domain speech separation, achieving state-of-the-art results with a smaller model size.
Contribution
The paper proposes DPRNN, a new RNN-based structure that splits long sequences into chunks for effective intra- and inter-chunk processing, improving long sequence modeling in speech separation.
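The chunking idea above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`segment`, `toy_rnn`, `dual_path_block`), the zero-padding, the 50% chunk overlap used in the example, the shared weights, and the plain tanh recurrence standing in for a trained RNN layer are all assumptions made for illustration. It only shows how a long sequence becomes a 3-D tensor, and how one pass runs inside each chunk while a second pass runs across chunks.

```python
import numpy as np

def segment(x, chunk_len, hop):
    """Split x of shape (feat, T) into chunks -> (feat, chunk_len, n_chunks).

    Assumption: the tail is zero-padded so the last chunk is complete.
    """
    feat, T = x.shape
    n_chunks = max(1, int(np.ceil((T - chunk_len) / hop)) + 1)
    padded_len = (n_chunks - 1) * hop + chunk_len
    padded = np.zeros((feat, padded_len), dtype=x.dtype)
    padded[:, :T] = x
    return np.stack(
        [padded[:, s * hop : s * hop + chunk_len] for s in range(n_chunks)],
        axis=2,
    )

def toy_rnn(seq, w_in, w_rec):
    """Hypothetical stand-in for an RNN layer: tanh recurrence over seq (steps, feat)."""
    h = np.zeros(seq.shape[1])
    out = np.empty_like(seq)
    for t in range(seq.shape[0]):
        h = np.tanh(seq[t] @ w_in + h @ w_rec)
        out[t] = h
    return out

def dual_path_block(chunks, w_in, w_rec):
    """One intra-chunk pass followed by one inter-chunk pass (toy weights shared)."""
    feat, K, S = chunks.shape
    out = chunks.copy()
    # Intra-chunk: recurrence over the K positions inside each chunk.
    for s in range(S):
        out[:, :, s] += toy_rnn(out[:, :, s].T, w_in, w_rec).T
    # Inter-chunk: recurrence over the S chunks at each in-chunk position.
    for k in range(K):
        out[:, k, :] += toy_rnn(out[:, k, :].T, w_in, w_rec).T
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1000))            # (feat, T): a long feature sequence
chunks = segment(x, chunk_len=40, hop=20)     # 50% overlap is an assumption here
w_in = 0.1 * rng.standard_normal((8, 8))
w_rec = 0.1 * rng.standard_normal((8, 8))
y = dual_path_block(chunks, w_in, w_rec)
print(chunks.shape, y.shape)
```

In this sketch each recurrence runs over only `chunk_len` or `n_chunks` steps rather than over all `T` time steps, which is the property that makes chunked processing attractive for very long inputs.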
Findings
Achieves state-of-the-art performance on the WSJ0-2mix dataset.
Uses a model 20 times smaller than the previous best system.
Replaces the 1-D CNN in TasNet with DPRNN for better long sequence modeling.
Abstract
Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods. Unlike the time-frequency domain approaches, the time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we propose dual-path recurrent neural network (DPRNN), a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM
