Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation

Yi Luo; Zhuo Chen; Takuya Yoshioka

arXiv:1910.06379 · eess.AS · March 30, 2020 · 50 cites

TL;DR

This paper introduces the dual-path RNN (DPRNN), a novel architecture that efficiently models extremely long sequences in time-domain speech separation, achieving state-of-the-art results with a smaller model size.

Contribution

The paper proposes DPRNN, a new RNN-based structure that splits long sequences into chunks for effective intra- and inter-chunk processing, improving long sequence modeling in speech separation.
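
The chunking idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`segment`, `dual_path`, `overlap_add`), the 50% overlap with zero padding, and the `running_mean` stand-in for a trained RNN are all assumptions made for the example.

```python
def segment(seq, chunk_len, pad_value=0.0):
    """Split seq into 50%-overlapping chunks of length chunk_len (assumed even)."""
    hop = chunk_len // 2
    rest = (-len(seq)) % hop  # extra padding so the length is a multiple of hop
    # Pad one hop on each side so every original sample lands in exactly two chunks.
    padded = [pad_value] * hop + list(seq) + [pad_value] * (rest + hop)
    n_chunks = (len(padded) - chunk_len) // hop + 1
    return [padded[s * hop : s * hop + chunk_len] for s in range(n_chunks)]


def dual_path(chunks, intra_fn, inter_fn):
    """Apply intra_fn within each chunk, then inter_fn across chunks."""
    # Intra-chunk pass: each call sees one chunk (a sequence of length K).
    intra = [intra_fn(chunk) for chunk in chunks]
    # Inter-chunk pass: each call sees one within-chunk position across all
    # S chunks (a sequence of length S).
    k_len = len(chunks[0])
    cols = [inter_fn([row[k] for row in intra]) for k in range(k_len)]
    return [[cols[k][s] for k in range(k_len)] for s in range(len(chunks))]


def overlap_add(chunks, orig_len):
    """Merge 50%-overlapping chunks back into a sequence of length orig_len."""
    hop = len(chunks[0]) // 2
    out = [0.0] * ((len(chunks) + 1) * hop)
    for s, chunk in enumerate(chunks):
        for k, value in enumerate(chunk):
            out[s * hop + k] += value
    # Each original sample was covered by two chunks, so halve the sum
    # and drop the front padding.
    return [v / 2 for v in out[hop : hop + orig_len]]


def running_mean(xs):
    """Hypothetical stand-in for a recurrent layer: a causal running average."""
    out, total = [], 0.0
    for i, x in enumerate(xs, 1):
        total += x
        out.append(total / i)
    return out


signal = [float(i) for i in range(10)]
chunks = segment(signal, chunk_len=4)
processed = dual_path(chunks, running_mean, running_mean)
restored = overlap_add(processed, len(signal))
```

With a chunk length K and S chunks, no single pass ever runs over the full input length: the intra-chunk function sees sequences of length K and the inter-chunk function sees sequences of length S. In a real model both stand-ins would be replaced by trained recurrent layers.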

Findings

1. Achieves state-of-the-art performance on the WSJ0-2mix dataset.
2. Uses 20 times fewer parameters than the previous best system.
3. Replaces the 1-D CNN in TasNet with DPRNN for better long-sequence modeling.

Abstract

Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods. Unlike the time-frequency domain approaches, the time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we propose dual-path recurrent neural network (DPRNN), a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential…
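
The receptive-field limitation of 1-D CNNs mentioned in the abstract can be made concrete with a short calculation. The sketch below uses the standard formula for a stack of stride-1 dilated convolutions; the kernel size, dilation schedule, repeat count, and utterance length are illustrative assumptions, not values taken from this page.

```python
def receptive_field(kernel_size, dilations, repeats=1):
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions."""
    # Each layer with dilation d widens the field by (kernel_size - 1) * d frames.
    return 1 + repeats * sum((kernel_size - 1) * d for d in dilations)


# Assumed configuration: kernel size 3, dilations 1, 2, ..., 128, stack repeated 3 times.
dilations = [2 ** i for i in range(8)]
rf = receptive_field(kernel_size=3, dilations=dilations, repeats=3)

# Assumed input: a 4-second utterance at 8 kHz, encoded at roughly one frame per sample.
n_frames = 4 * 8000
covers_utterance = rf >= n_frames
```

Under these assumptions the stack sees 1531 frames at a time, far fewer than the 32000 frames in the utterance, so `covers_utterance` is false and the network cannot relate the start of the utterance to its end.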

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM