Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Ziqiang Shi, Rujie Liu, Jiqing Han

TL;DR
This paper introduces TasTas, a novel multi-stage dual-path BiLSTM network with auxiliary identity loss, achieving state-of-the-art results in monaural speech separation by iteratively refining separated signals and enforcing speaker identity consistency.
Contribution
The work extends dual-path BiLSTM with iterative multi-stage refinement and an auxiliary speaker identity loss, improving speech separation performance on the WSJ0-2mix benchmark.
Findings
Achieved 20.55 dB SDR improvement on WSJ0-2mix
Attained 20.35 dB scale-invariant SDR (SI-SDR) improvement
Reached an ESTOI (extended short-time objective intelligibility) score of 94.86%
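For readers unfamiliar with the metric, the following is a minimal sketch of how an SI-SDR improvement figure is commonly computed: the SI-SDR of the separated signal minus the SI-SDR of the unprocessed mixture, both measured against the clean reference. The function names, the synthetic noise signals, and the 0.1 leakage factor are illustrative assumptions, not taken from the paper or its evaluation code.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdr_improvement(estimate, reference, mixture):
    """Gain of the separated estimate over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Synthetic stand-ins for two speakers (assumption: white noise, not speech).
rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)
s2 = rng.standard_normal(16000)
mixture = s1 + s2
estimate = s1 + 0.1 * s2  # hypothetical separator output with residual leakage

improvement = si_sdr_improvement(estimate, s1, mixture)
print(f"SI-SDR improvement: {improvement:.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.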
Abstract
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proven very effective in sequence modeling, especially in speech separation. This work investigates how to extend dual-path BiLSTM into a new state-of-the-art approach, called TasTas, for multi-talker monaural speech separation (a.k.a. the cocktail party problem). TasTas introduces two simple but effective improvements to boost the performance of dual-path BiLSTM based networks: an iterative multi-stage refinement scheme, and a speaker identity consistency loss between the separated speech and the original speech, which corrects imperfectly separated speech. TasTas takes the mixed utterance of two speakers and maps it to two separated utterances, where each utterance contains only one speaker's voice. Our experiments on the notable benchmark WSJ0-2mix…
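The two improvements named in the abstract can be pictured as a training objective that sums a separation loss over every refinement stage and adds an identity consistency term on the final output. The sketch below is only an illustration under stated assumptions: it uses negative SI-SDR as the per-stage loss, a toy average-spectrum function as a stand-in for a speaker identity network, cosine distance as the consistency measure, and a hypothetical weight of 0.1. It also handles a single source and omits speaker permutation handling. None of these choices are confirmed by the excerpt above.

```python
import numpy as np

def neg_si_sdr(estimate, reference, eps=1e-8):
    """Negative scale-invariant SDR (dB), usable as a loss to minimize."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return -10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def toy_embedding(signal, frame=256):
    """Hypothetical stand-in for a speaker identity network:
    the magnitude spectrum averaged over non-overlapping frames."""
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def identity_loss(estimate, reference, eps=1e-8):
    """Cosine distance between the embeddings of estimate and reference."""
    a, b = toy_embedding(estimate), toy_embedding(reference)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos

def multi_stage_loss(stage_estimates, reference, identity_weight=0.1):
    """Sum the separation loss over all stages, then add the identity term
    computed on the last stage's output (weight is an assumed value)."""
    separation = sum(neg_si_sdr(est, reference) for est in stage_estimates)
    return separation + identity_weight * identity_loss(stage_estimates[-1], reference)

# Synthetic example: each stage leaves less interference than the previous one.
rng = np.random.default_rng(0)
source = rng.standard_normal(8192)
interference = rng.standard_normal(8192)
refined = [source + g * interference for g in (1.0, 0.5, 0.1)]
unrefined = [source + 1.0 * interference] * 3

loss_refined = multi_stage_loss(refined, source)
loss_unrefined = multi_stage_loss(unrefined, source)
```

Supervising every stage, rather than only the last one, gives each intermediate estimate its own training signal, which is one common motivation for multi-stage schemes of this kind.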
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
