Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training
Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen

TL;DR
This paper introduces a method using recurrent neural networks trained with permutation invariant training to effectively separate and denoise multi-talker speech in noisy environments, demonstrating robustness across noise types and speaker counts.
Contribution
The study presents a novel application of utterance-level permutation invariant training with bi-directional LSTM RNNs for simultaneous speech separation and denoising, capable of handling unknown noise types and varying speaker numbers.
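To make the idea of utterance-level permutation invariant training concrete, here is a minimal NumPy sketch of such a loss. It is an illustration under stated assumptions, not the authors' implementation: the function name `upit_mse`, the `(speakers, frames, frequency_bins)` array layout, and the plain mean-squared-error criterion are all hypothetical choices made for this example. The key point it shows is that the output-to-speaker assignment is chosen once for the whole utterance rather than frame by frame.

```python
import itertools
import numpy as np

def upit_mse(estimates, targets):
    """Utterance-level permutation invariant MSE (illustrative sketch).

    estimates, targets: arrays of shape (S, T, F), i.e. S speakers,
    T frames, F frequency bins. This layout is an assumption.
    Returns the smallest mean error and the permutation achieving it.
    """
    num_speakers = estimates.shape[0]
    # pairwise[i, j]: MSE between output stream i and target speaker j,
    # averaged over the entire utterance (all frames and bins).
    diff = estimates[:, None] - targets[None, :]
    pairwise = np.mean(diff ** 2, axis=(2, 3))

    best_perm, best_loss = None, np.inf
    # Try every assignment of output streams to target speakers and
    # keep the one with the lowest utterance-level error.
    for perm in itertools.permutations(range(num_speakers)):
        loss = np.mean([pairwise[i, perm[i]] for i in range(num_speakers)])
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_loss, best_perm
```

The exhaustive search over permutations grows factorially with the number of speakers, which is acceptable for the two- and three-talker cases typically studied but would need a different assignment strategy for many more sources.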
Findings
LSTM RNNs trained with uPIT significantly improve SDR and ESTOI in noisy conditions.
A single model effectively handles multiple noise types with minimal performance loss.
The approach generalizes well to unseen noise types and different numbers of speakers.
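For readers unfamiliar with the SDR metric mentioned in the findings, the sketch below computes a simplified energy-ratio form of signal-to-distortion ratio in decibels. This is an assumption made for illustration: it treats everything in the estimate that differs from the reference as distortion and does not reproduce the full BSS Eval decomposition that evaluations of this kind commonly rely on. The function name `simple_sdr_db` is hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy.

    reference, estimate: 1-D time-domain signals of equal length.
    Not the BSS Eval SDR; an illustrative approximation only.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum(error ** 2)
    return 10.0 * np.log10(signal_energy / error_energy)
```

An improvement in SDR is then the difference between this value for the processed output and the value for the unprocessed mixture, both measured against the same clean reference.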
Abstract
In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for simultaneous speaker-independent multi-talker speech separation and denoising. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT for single-channel speaker-independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on the generalizability and noise robustness of models that rely on various types of a priori knowledge, e.g., the noise type and the number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker-independent multi-talker speech…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
