Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training

Morten Kolbæk; Dong Yu; Zheng-Hua Tan; Jesper Jensen

arXiv:1708.09588 · cs.SD · December 6, 2018 · 6 citations

TL;DR

This paper introduces a method using recurrent neural networks trained with permutation invariant training to effectively separate and denoise multi-talker speech in noisy environments, demonstrating robustness across noise types and speaker counts.

Contribution

The study presents a novel application of utterance-level permutation invariant training with bi-directional LSTM RNNs for simultaneous speech separation and denoising, capable of handling unknown noise types and varying speaker numbers.

Findings

01. LSTM RNNs trained with uPIT significantly improve SDR and ESTOI in noisy conditions.

02. A single model effectively handles multiple noise types with minimal performance loss.

03. The approach generalizes well to unseen noise types and different numbers of speakers.
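
To make the SDR finding concrete, here is a minimal sketch of how an improvement in dB can be computed. It assumes a simplified energy-ratio definition, 10·log10(signal energy / error energy), on time-aligned sample lists. This is not the full evaluation procedure used in the paper, and the function name and toy signals are hypothetical.

```python
import math


def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy.

    Assumption: reference and estimate are time-aligned lists of samples.
    This is an illustrative energy ratio, not a full separation-evaluation toolkit.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return float("inf")   # perfect reconstruction
    if signal_energy == 0.0:
        return float("-inf")  # silent reference, nonzero error
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signals (hypothetical): a clean reference, a noisy input, an enhanced output.
reference = [1.0, -1.0, 1.0, -1.0]
noisy = [1.5, -0.5, 1.5, -0.5]
enhanced = [1.1, -0.9, 1.1, -0.9]

# SDR improvement is the difference between output SDR and input SDR.
improvement = simple_sdr_db(reference, enhanced) - simple_sdr_db(reference, noisy)
print(f"SDR improvement: {improvement:.2f} dB")
```

Reporting the difference rather than the absolute output value makes results comparable across input conditions with different noise levels.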

Abstract

In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for simultaneous speaker-independent multi-talker speech separation and denoising. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker-independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on generalizability and noise robustness of models that rely on various types of a priori knowledge, e.g., in terms of noise type and number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker-independent multi-talker speech…
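
The core idea of utterance-level permutation invariant training can be sketched as follows: the loss is evaluated for every assignment of network outputs to target speakers over the whole utterance, and the smallest one is used for training. The sketch below assumes a mean squared error on small frame-by-bin matrices and uses hypothetical names (`mse`, `upit_loss`) and toy data. The exact cost function and features used in the paper are not given in this excerpt.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equally sized frame-by-bin matrices."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def upit_loss(estimates, targets):
    """Return (loss, perm) minimizing the utterance-level error.

    perm[i] is the index of the target assigned to estimates[i]. The
    permutation is chosen once for the entire utterance, not per frame.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data (hypothetical): two speakers, two frames, two frequency bins.
targets = [
    [[1.0, 0.0], [1.0, 0.0]],  # speaker 0
    [[0.0, 2.0], [0.0, 2.0]],  # speaker 1
]
# The network emits the speakers in swapped order, with a small error.
estimates = [
    [[0.0, 2.0], [0.0, 2.0]],
    [[1.0, 0.0], [1.0, 0.5]],
]

loss, perm = upit_loss(estimates, targets)
print(loss, perm)  # the swapped assignment (1, 0) gives the lowest loss
```

Enumerating permutations costs S! loss evaluations for S speakers, which is cheap for the two- or three-talker mixtures typically considered.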

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis