Mask-dependent Phase Estimation for Monaural Speaker Separation

Zhaoheng Ni; Michael I Mandel

arXiv:1911.02746 · eess.AS · April 16, 2020 · ICASSP

Open Access

TL;DR

This paper introduces a phase estimation network for monaural speaker separation. The network predicts the phase of the clean speech from a predicted T-F mask, addressing the phase mismatch that arises when a real-valued mask is applied to the mixture STFT.

Contribution

It proposes a mask-dependent permutation invariant training criterion and an inverse mask weighted loss for effective phase prediction in speaker separation.
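A minimal sketch of how the mask-dependent PIT criterion might work, following the abstract's description: the speaker permutation is selected from the T-F mask loss alone and then reused for the phase loss. The function name `mask_dependent_pit`, the array shapes, the mean-squared-error mask loss, and the `1 - cos` phase loss are all assumptions made for illustration, not the losses used in the paper.

```python
from itertools import permutations

import numpy as np


def mask_dependent_pit(mask_est, mask_ref, phase_est, phase_ref):
    """Pick the speaker permutation from the mask loss, reuse it for phase.

    All arrays are assumed to have shape (num_speakers, frames, freq_bins).
    Returns (best_perm, mask_loss, phase_loss), where best_perm[i] is the
    reference speaker assigned to estimate i.
    """
    num_speakers = mask_est.shape[0]
    best_perm, best_mask_loss = None, np.inf

    # Standard PIT search, but scored on the mask loss only.
    for perm in permutations(range(num_speakers)):
        # Placeholder mask loss (MSE); the paper's actual loss may differ.
        loss = np.mean((mask_est - mask_ref[list(perm)]) ** 2)
        if loss < best_mask_loss:
            best_perm, best_mask_loss = perm, loss

    # No second search: the phase targets follow the mask's permutation.
    diff = phase_est - phase_ref[list(best_perm)]
    # Placeholder phase loss that is insensitive to 2*pi wrapping.
    phase_loss = np.mean(1.0 - np.cos(diff))
    return best_perm, best_mask_loss, phase_loss
```

Tying the phase assignment to the mask assignment keeps the two outputs for each speaker consistent, which a separate permutation search for each output would not guarantee.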

Findings

1. Achieves performance comparable to iterative phase reconstruction methods.
2. Simplifies the phase estimation process in monaural speaker separation.
3. Demonstrates effectiveness on the WSJ0-2mix dataset.

Abstract

Speaker separation refers to isolating speech of interest in a multi-talker environment. Most methods apply real-valued Time-Frequency (T-F) masks to the mixture Short-Time Fourier Transform (STFT) to reconstruct the clean speech. Hence there is an unavoidable mismatch between the phase of the reconstruction and the original phase of the clean speech. In this paper, we propose a simple yet effective phase estimation network that predicts the phase of the clean speech based on a T-F mask predicted by a chimera++ network. To overcome the label-permutation problem for both the T-F mask and the phase, we propose a mask-dependent permutation invariant training (PIT) criterion to select the phase signal based on the loss from the T-F mask prediction. We also propose an Inverse Mask Weighted Loss Function for phase prediction to focus the model on the T-F regions in which the phase is more…
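The phase mismatch mentioned at the start of the abstract can be illustrated with a small numerical sketch. Random complex arrays stand in for STFTs here, and the magnitude-ratio mask is just one assumed choice of real-valued mask; none of the names below come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (6, 5)  # toy (frames, freq_bins) grid standing in for an STFT

# Two hypothetical clean complex spectrograms and their mixture.
s1 = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
s2 = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mix = s1 + s2  # the STFT is linear, so the mixture STFT is the sum

# A real-valued, non-negative magnitude-ratio mask for speaker 1 (assumed).
mask = np.abs(s1) / (np.abs(mix) + 1e-8)
est = mask * mix

# The estimate recovers the magnitude of s1 but keeps the mixture phase.
phase_err = np.angle(est * np.conj(s1))  # per-bin phase error in radians
```

Because the mask is real and non-negative, it can only rescale each T-F bin; the phase of `est` is the phase of `mix`, so `phase_err` is generally nonzero wherever the second speaker contributes energy.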

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis