Mask-dependent Phase Estimation for Monaural Speaker Separation
Zhaoheng Ni, Michael I Mandel

TL;DR
This paper introduces a phase estimation network that improves monaural speaker separation by predicting phase information based on a T-F mask, addressing phase mismatch issues in traditional methods.
Contribution
It proposes a mask-dependent permutation invariant training criterion and an inverse mask weighted loss for effective phase prediction in speaker separation.
Findings
Achieves performance comparable to iterative phase reconstruction methods.
Simplifies the phase estimation process in monaural speaker separation.
Demonstrates effectiveness on the WSJ0-2mix dataset.
Abstract
Speaker separation refers to isolating speech of interest in a multi-talker environment. Most methods apply real-valued Time-Frequency (T-F) masks to the mixture Short-Time Fourier Transform (STFT) to reconstruct the clean speech. Hence there is an unavoidable mismatch between the phase of the reconstruction and the original phase of the clean speech. In this paper, we propose a simple yet effective phase estimation network that predicts the phase of the clean speech based on a T-F mask predicted by a chimera++ network. To overcome the label-permutation problem for both the T-F mask and the phase, we propose a mask-dependent permutation invariant training (PIT) criterion to select the phase signal based on the loss from the T-F mask prediction. We also propose an Inverse Mask Weighted Loss Function for phase prediction to focus the model on the T-F regions in which the phase is more…
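To make the mask-dependent PIT idea concrete, here is a minimal NumPy sketch under stated assumptions: the speaker permutation is chosen using only the T-F mask loss, and the phase loss is then evaluated under that same permutation rather than searched separately. The function name `mask_dependent_pit`, the mean-squared-error mask loss, and the cosine-based phase loss are illustrative choices, not the paper's actual losses. The inverse mask weighting is left out because the abstract is truncated before it is fully described.

```python
from itertools import permutations

import numpy as np


def mask_dependent_pit(mask_pred, mask_ref, phase_pred, phase_ref):
    """Choose the speaker permutation from the mask loss, then reuse it for phase.

    All arrays have shape (speakers, frames, freq_bins). The returned
    permutation `perm` means predicted output perm[i] is matched to
    reference speaker i.
    """
    n_speakers = mask_pred.shape[0]
    best_perm, best_mask_loss = None, np.inf

    # Search permutations using the mask loss only (assumed here to be MSE).
    for perm in permutations(range(n_speakers)):
        loss = np.mean((mask_pred[list(perm)] - mask_ref) ** 2)
        if loss < best_mask_loss:
            best_perm, best_mask_loss = perm, loss

    # Evaluate the phase loss under the permutation picked by the mask.
    # 1 - cos(difference) is an assumed loss that is insensitive to 2*pi wraps.
    phase_diff = phase_pred[list(best_perm)] - phase_ref
    phase_loss = np.mean(1.0 - np.cos(phase_diff))

    return best_perm, best_mask_loss, phase_loss
```

Tying the phase assignment to the mask assignment keeps the two outputs consistent for each speaker, which is the motivation the abstract gives for selecting the phase signal from the mask loss.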
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
