Personalized Speech Enhancement: New Models and Comprehensive Evaluation
Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang

TL;DR
This paper introduces new neural network models for personalized speech enhancement that outperform previous methods, along with comprehensive evaluation metrics and a multi-task training approach to improve speech quality and recognition in video conferencing.
Contribution
The work presents two novel neural network models for personalized speech enhancement (PSE), a new metric for the target speaker over-suppression (TSOS) problem, and a multi-task training framework with a speech recognition back-end, advancing the state of the art in PSE.
Findings
Proposed models outperform VoiceFilter in speech quality and recognition accuracy.
New TSOS metric effectively measures over-suppression issues.
Multi-task training reduces TSOS and enhances speech recognition.
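To make the over-suppression idea concrete, the following is a minimal illustrative sketch, not the paper's TSOS definition (this summary does not give the formula). It assumes magnitude spectrograms of the clean target speech and of the enhanced output are available, and it counts a frame as over-suppressed when the energy removed from the target exceeds a fraction of that frame's reference energy. The function name `over_suppression_rate`, the `threshold` value, and the normalization are all assumptions made for illustration.

```python
import numpy as np


def over_suppression_rate(ref_mag, enh_mag, threshold=0.1, eps=1e-8):
    """Fraction of frames where the enhanced signal falls well below the reference.

    ref_mag, enh_mag: arrays of shape (frames, freq_bins) holding non-negative
    magnitudes of the clean target speech and of the enhanced output.
    threshold: hypothetical per-frame ratio above which a frame is counted
    as over-suppressed (an assumption, not a value from the paper).
    """
    ref_mag = np.asarray(ref_mag, dtype=float)
    enh_mag = np.asarray(enh_mag, dtype=float)

    # Keep only the part where the enhanced magnitude is below the reference,
    # i.e. target speech energy that the model removed.
    deficit = np.maximum(ref_mag - enh_mag, 0.0) ** 2

    # Normalize the removed energy by each frame's reference energy.
    ratio = deficit.sum(axis=1) / (np.sum(ref_mag ** 2, axis=1) + eps)

    # Report the share of frames whose normalized deficit exceeds the threshold.
    return float(np.mean(ratio > threshold))


if __name__ == "__main__":
    ref = np.ones((4, 3))
    enh = ref.copy()
    enh[1] = 0.0  # one of four frames is fully suppressed
    print(over_suppression_rate(ref, enh))
```

A measure of this shape ignores frames where the output is louder than the reference, so it isolates removal of the target speaker from residual noise or interference, which is the failure mode the Findings above refer to.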
Abstract
Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed VoiceFilter. In addition, we create test sets that capture a variety of scenarios that users can encounter during video conferencing. Furthermore, we propose a new metric to measure the target speaker over-suppression (TSOS) problem, which was not sufficiently investigated before despite its critical importance in deployment. Besides, we propose multi-task training with a speech recognition back-end. Our results show that the proposed models can yield better speech recognition accuracy, speech…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Test
