Effect of noise suppression losses on speech distortion and ASR performance
Sebastian Braun, Hannes Gamper

TL;DR
This paper investigates how different noise suppression loss functions affect speech distortion and ASR performance. It finds that pre-trained MOS and WER predictors add little over a strong complex spectral loss baseline.
Contribution
The study analyzes how the magnitude and phase-aware terms of the spectral complex compressed MSE loss relate to the trade-off between speech distortion and noise reduction, and tests whether pre-trained MOS and WER predictors improve speech enhancement and ASR results.
Findings
The complex spectral MSE loss influences the trade-off between speech distortion and noise reduction.
Pre-trained MOS and WER predictors do not significantly improve speech quality or recognition.
Spectral loss remains a strong baseline for speech enhancement tasks.
Abstract
Deep learning-based speech enhancement has made rapid development towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, the speech quality scales directly with the model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade-off. We further investigate integrating pre-trained reference-less predictors for mean opinion score (MOS) and word error rate (WER), and pre-trained embeddings on ASR and sound event…
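To make the "magnitude and phase-aware terms" mentioned in the abstract concrete, here is a minimal NumPy sketch of one common form of a compressed complex spectral MSE loss. It is an illustration, not the authors' implementation: the function name `compressed_complex_mse`, the compression exponent `c`, the blending weight `lam`, and their default values are assumptions that do not come from this summary.

```python
import numpy as np


def compressed_complex_mse(S, S_hat, c=0.3, lam=0.3):
    """Blend a magnitude-only term with a phase-aware (complex) term.

    S, S_hat: complex STFT arrays of equal shape (clean and enhanced speech).
    c:   compression exponent applied to magnitudes (assumed value).
    lam: weight of the phase-aware term, in [0, 1] (assumed value).
    """
    mag_c = np.abs(S) ** c
    mag_hat_c = np.abs(S_hat) ** c

    # Compressed spectra: compressed magnitude combined with the original phase.
    S_c = mag_c * np.exp(1j * np.angle(S))
    S_hat_c = mag_hat_c * np.exp(1j * np.angle(S_hat))

    magnitude_term = np.mean((mag_c - mag_hat_c) ** 2)   # ignores phase
    complex_term = np.mean(np.abs(S_c - S_hat_c) ** 2)   # penalizes phase errors
    return (1.0 - lam) * magnitude_term + lam * complex_term


# Usage: a pure phase rotation leaves the magnitude term near zero,
# so only the phase-aware term contributes to the loss.
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
S_rotated = S * np.exp(1j * 0.5)

print(compressed_complex_mse(S, S_rotated, lam=0.0))  # magnitude term only
print(compressed_complex_mse(S, S_rotated, lam=1.0))  # phase-aware term only
```

In this sketch, `lam` is the knob that shifts weight between the magnitude term and the phase-aware term, which is the kind of weighting the abstract ties to the speech distortion vs. noise reduction trade-off.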
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
