On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement
Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen

TL;DR
This paper investigates how different loss functions affect the performance of deep learning-based monaural time-domain speech enhancement, highlighting the advantages of perceptually inspired losses and SI-SDR for better speech quality.
Contribution
It provides a comprehensive analysis of loss functions in time-domain speech enhancement, emphasizing perceptual losses, the importance of learning rate, and the effectiveness of SI-SDR as a general-purpose loss.
Findings
Perceptually inspired loss functions may improve speech quality for human listeners.
The learning rate significantly impacts training effectiveness in speech enhancement models.
SI-SDR-based loss performs well across multiple evaluation metrics.
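For readers unfamiliar with SI-SDR (scale-invariant signal-to-distortion ratio), the following is a minimal NumPy sketch of the commonly used definition: the estimate is projected onto the target, and the energy of that projection is compared with the energy of the residual. The function names `si_sdr` and `si_sdr_loss`, the `eps` stabilizer, and the choice to skip mean removal are assumptions made for illustration. This is not necessarily the exact formulation or implementation used in the paper.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D time-domain signals.

    Hypothetical helper for illustration; `eps` is an assumed numerical
    stabilizer and the signals are assumed to be (roughly) zero-mean.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Optimal scaling of the target so that it best matches the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # part of the estimate explained by the target
    residual = estimate - projection     # everything else counts as distortion

    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_loss(estimate, target):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)   # stand-in for one second of speech at 16 kHz
    noise = rng.standard_normal(16000)
    enhanced = clean + 0.1 * noise
    # Rescaling the estimate leaves the score (almost exactly) unchanged.
    print(si_sdr(enhanced, clean), si_sdr(3.0 * enhanced, clean))
```

Because the target is rescaled before the comparison, multiplying the estimate by any nonzero gain does not change the score, which is the property that distinguishes SI-SDR from a plain MSE on the waveform.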
Abstract
Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous if the receiver is the human auditory system. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally…
