Exploring the Best Loss Function for DNN-Based Low-latency Speech Enhancement with Temporal Convolutional Networks
Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, Bhiksha Raj

TL;DR
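This paper compares loss functions and methods for DNN-based speech enhancement on two datasets. It proposes an STFT-based method with a loss function that uses problem-agnostic speech encoder (PASE) features, along with a low-latency TasNet variant, and reports results that compare favorably to other state-of-the-art methods on the Voice Bank + DEMAND dataset.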
Contribution
It introduces an STFT-based method, a loss function that uses PASE features, and a low-latency TasNet model for DNN-based speech enhancement.
Findings
The proposed STFT-based method with a PASE-feature loss improves subjective quality on the smaller of the two datasets.
Low-latency TasNet achieves excellent performance in the DNS Challenge.
The proposed methods compare favorably to other state-of-the-art methods on the Voice Bank + DEMAND dataset.
Abstract
Recently, deep neural networks (DNNs) have been successfully used for speech enhancement, and DNN-based speech enhancement is becoming an attractive research area. While time-frequency masking based on the short-time Fourier transform (STFT) has been widely used for DNN-based speech enhancement in recent years, time-domain methods such as the time-domain audio separation network (TasNet) have also been proposed. The most suitable method depends on the scale of the dataset and the type of task. In this paper, we explore the best speech enhancement algorithm on two different datasets. We propose an STFT-based method and a loss function using problem-agnostic speech encoder (PASE) features to improve subjective quality for the smaller dataset. Our proposed methods are effective on the Voice Bank + DEMAND dataset and compare favorably to other state-of-the-art methods. We also implement…
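To make the idea of a feature-based loss concrete, the following is a minimal sketch of a loss that adds a feature-space distance to an STFT magnitude distance. It is not the paper's implementation: `toy_encoder` is a hypothetical hand-crafted stand-in for a pretrained encoder such as PASE, and the frame sizes, the L1 distance, and the weight `alpha` are all assumptions chosen only for illustration.

```python
import cmath
import math
import random


def stft_mag(signal, frame_len=16, hop=8):
    """Magnitude STFT: Hann-windowed frames, one-sided DFT per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames


def toy_encoder(signal, frame_len=16, hop=8):
    """HYPOTHETICAL stand-in for a pretrained speech encoder (e.g. PASE).

    Returns two hand-crafted features per frame (log energy and
    zero-crossing rate); a real encoder would be a learned network.
    """
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = signal[start:start + frame_len]
        energy = math.log(sum(x * x for x in seg) / frame_len + 1e-8)
        zcr = sum(1 for a, b in zip(seg, seg[1:]) if a * b < 0) / (frame_len - 1)
        feats.append([energy, zcr])
    return feats


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally shaped 2-D lists."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def combined_loss(estimate, clean, alpha=0.5):
    """Spectral L1 term plus a weighted feature-space L1 term (assumed form)."""
    spectral = mean_abs_diff(stft_mag(estimate), stft_mag(clean))
    feature = mean_abs_diff(toy_encoder(estimate), toy_encoder(clean))
    return spectral + alpha * feature


# Synthetic example: a sine wave as "clean speech" plus Gaussian noise.
random.seed(0)
clean = [math.sin(2 * math.pi * 0.05 * n) for n in range(128)]
noise = [random.gauss(0.0, 1.0) for _ in range(128)]
slightly_noisy = [c + 0.05 * v for c, v in zip(clean, noise)]
very_noisy = [c + 0.5 * v for c, v in zip(clean, noise)]

loss_clean = combined_loss(clean, clean)
loss_small = combined_loss(slightly_noisy, clean)
loss_large = combined_loss(very_noisy, clean)
```

In this sketch the loss is zero for a perfect estimate and grows as the estimate gets noisier. In a training setup the feature extractor would typically be a frozen pretrained network and the loss would be computed with an automatic-differentiation framework rather than plain Python lists.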
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
