Training Speech Enhancement Systems with Noisy Speech Datasets

Koichi Saito; Stefan Uhlich; Giorgio Fabbro; Yuki Mitsufuji

arXiv:2105.12315 · eess.AS · May 27, 2021 · 6 cites

Open Access

TL;DR

This paper introduces methods for training speech enhancement systems on noisy speech datasets: modified loss functions and a noise augmentation scheme allow effective training without clean speech data and improve PESQ scores.

Contribution

The paper presents novel loss function modifications and a noise augmentation scheme for MixIT that allow speech enhancement systems to be trained on noisy speech data instead of clean speech only.

Findings

01. Improved PESQ scores by up to 0.19 with robust loss functions.

02. Improved PESQ scores by up to 0.27 using noise augmentation in MixIT.

03. Demonstrated effectiveness on the Mozilla Common Voice dataset.
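
For readers unfamiliar with MixIT (mixture-invariant training), the following is a minimal NumPy sketch of the generic MixIT loss, not the authors' implementation. In MixIT, two mixtures are summed and fed to a separation model, and the loss searches for the assignment of the model's output sources to the two original mixtures that reconstructs them best. The function names `neg_snr` and `mixit_loss`, the negative-SNR loss, and the brute-force search over assignments are assumptions made for illustration; the paper's noise augmentation scheme is not detailed on this page and is not reproduced here.

```python
import itertools

import numpy as np


def neg_snr(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better)."""
    error = reference - estimate
    ratio = (np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps)
    return -10.0 * np.log10(ratio)


def mixit_loss(mix1, mix2, est_sources):
    """Generic MixIT loss (illustrative sketch).

    mix1, mix2:   arrays of shape (T,), the two input mixtures.
    est_sources:  array of shape (M, T), the sources a model estimated
                  from the sum mix1 + mix2.

    Every estimated source is assigned to exactly one of the two
    mixtures; the assignment with the lowest total loss is returned.
    """
    num_sources = est_sources.shape[0]
    best_loss, best_assignment = np.inf, None
    # Brute-force search over all 2**M binary assignments.
    for assignment in itertools.product((0, 1), repeat=num_sources):
        to_mix2 = np.array(assignment, dtype=bool)
        remix1 = est_sources[~to_mix2].sum(axis=0)
        remix2 = est_sources[to_mix2].sum(axis=0)
        loss = neg_snr(mix1, remix1) + neg_snr(mix2, remix2)
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment
```

The brute-force search is only practical for a small number of output sources, since the number of candidate assignments grows as 2^M.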

Abstract

Recently, deep neural network (DNN)-based speech enhancement (SE) systems have been used with great success. During training, such systems require clean speech data, ideally in large quantity, with a variety of acoustic conditions and many different speaker characteristics, and at a given sampling rate (e.g., 48 kHz for fullband SE). However, obtaining such clean speech data is not straightforward, especially if only publicly available datasets are considered. At the same time, a lot of material for automatic speech recognition (ASR) with the desired acoustic, speaker, and sampling rate characteristics is publicly available, except that it is not clean: it also contains background noise, which is often even desired in order to make ASR systems noise-robust. Hence, using such data to train SE systems is not straightforward. In this paper, we propose two improvements to train SE systems on…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing