End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks

Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, and Hisashi Kawai

arXiv:1709.03658 · stat.ML · March 16, 2018 · 20 cites

TL;DR

This paper introduces an end-to-end speech enhancement method using fully convolutional neural networks that directly optimizes evaluation metrics like STOI, leading to improved speech intelligibility and ASR performance.

Contribution

It proposes an utterance-based optimization framework using FCNs that aligns the training objective with the evaluation metric, making enhancement more effective.

Findings

01 STOI scores improve relative to conventional MMSE-optimized speech

02 Enhanced speech is more intelligible to human listeners

03 Automatic speech recognition accuracy increases substantially

Abstract

A speech enhancement model is used to map noisy speech to clean speech. In the training stage, an objective function is often adopted to optimize the model parameters. However, in most studies, there is an inconsistency between the model optimization criterion and the evaluation criterion on the enhanced speech. For example, in measuring speech intelligibility, most evaluation metrics are based on the short-time objective intelligibility (STOI) measure, while the frame-based minimum mean square error (MMSE) between estimated and clean speech is widely used in optimizing the model. Because of this inconsistency, there is no guarantee that the trained model can provide optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and…
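
The mismatch the abstract describes can be made concrete with a small sketch. The Python below is neither the paper's method nor a real STOI implementation: `envelope_correlation` is a hypothetical toy proxy that correlates short-time energy envelopes over multi-frame segments, and the frame length, hop, and segment size are illustrative assumptions. It only shows that a sample-wise squared error and a correlation-based score computed over longer spans are different functions of the same pair of waveforms.

```python
import numpy as np

def mse(clean, enhanced):
    """Sample-wise mean squared error between two equal-length waveforms."""
    return float(np.mean((clean - enhanced) ** 2))

def frame_rms(x, frame_len=256, hop=128):
    """RMS energy of each overlapping frame (sizes are assumptions)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

def envelope_correlation(clean, enhanced, seg=30, eps=1e-8):
    """Toy intelligibility proxy (NOT STOI): mean correlation of the two
    energy envelopes over sliding segments of `seg` frames."""
    c, e = frame_rms(clean), frame_rms(enhanced)
    scores = []
    for start in range(len(c) - seg + 1):
        # Remove each segment's mean so the dot product is a correlation.
        cs = c[start : start + seg] - c[start : start + seg].mean()
        es = e[start : start + seg] - e[start : start + seg].mean()
        scores.append(cs @ es / (np.linalg.norm(cs) * np.linalg.norm(es) + eps))
    return float(np.mean(scores))

# Synthetic stand-in for speech: noise with a slow amplitude modulation.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = rng.standard_normal(16000) * (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t))
noisy = clean + 0.5 * rng.standard_normal(16000)

print("MSE:", mse(clean, noisy))
print("envelope correlation:", envelope_correlation(clean, noisy))
```

The first score depends only on per-sample differences, while the second depends on how the envelopes co-vary across many frames, so minimizing one does not necessarily maximize the other. To serve as a training objective, a score of the second kind would also have to be differentiable and evaluated over spans much longer than a single frame.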

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation

Methods: Max Pooling · Convolution · Fully Convolutional Network