End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks
Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, and Hisashi Kawai

TL;DR
This paper introduces an end-to-end speech enhancement method using fully convolutional neural networks that directly optimizes evaluation metrics like STOI, leading to improved speech intelligibility and ASR performance.
Contribution
It proposes a novel utterance-based optimization framework with FCNs that aligns the training objective with the evaluation metric, making speech enhancement more effective.
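Why an FCN suits utterance-level training can be sketched in a few lines. A network built only from convolutions has no fixed-size dense layer, so it can take a whole utterance of any length and return a waveform of the same length. The sketch below is illustrative only: the names `conv1d_same` and `tiny_fcn`, the kernel values, and the tanh activation are assumptions for this example, not the architecture used in the paper.

```python
import math


def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal with zero padding,
    so the output has the same number of samples as the input."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def tiny_fcn(waveform, kernels):
    """Hypothetical stack of convolution + tanh layers.
    There is no dense layer, so any input length is accepted."""
    x = list(waveform)
    for kernel in kernels:
        x = [math.tanh(v) for v in conv1d_same(x, kernel)]
    return x


# Assumed toy kernels (odd lengths); a real model would learn these.
kernels = [[0.25, 0.5, 0.25], [-0.1, 0.0, 1.0, 0.0, -0.1]]

short_utterance = [math.sin(0.1 * t) for t in range(50)]
long_utterance = [math.sin(0.1 * t) for t in range(1234)]

# The same weights process both utterances; output length follows input length.
short_out = tiny_fcn(short_utterance, kernels)
long_out = tiny_fcn(long_utterance, kernels)
print(len(short_out), len(long_out))
```

Because the output covers the entire utterance, an objective computed over the whole utterance can be applied to it directly, which is the property the framework relies on.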
Findings
STOI scores improve over those of speech enhanced with a conventional MMSE-optimized model
Enhanced speech shows better intelligibility for human listeners
Automatic speech recognition accuracy is substantially increased
Abstract
A speech enhancement model maps noisy speech to clean speech. In the training stage, an objective function is adopted to optimize the model parameters. However, in most studies, the model optimization criterion is inconsistent with the criterion used to evaluate the enhanced speech. For example, speech intelligibility is mostly evaluated with the short-time objective intelligibility (STOI) measure, while the frame-based minimum mean square error (MMSE) between estimated and clean speech is widely used to optimize the model. Because of this inconsistency, there is no guarantee that the trained model provides optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and…
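The inconsistency between a sample-wise squared-error objective and a correlation-style intelligibility measure can be illustrated with a toy example. The sketch below is not STOI: real STOI works on one-third octave band envelopes with normalization and clipping, whereas `segment_correlation_score` here is a simplified stand-in that averages the correlation over short time segments of the raw waveform. The signals, the segment length, and all function names are assumptions chosen for illustration.

```python
import math


def mse(clean, est):
    """Mean squared error between two equal-length signals."""
    return sum((c - e) ** 2 for c, e in zip(clean, est)) / len(clean)


def correlation(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant segment."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den > 0 else 0.0


def segment_correlation_score(clean, est, seg_len=64):
    """Simplified stand-in for an intelligibility measure (NOT real STOI):
    mean correlation over non-overlapping short-time segments."""
    starts = range(0, len(clean) - seg_len + 1, seg_len)
    scores = [correlation(clean[s:s + seg_len], est[s:s + seg_len]) for s in starts]
    return sum(scores) / len(scores)


n = 512
clean = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]

# Estimate A: the clean signal at half amplitude (shape preserved, level wrong).
scaled = [0.5 * c for c in clean]
# Estimate B: correct level, but with an added high-frequency interferer.
noisy = [c + 0.3 * math.sin(2 * math.pi * 100 * t / n) for t, c in enumerate(clean)]

print("MSE    scaled:", mse(clean, scaled), " noisy:", mse(clean, noisy))
print("Score  scaled:", segment_correlation_score(clean, scaled),
      " noisy:", segment_correlation_score(clean, noisy))
```

In this toy case the squared error ranks the noisy estimate as better, while the correlation-based score ranks the scaled estimate as better, so a model trained to minimize one criterion is not guaranteed to do well on the other.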
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation
Methods: Max Pooling · Convolution · Fully Convolutional Network
