Time-Domain Speech Enhancement for Robust Automatic Speech Recognition

Yufeng Yang; Ashutosh Pandey; DeLiang Wang

arXiv:2210.13318 · eess.AS · June 22, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a time-domain speech enhancement model based on an attentive recurrent network (ARN) that improves automatic speech recognition accuracy in noisy conditions. Enhancement is fully decoupled from the acoustic model, which is trained only on clean speech, and the system reaches 6.28% average WER on CHiME-2.

Contribution

The paper presents an ARN (attentive recurrent network) based time-domain enhancement frontend that is fully decoupled from the acoustic model, which is trained only on clean speech, improving the robustness of ASR in noisy conditions.

Findings

01

Achieved 6.28% average WER on CHiME-2

02

Outperformed the previous best result by 19.3% relative

03

Demonstrated effectiveness of decoupled enhancement in noisy ASR
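
The findings above are stated in terms of word error rate (WER) and a relative improvement. As a minimal sketch of how such figures are usually computed, the following Python uses a standard word-level edit distance and a relative-reduction formula. The function names `word_error_rate` and `relative_reduction` are hypothetical, the example sentences and numbers are illustrative only, and this is not the scoring code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative inputs only, not data from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.2%}")
    print(f"Relative reduction: {relative_reduction(10.0, 8.0):.1%}")
```

A relative figure such as the one reported here compares two WERs by ratio, so it should not be read as a difference in absolute percentage points.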

Abstract

It has been shown that the intelligibility of noisy speech can be improved by speech enhancement algorithms. However, speech enhancement has not been established as an effective frontend for robust automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between speech enhancement and ASR impedes the progress of robust ASR systems especially as speech enhancement has made big strides in recent years. In this work, we focus on eliminating this divide with an ARN (attentive recurrent network) based time-domain enhancement model. The proposed system fully decouples speech enhancement and an acoustic model trained only on clean speech. Results on the CHiME-2 corpus show that ARN enhanced speech translates to improved ASR results. The proposed system achieves $6.28\%$ average word error rate, outperforming the previous best…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies