Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Yufeng Yang, Ashutosh Pandey, and DeLiang Wang

TL;DR
This paper proposes decoupling speech enhancement from recognition in monaural robust ASR, using ARN and CrossNet enhancement frontends with an ASR backend trained only on clean speech. Enhanced speech from both frontends improves recognition in noisy and reverberant environments.
Contribution
It introduces ARN and CrossNet enhancement models that fully decouple frontend enhancement from backend ASR, achieving state-of-the-art results with an ASR backend trained only on clean speech.
Findings
Outperforms baseline models trained on corrupted speech.
Reduces WER on CHiME-2 by 28.4% relative.
Achieves low WER on CHiME-4 without training on its data.
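The CHiME-2 finding above is stated as a relative reduction, that is, the drop in word error rate (WER) divided by the baseline WER. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WERs in the same unit (here, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only; NOT figures from the paper.
baseline = 10.0   # assumed baseline WER in percent
improved = 7.5    # assumed WER of the improved system in percent

reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.1f}% relative")  # prints "25.0% relative"
```

Note that a relative reduction differs from an absolute one: in this hypothetical case the absolute drop is 2.5 percentage points, while the relative drop is 25%.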
Abstract
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with two enhancement models: ARN (attentive recurrent network), which operates in the time domain, and CrossNet, which operates in the time-frequency domain. The proposed systems fully decouple frontend enhancement and backend ASR, with the backend trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize…
Taxonomy
Topics: Ultrasonics and Acoustic Wave Propagation · Biometric Identification and Security
