Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition
Kuan-Chen Wang, You-Jin Li, Wei-Lun Chen, Yu-Wen Chen, Yi-Ching Wang, Ping-Cheng Yeh, Chao Zhang, and Yu Tsao

TL;DR
This paper presents a simple post-processing technique that effectively integrates pre-trained speech enhancement and recognition models, significantly improving robustness in noisy conditions without requiring additional training.
Contribution
A lightweight bridge module and observation addition technique are introduced to seamlessly connect pre-trained SE and ASR models, enhancing noise robustness without fine-tuning.
Findings
Significant improvement in ASR robustness across datasets.
No prior knowledge of speech content needed during training or inference.
Method is compatible with diverse pre-trained models and datasets.
Abstract
Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the use of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. A bridge module, which is a lightweight NN, is proposed to evaluate the signal-level information of the speech signal. Subsequently, using the signal-level information, the observation addition technique is applied to effectively reduce the shortcomings of SE. The experimental results demonstrate the success of our method in integrating diverse pre-trained SE and ASR…
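To make the observation addition idea concrete, here is a minimal sketch that blends the SE output back with the original noisy signal. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear interpolation form, and the mapping from an estimated SNR to a blending weight (including the `low_db` and `high_db` thresholds) are all hypothetical. In the paper the signal-level information comes from the bridge module; here it is simply passed in as a number.

```python
def observation_addition(noisy, enhanced, weight):
    """Blend enhanced and noisy waveforms sample by sample.

    weight = 1.0 returns the SE output unchanged; weight = 0.0 returns
    the original noisy observation. (Assumed interpolation form.)
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * e + (1.0 - weight) * n for n, e in zip(noisy, enhanced)]


def weight_from_snr(snr_db, low_db=0.0, high_db=20.0):
    """Hypothetical mapping from an estimated SNR (dB) to a blending weight.

    Assumption: at low SNR the SE output is trusted fully, at high SNR the
    noisy input is already clean enough, and the weight falls linearly
    in between. The thresholds are illustrative, not from the paper.
    """
    position = (snr_db - low_db) / (high_db - low_db)
    position = min(max(position, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - position


if __name__ == "__main__":
    noisy = [1.0, -1.0, 0.5]      # toy waveform samples
    enhanced = [0.0, 0.0, 0.5]    # toy SE output for the same samples
    w = weight_from_snr(10.0)     # stand-in for the bridge module's estimate
    print(w, observation_addition(noisy, enhanced, w))
```

In this sketch the blended signal would be fed to the ASR model in place of the raw SE output, so neither pre-trained model needs to be fine-tuned.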
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
