Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

Kuan-Chen Wang, You-Jin Li, Wei-Lun Chen, Yu-Wen Chen, Yi-Ching Wang, Ping-Cheng Yeh, Chao Zhang, and Yu Tsao

arXiv:2406.12699·cs.SD·June 19, 2024

Open Access

TL;DR

This paper presents a simple post-processing technique that integrates pre-trained speech enhancement and recognition models, improving robustness in noisy conditions without fine-tuning either pre-trained model.

Contribution

A lightweight bridge module and observation addition technique are introduced to seamlessly connect pre-trained SE and ASR models, enhancing noise robustness without fine-tuning.

Findings

01. Significant improvement in ASR robustness across datasets.

02. No prior knowledge of speech content needed during training or inference.

03. Method is compatible with diverse pre-trained models and datasets.

Abstract

Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the use of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. A bridge module, which is a lightweight NN, is proposed to evaluate the signal-level information of the speech signal. Subsequently, using the signal-level information, the observation addition technique is applied to effectively reduce the shortcomings of SE. The experimental results demonstrate the success of our method in integrating diverse pre-trained SE and ASR…
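
The observation addition step described in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which the noisy observation is mixed back into the enhanced signal as a weighted sum. The abstract does not give the exact formula, so the function names (`oa_weight`, `observation_addition`), the sigmoid mapping from an estimated SNR to a mixing weight, and the `midpoint_db` and `slope` parameters are all hypothetical stand-ins for the bridge module's output, not the authors' implementation.

```python
import math


def oa_weight(snr_db, midpoint_db=10.0, slope=0.5):
    """Hypothetical mapping from an estimated SNR (dB) to a mixing weight.

    Assumption: the cleaner the input already is, the more of the noisy
    observation is kept, since SE artifacts matter more than residual noise.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def observation_addition(noisy, enhanced, weight):
    """Mix the noisy observation back into the enhanced signal.

    weight = 0.0 returns the enhanced signal unchanged;
    weight = 1.0 returns the original noisy observation.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * n + (1.0 - weight) * e for n, e in zip(noisy, enhanced)]


# Toy usage: in practice the SNR estimate would come from a trained bridge
# module and the mixed signal would be passed to the pre-trained ASR model.
noisy = [0.2, -0.4, 0.6]
enhanced = [0.1, -0.3, 0.5]
mixed = observation_addition(noisy, enhanced, oa_weight(snr_db=5.0))
```

In this sketch the SE and ASR models are untouched; only the scalar weight changes per utterance, which is what lets a small module sit between two independently trained systems.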

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing