Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module
Zhongjian Cui, Chenrui Cui, Tianrui Wang, Mengnan He, Hao Shi, Meng Ge, Caixia Gong, Longbiao Wang, Jianwu Dang

TL;DR
This paper introduces training strategies for the bridging module, which sets the observation addition (OA) coefficient that balances noisy and enhanced speech. Training uses real noisy speech, scored for perceptual quality with DNSMOS, instead of simulated data only, and the paper reports significantly better ASR results on real noisy speech.
Contribution
The paper proposes a novel training approach for the bridging module using real noisy speech and multi-task learning, reducing the gap between speech enhancement and recognition models.
Findings
Significant WER reduction on the CHiME-4 dataset
Improved robustness of the bridging module with real noisy speech
Enhanced ASR performance compared to simulated data training
Abstract
The information loss or distortion caused by single-channel speech enhancement (SE) harms the performance of automatic speech recognition (ASR). Observation addition (OA) is an effective post-processing method that improves ASR performance by balancing noisy and enhanced speech. Determining the OA coefficient is crucial. However, the current supervised OA coefficient module, called the bridging module, is trained only on simulated noisy speech, which has a severe mismatch with real noisy speech. In this paper, we propose training strategies to train the bridging module with real noisy speech. First, DNSMOS is selected to evaluate the perceptual quality of real noisy speech, so the bridging module can be trained without the corresponding clean label. Additional constraints are introduced during training to further enhance the robustness of the bridging module. Each utterance is…
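As a rough illustration of the observation addition step described in the abstract, the sketch below blends a noisy waveform with its enhanced version using a single coefficient. The function name `observation_addition`, the convention that `alpha` weights the noisy signal, and the use of plain Python lists are assumptions made for this example; the paper may define the coefficient differently, and in the paper the coefficient is predicted by the bridging module rather than set by hand.

```python
def observation_addition(noisy, enhanced, alpha):
    """Blend noisy and enhanced waveforms sample by sample.

    Assumed convention (not confirmed by the source): alpha weights the
    noisy observation, so alpha = 0 returns the enhanced signal and
    alpha = 1 returns the noisy input unchanged.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Weighted sum of the two signals, one sample at a time.
    return [alpha * n + (1.0 - alpha) * e for n, e in zip(noisy, enhanced)]


if __name__ == "__main__":
    noisy = [1.0, 2.0]
    enhanced = [3.0, 4.0]
    # An even blend of the two toy signals.
    print(observation_addition(noisy, enhanced, 0.5))
```

A larger `alpha` keeps more of the original observation, which can restore information that enhancement removed, at the cost of reintroducing noise; this trade-off is why choosing the coefficient per utterance matters.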
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems
