Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module

Zhongjian Cui; Chenrui Cui; Tianrui Wang; Mengnan He; Hao Shi; Meng Ge; Caixia Gong; Longbiao Wang; Jianwu Dang

arXiv:2501.02452 · cs.SD · January 7, 2025

Open Access

TL;DR

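This paper introduces training strategies for the bridging module, which determines the observation addition (OA) coefficient that balances noisy and enhanced speech. The module is trained on real noisy speech, using DNSMOS perceptual quality scores in place of clean reference labels, to better align speech enhancement with recognition and lower the word error rate (WER) of ASR on real recordings.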

Contribution

The paper proposes a novel training approach for the bridging module using real noisy speech and multi-task learning, reducing the gap between speech enhancement and recognition models.

Findings

01 Significant WER reduction on CHiME-4 dataset

02 Improved robustness of the bridging module with real noisy speech

03 Enhanced ASR performance compared to simulated data training

Abstract

The information loss or distortion caused by single-channel speech enhancement (SE) harms the performance of automatic speech recognition (ASR). Observation addition (OA) is an effective post-processing method that improves ASR performance by balancing noisy and enhanced speech, so determining the OA coefficient is crucial. However, the current supervised OA coefficient module, called the bridging module, is trained only on simulated noisy speech, which is severely mismatched with real noisy speech. In this paper, we propose strategies for training the bridging module with real noisy speech. First, DNSMOS is selected to evaluate the perceptual quality of real noisy speech, so the bridging module can be trained without the corresponding clean label. Additional constraints are introduced during training to further enhance the robustness of the bridging module. Each utterance is…
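As a rough illustration of the observation addition step described in the abstract, the sketch below blends a noisy waveform with its enhanced version using a single coefficient. The function name `observation_addition`, the parameter `coeff`, and the convention that `coeff` weights the enhanced signal (with `1 - coeff` on the noisy one) are assumptions made here for illustration; the abstract does not state the exact formula, and in the paper the coefficient comes from the bridging module rather than being fixed by hand.

```python
from typing import List, Sequence


def observation_addition(noisy: Sequence[float],
                         enhanced: Sequence[float],
                         coeff: float) -> List[float]:
    """Blend noisy and enhanced waveforms sample by sample.

    Assumed convention (not stated in the abstract): `coeff` weights the
    enhanced signal and `1 - coeff` weights the noisy observation.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("coeff must lie in [0, 1]")
    # Weighted sum of the two signals, one sample at a time.
    return [coeff * e + (1.0 - coeff) * n for n, e in zip(noisy, enhanced)]


# Toy waveforms; a real system would pass full utterances here.
noisy = [0.5, -0.25, 1.0]
enhanced = [0.0, -0.5, 0.5]
mixed = observation_addition(noisy, enhanced, 0.5)
```

With `coeff = 1.0` the output is the enhanced speech alone, and with `coeff = 0.0` it falls back to the unprocessed noisy speech, which is why choosing the coefficient per utterance matters for ASR.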

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems