Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement

Rauf Nasretdinov; Roman Korostik; Ante Jukić

arXiv:2505.04237·eess.AS·May 9, 2025

TL;DR

This paper applies a Schrödinger bridge-based speech enhancement method to improve speech recognition accuracy in noisy and reverberant conditions, reducing WER by approximately 40% relative to unprocessed speech and by approximately 8% relative to a similarly sized predictive approach.

Contribution

It applies a recently proposed Schrödinger bridge-based speech enhancement model to robust ASR, analyzing the impact of model scaling and sampling methods and comparing it with predictive and diffusion-based baselines.

Findings

01

Reduces WER by ~40% relative to unprocessed speech.

02

Reduces WER by ~8% relative to a similarly sized predictive approach.

03

Effective across different pre-trained ASR models.

Abstract

In this work, we investigate the application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. We employ a recently proposed speech enhancement model based on the Schrödinger bridge, which has been shown to perform well compared to diffusion-based approaches. We analyze the impact of model scaling and different sampling methods on ASR performance. Furthermore, we compare the considered model with predictive and diffusion-based baselines and analyze speech recognition performance when using different pre-trained ASR models. The proposed approach significantly reduces the word error rate, by approximately 40% relative to the unprocessed speech signals and by approximately 8% relative to a similarly sized predictive approach.
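
The relative reductions quoted above are ratios of word error rates, not differences in percentage points. The following is a minimal sketch of how such a figure could be computed, assuming the standard word-level edit-distance definition of WER. The function names `word_error_rate` and `relative_wer_reduction` are illustrative and do not come from the paper, and the sketch does not reproduce the authors' evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER removed by the system (hypothetical helper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer
```

As a purely hypothetical illustration (these numbers are not results from the paper), a WER of 20% on unprocessed speech that drops to 12% after enhancement corresponds to `relative_wer_reduction(0.20, 0.12)`, a relative reduction of 40%, even though the absolute difference is only 8 percentage points.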
