Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement
Rauf Nasretdinov, Roman Korostik, Ante Jukić

TL;DR
This paper explores a Schrödinger bridge-based speech enhancement method to improve speech recognition accuracy in noisy and reverberant conditions, reporting a WER reduction of approximately 40% relative to unprocessed speech.
Contribution
It applies a recently proposed Schrödinger bridge-based speech enhancement model to robust ASR, analyzing the effect of model scaling and sampling methods and comparing it with predictive and diffusion-based baselines.
Findings
Reduces WER by ~40% relative to unprocessed speech.
Reduces WER by ~8% relative to a similarly sized predictive approach.
Effective across different pre-trained ASR models.
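The percentages above are relative reductions, that is, ratios of WERs rather than differences in percentage points. A minimal sketch of that arithmetic follows. The function name and the WER values are hypothetical, chosen only for illustration, and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the fractional WER reduction of system_wer relative to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs in percent, for illustration only; not taken from the paper.
unprocessed_wer = 20.0  # ASR on the noisy input
predictive_wer = 13.0   # ASR after a predictive enhancement baseline
enhanced_wer = 12.0     # ASR after generative enhancement

vs_unprocessed = relative_wer_reduction(unprocessed_wer, enhanced_wer)
vs_predictive = relative_wer_reduction(predictive_wer, enhanced_wer)

print(f"Relative reduction vs. unprocessed: {vs_unprocessed:.1%}")
print(f"Relative reduction vs. predictive:  {vs_predictive:.1%}")
```

With these made-up numbers, an 8-point absolute drop from 20% corresponds to a 40% relative reduction, while a 1-point drop from 13% corresponds to roughly 8%.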
Abstract
In this work, we investigate the application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. We employ a recently proposed speech enhancement model based on the Schrödinger bridge, which has been shown to perform well compared to diffusion-based approaches. We analyze the impact of model scaling and different sampling methods on ASR performance. Furthermore, we compare the considered model with predictive and diffusion-based baselines and analyze the speech recognition performance when using different pre-trained ASR models. The proposed approach significantly reduces the word error rate, by approximately 40% relative to the unprocessed speech signals and by approximately 8% relative to a similarly sized predictive approach.
