TL;DR
This paper introduces a novel generative speech enhancement model based on Schrödinger bridge theory, which outperforms diffusion models in speech quality and ASR performance while being more computationally efficient.
Contribution
The paper develops a Schrödinger bridge-based model for speech enhancement, offering a new data-to-data process formulation that improves quality and efficiency over existing diffusion models.
Findings
Outperforms diffusion models in speech quality metrics
Reduces word error rate by a relative 20% in denoising and 6% in dereverberation compared to the best baseline model
Achieves better quality with fewer sampling steps and lower computational cost
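The word error rate figures above are relative reductions, not absolute percentage-point drops. A minimal sketch of that arithmetic follows; the function name and the example WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction (in percent) of a proposed system
    over a baseline. Both inputs are WERs expressed in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical values: a baseline at 10.0% WER and a proposed model at 8.0% WER
# differ by 2 absolute points, which is a 20% relative reduction.
print(relative_wer_reduction(10.0, 8.0))
```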
Abstract
This paper proposes a generative speech enhancement model based on the Schrödinger bridge (SB). The proposed model employs a tractable SB to formulate a data-to-data process between the clean speech distribution and the observed noisy speech distribution. The model is trained with a data prediction loss, aiming to recover the complex-valued clean speech coefficients, and an auxiliary time-domain loss is used to improve training. The effectiveness of the proposed SB-based model is evaluated on two speech enhancement tasks: speech denoising and speech dereverberation. The experimental results demonstrate that the proposed SB-based model outperforms diffusion-based models in terms of speech quality metrics and ASR performance, e.g., yielding a relative word error rate reduction of 20% for denoising and 6% for dereverberation compared to the best baseline model. The…
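The training objective described in the abstract, a data prediction loss on complex-valued coefficients plus an auxiliary time-domain loss, could be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact form of either term, so the squared-error and L1 choices, the function name `enhancement_loss`, and the weight `time_weight` are all hypothetical.

```python
import numpy as np


def enhancement_loss(pred_spec, clean_spec, pred_wave, clean_wave, time_weight=1e-3):
    """Combine a data prediction loss with an auxiliary time-domain loss.

    pred_spec, clean_spec: complex-valued arrays of spectral coefficients.
    pred_wave, clean_wave: real-valued arrays of time-domain samples.
    time_weight: hypothetical weight on the auxiliary term (an assumption).
    """
    # Assumed data prediction term: mean squared magnitude of the complex error.
    data_loss = np.mean(np.abs(pred_spec - clean_spec) ** 2)
    # Assumed auxiliary term: mean absolute error between the waveforms.
    time_loss = np.mean(np.abs(pred_wave - clean_wave))
    return data_loss + time_weight * time_loss


# Usage with random placeholder data (shapes are arbitrary).
rng = np.random.default_rng(0)
clean_spec = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
pred_spec = clean_spec + 0.1 * rng.standard_normal((4, 8))
clean_wave = rng.standard_normal(64)
pred_wave = clean_wave + 0.1 * rng.standard_normal(64)
print(enhancement_loss(pred_spec, clean_spec, pred_wave, clean_wave))
```

In a real system the waveforms would be obtained from the coefficients by an inverse transform; they are passed in separately here to keep the sketch self-contained.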
