Schrödinger Bridge Consistency Trajectory Models for Speech Enhancement

Shuichiro Nishigori, Koichi Saito, Naoki Murata, Masato Hirano, Shusuke Takahashi, and Yuki Mitsufuji

arXiv:2507.11925·cs.SD·July 17, 2025

Open Access

TL;DR

This paper introduces Schrödinger bridge Consistency Trajectory Models (SBCTM) for speech enhancement, significantly accelerating inference speed while maintaining high speech quality by applying consistency trajectory techniques and novel loss functions.

Contribution

The paper proposes SBCTM, which combines the Schrödinger bridge with Consistency Trajectory Models and adds a new auxiliary loss, to improve both inference speed and quality in speech enhancement.

Findings

01

Achieves approximately 16x faster inference than traditional Schrödinger bridge models.

02

Maintains a favorable quality-speed trade-off with limited multi-step refinement.

03

Provides open-source code, models, and audio samples for reproducibility.

Abstract

Speech enhancement (SE) utilizing diffusion models is a promising technology that improves speech quality in noisy speech data. Furthermore, the Schrödinger bridge (SB) has recently been used in diffusion-based SE to improve speech quality by resolving a mismatch between the endpoint of the forward process and the starting point of the reverse process. However, the SB still exhibits slow inference because it requires a large number of function evaluations (NFE) to obtain high-quality results. While Consistency Models (CMs) address this issue by employing consistency training that uses distillation from pretrained models in the field of image generation, they do not improve generation quality when the number of steps increases. As a solution to this problem, Consistency Trajectory Models (CTMs) not only accelerate inference speed but also maintain a favorable…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis