DDTSE: Discriminative Diffusion Model for Target Speech Extraction
Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian

TL;DR
This paper introduces DDTSE, a discriminative diffusion model for target speech extraction in multi-speaker noisy environments. It improves perceptual speech quality and speeds up inference, and it can also improve the output of existing discriminative models without retraining them.
Contribution
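The paper proposes a diffusion-based approach to target speech extraction that combines the forward process of diffusion models with a reconstruction loss similar to that of discriminative methods, together with a two-stage training strategy that emulates the inference process during training.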
Findings
Achieves higher perceptual speech quality
Speeds up inference by 3 times
Can further improve existing discriminative models without additional retraining
Abstract
Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, research on target speech extraction under multi-speaker noisy conditions remains relatively unexplored. Moreover, the superior quality of diffusion methods typically comes at the cost of slower inference speed. In this paper, we introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE). We apply the same forward process as diffusion models and utilize the reconstruction loss similar to discriminative methods. Furthermore, we devise a two-stage training strategy to emulate the inference process during model training. DDTSE not only works as a standalone system, but also can further improve the performance of discriminative models without additional retraining. Experimental results demonstrate that DDTSE not only achieves…
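The idea of pairing a diffusion forward process with a discriminative reconstruction loss can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the noise schedule, the network, or the exact loss, so the cosine schedule, the `toy_extractor` stand-in, and the mean-squared-error loss below are all assumptions chosen only to make the example runnable.

```python
import math
import random


def forward_diffuse(clean, t, rng):
    """Assumed DDPM-style forward step: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise."""
    a_t = math.cos(0.5 * math.pi * t) ** 2  # hypothetical cosine schedule, t in [0, 1]
    return [
        math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * rng.gauss(0.0, 1.0)
        for x in clean
    ]


def reconstruction_loss(estimate, clean):
    """Mean squared error between the estimate and the clean target signal."""
    return sum((e - c) ** 2 for e, c in zip(estimate, clean)) / len(clean)


def toy_extractor(noisy_state, t):
    """Hypothetical stand-in for the network; a real model would also be
    conditioned on the mixture and on an enrollment clip of the target speaker."""
    return list(noisy_state)  # identity placeholder, no learning happens here


rng = random.Random(0)
clean = [math.sin(0.1 * n) for n in range(64)]  # toy "target speech" signal

# One training-style step: corrupt the clean target with the forward process,
# let the model estimate the clean signal directly, and score the estimate
# with a reconstruction loss instead of a noise- or score-matching loss.
x_t = forward_diffuse(clean, 0.5, rng)
loss = reconstruction_loss(toy_extractor(x_t, 0.5), clean)
```

With a trainable network in place of `toy_extractor`, `loss` would be the quantity minimized by the optimizer; the point of the sketch is only that the training target is the clean signal itself.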
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Diffusion
