DDTSE: Discriminative Diffusion Model for Target Speech Extraction

Leying Zhang; Yao Qian; Linfeng Yu; Heming Wang; Hemin Yang; Long Zhou; Shujie Liu; Yanmin Qian

arXiv:2309.13874·eess.AS·October 8, 2024

Open Access

TL;DR

This paper introduces DDTSE, a discriminative diffusion model for target speech extraction in multi-speaker noisy environments. It improves speech quality and speeds up inference, and it can also refine the output of existing discriminative models without retraining them.

Contribution

The paper proposes a diffusion-based approach to target speech extraction that trains with a reconstruction loss and uses a two-stage training strategy to emulate the inference process, improving both performance and inference speed.

Findings

01 Achieves higher perceptual speech quality

02 Speeds up inference by 3 times

03 Can enhance existing discriminative models

Abstract

Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, research on target speech extraction under multi-speaker noisy conditions remains relatively unexplored. Moreover, the superior quality of diffusion methods typically comes at the cost of slower inference speed. In this paper, we introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE). We apply the same forward process as diffusion models and utilize the reconstruction loss similar to discriminative methods. Furthermore, we devise a two-stage training strategy to emulate the inference process during model training. DDTSE not only works as a standalone system, but also can further improve the performance of discriminative models without additional retraining. Experimental results demonstrate that DDTSE not only achieves…
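The core idea in the abstract, noising the clean target with a diffusion-style forward process and then training with a reconstruction loss rather than a noise-prediction loss, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the linear noise schedule, the `sigma_max` parameter, the mean-squared-error loss, and the `toy_model` stand-in are hypothetical and are not taken from the paper. The two-stage training strategy is not shown, since the abstract gives no details about it.

```python
import numpy as np


def forward_diffuse(clean, t, rng, sigma_max=1.0):
    """Generic Gaussian forward process (assumed, not the paper's exact one).

    Adds zero-mean noise whose standard deviation grows linearly with t in [0, 1].
    """
    sigma = sigma_max * t
    return clean + sigma * rng.standard_normal(clean.shape)


def toy_model(x_t, mixture, t):
    """Hypothetical stand-in for a neural network.

    It simply averages the noised target and the mixture; a real system
    would also condition on the diffusion time t and on a speaker cue.
    """
    return 0.5 * (x_t + mixture)


def reconstruction_loss(estimate, clean):
    """Mean squared error between the estimate and the clean target.

    This plays the role of a discriminative-style reconstruction loss;
    the specific choice of MSE is an assumption.
    """
    return float(np.mean((estimate - clean) ** 2))


rng = np.random.default_rng(0)

# Synthetic "clean target" and a noisy "mixture" (toy data only).
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 160))
mixture = clean + 0.3 * rng.standard_normal(clean.shape)

# One training-style step: noise the target, predict, score the prediction.
t = 0.5
x_t = forward_diffuse(clean, t, rng)
estimate = toy_model(x_t, mixture, t)
loss = reconstruction_loss(estimate, clean)
print(f"reconstruction loss at t={t}: {loss:.4f}")
```

Because the model is scored on how closely its output matches the clean signal, a single forward pass already yields a usable estimate, which is one plausible reason such a model could need fewer inference steps than a noise-predicting diffusion model.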

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion