Conditional Diffusion Model for Target Speaker Extraction

Theodor Nguyen; Guangzhi Sun; Xianrui Zheng; Chao Zhang; Philip C. Woodland

arXiv:2310.04791·eess.AS·October 10, 2023·2 cites

Open Access

TL;DR

This paper introduces DiffSpEx, a generative diffusion-based method that extracts a target speaker from a mixture of sources, using stochastic differential equations and speaker embeddings. On WSJ0-2mix it reaches an SI-SDR of 12.9 dB and a NISQA score of 3.56.

Contribution

The paper presents DiffSpEx, a diffusion model for target speaker extraction whose score function is conditioned on ECAPA-TDNN target speaker embeddings. It also shows that a pre-trained model can be fine-tuned for speaker personalization.

Findings

01 Achieves an SI-SDR of 12.9 dB on WSJ0-2mix

02 Demonstrates effective speaker personalization through fine-tuning

03 Scores 3.56 on the NISQA quality metric
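
The SI-SDR figure in the findings is a scale-invariant signal-to-distortion ratio, reported in dB. As a rough illustration of how such a number is commonly computed, here is a minimal NumPy sketch of the standard definition. It is not code from the paper: the function name `si_sdr`, the mean removal step, and the `eps` stabiliser are assumptions made for this example.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D time-domain signals.

    Hypothetical helper for illustration; not taken from the paper.
    """
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference

    # Everything not explained by the scaled reference counts as distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.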

Abstract

We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources. We utilise ECAPA-TDNN target speaker embeddings and condition the score function alternately on the SDE time embedding and the target speaker embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we show that fine-tuning a pre-trained DiffSpEx model to a…
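
To make the forward process in the abstract more concrete, the sketch below samples from a mean-reverting Ornstein-Uhlenbeck process whose mean moves from the clean target `x0` toward the mixture `y`. The abstract does not give the exact SDE, so the constant diffusion coefficient `g`, the stiffness `gamma`, their default values, and the function name `perturb` are all assumptions chosen for illustration, not the authors' parametrisation.

```python
import numpy as np


def perturb(x0, y, t, gamma=1.5, g=0.5, rng=None):
    """Sample x_t for the assumed SDE  dx = gamma * (y - x) dt + g dw.

    x0 : complex STFT of the target speaker (start of the process)
    y  : complex STFT of the mixture (where the mean converges)
    t  : diffusion time, t >= 0
    Returns (x_t, mean, std). Hypothetical helper, not the paper's code.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Closed-form mean of the OU process: interpolates from x0 to y.
    decay = np.exp(-gamma * t)
    mean = decay * x0 + (1.0 - decay) * y

    # Closed-form standard deviation for a constant diffusion coefficient.
    std = g * np.sqrt((1.0 - np.exp(-2.0 * gamma * t)) / (2.0 * gamma))

    # Circular complex Gaussian noise with unit total variance per bin.
    noise = (rng.standard_normal(x0.shape)
             + 1j * rng.standard_normal(x0.shape)) / np.sqrt(2.0)

    return mean + std * noise, mean, std
```

At `t = 0` the sample equals `x0`, and for large `t` it is Gaussian noise centred on `y`. That matches the behaviour the abstract describes, and it is why the reverse process can start from the mixture plus noise and move toward the target speaker.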

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion