DualStream Contextual Fusion Network: Efficient Target Speaker Extraction by Leveraging Mixture and Enrollment Interactions

Ke Xue; Rongfei Fan; Shanping Yu; Chang Sun; Jianping An

arXiv:2502.08191·cs.SD·February 13, 2025

TL;DR

This paper introduces DCF-Net, a novel time-frequency domain network that leverages mixture and enrollment interactions for more accurate target speaker extraction, outperforming state-of-the-art methods.

Contribution

The paper proposes a DualStream Fusion Block to better utilize contextual and interaction information between mixture and enrollment signals in speaker extraction.

Findings

01 Achieves 21.6 dB SI-SDRi on the benchmark dataset.

02 Reduces target confusion to 0.4%.

03 Demonstrates robustness in noisy and reverberant environments.
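
The first finding is reported in SI-SDRi (scale-invariant signal-to-distortion ratio improvement), which measures how many dB the extracted signal gains over the unprocessed mixture. The sketch below shows the commonly used definition in NumPy; it is an illustration only, not the authors' evaluation code, and the function names `si_sdr` and `si_sdr_improvement` and the `eps` stabilizer are assumptions introduced here.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    # Remove DC offsets so the measure ignores constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled target.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(distortion, distortion) + eps)
    return float(10.0 * np.log10(ratio))


def si_sdr_improvement(estimate: np.ndarray, mixture: np.ndarray, target: np.ndarray) -> float:
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    # Synthetic stand-ins for speech (assumption: random noise, 1 s at 16 kHz).
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)
    interferer = rng.standard_normal(16000)
    mixture = target + interferer
    # A hypothetical extractor that attenuates the interferer by a factor of 10.
    estimate = target + 0.1 * interferer
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is preferred over plain SDR when a model's output gain is arbitrary.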

Abstract

Target speaker extraction focuses on extracting a target speech signal from an environment with multiple speakers by leveraging an enrollment. Existing methods predominantly rely on speaker embeddings obtained from the enrollment, potentially disregarding the contextual information and the internal interactions between the mixture and the enrollment. In this paper, we propose a novel DualStream Contextual Fusion Network (DCF-Net) in the time-frequency (T-F) domain. Specifically, a DualStream Fusion Block (DSFB) is introduced to obtain contextual information and capture the interactions between the contextualized enrollment and the mixture representation across both spatial and channel dimensions; the resulting rich and consistent representations are then used to guide the extraction network. Experimental results demonstrate that DCF-Net outperforms state-of-the-art (SOTA) methods,…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing