Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

Shitong Xu; Yiyuan Yang; Niki Trigoni; Andrew Markham

arXiv:2502.16611·cs.SD·December 9, 2025

Open Access

TL;DR

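This paper introduces a target speaker extraction method that works from noisy enrollments: it encodes the target speaker's identity by comparing segments where the target speaker is talking (positive enrollments) with segments where the target speaker is silent (negative enrollments), and achieves state-of-the-art results without requiring clean enrollment data.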

Contribution

The work proposes a new enrollment strategy using noisy positive and negative segments and a two-stage training process, advancing target speaker extraction in realistic noisy conditions.

Findings

01. Achieves over 2.1 dB higher SI-SNRi than previous methods.

02. Reduces training convergence time by 60%.

03. State-of-the-art performance in noisy enrollment scenarios.
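
For readers unfamiliar with the metric in the first finding, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the extracted signal over the unprocessed mixture. The sketch below uses the commonly cited definition (zero-mean signals, projection of the estimate onto the target). It is illustrative only: the function names `si_snr` and `si_snr_improvement` are hypothetical, and the paper's own evaluation code is not shown on this page.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(target)
    if len(estimate) != n:
        raise ValueError("estimate and target must have the same length")

    # Remove the mean from both signals.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    scale = dot / (target_energy + eps)
    s_target = [scale * b for b in t]

    # Everything left over is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    signal_power = sum(s * s for s in s_target)
    noise_power = sum(x * x for x in e_noise)
    return 10 * math.log10((signal_power + eps) / (noise_power + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy example (synthetic sinusoids, not data from the paper).
target = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
interferer = [math.sin(2 * math.pi * 13 * i / 200) for i in range(200)]
mixture = [t + v for t, v in zip(target, interferer)]
estimate = [t + 0.1 * v for t, v in zip(target, interferer)]
print(round(si_snr_improvement(estimate, mixture, target), 2))
```

In this toy case the interferer is attenuated by a factor of 10 in amplitude, so the improvement comes out at roughly 20 dB. Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged.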

Abstract

Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

MethodsFocus