Enhancing Target Speaker Extraction with Explicit Speaker Consistency Modeling

Shu Wu; Anbin Qi; Yanzhang Xie; Xiang Xie

arXiv:2507.09510·cs.SD·August 12, 2025

Open Access

TL;DR

This paper introduces a target speaker extraction method that enforces speaker consistency between the enrolled and extracted speech through a centroid-based loss, combined with conditional loss suppression during training. The reported experiments support the effectiveness of both components.

Contribution

The paper presents a speaker consistency-aware TSE approach that combines a centroid-based speaker consistency loss with conditional loss suppression, targeting speaker identity confusion in embeddings derived from the enrolled speech.

Findings

01 Improved TSE accuracy demonstrated in experiments

02 Speaker consistency loss enhances speaker embedding reliability

03 Conditional loss suppression stabilizes the training process

Abstract

Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by ensuring speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. The experimental results validate the effectiveness of the proposed methods in improving TSE performance. A speech demo…
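
The abstract does not give the loss formulas, so the following is only a rough sketch of what a centroid-based speaker consistency loss with conditional suppression might look like, not the authors' formulation. It assumes that speaker embeddings are plain vectors produced by some speaker encoder (not shown), that the centroid is the mean of the L2-normalized embeddings of a speaker's enrolled utterances, that consistency is measured by cosine similarity, and that "conditional suppression" means zeroing the loss once similarity passes a threshold. The function names and the `suppress_above` parameter are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def centroid(embeddings):
    """Mean of the L2-normalized embeddings of one speaker (assumed definition)."""
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    return [sum(e[i] for e in normed) / len(normed) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def consistency_loss(extracted_emb, enrol_embs, suppress_above=None):
    """1 - cos(extracted embedding, speaker centroid).

    If suppress_above is set and the similarity already reaches it, the
    loss is suppressed to 0.0. This suppression rule is an assumption made
    for illustration; the source does not state the actual condition.
    """
    sim = cosine(extracted_emb, centroid(enrol_embs))
    if suppress_above is not None and sim >= suppress_above:
        return 0.0
    return 1.0 - sim
```

In a training loop, a term like this would presumably be added to the usual signal-level extraction loss with some weight; the abstract does not say how the terms are combined.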

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing