Enhancing Target Speaker Extraction with Explicit Speaker Consistency Modeling
Shu Wu, Anbin Qi, Yanzhang Xie, Xiang Xie

TL;DR
This paper introduces a target speaker extraction method that enforces speaker consistency between the enrolled and extracted speech through a centroid-based loss and conditional loss suppression, improving extraction performance.
Contribution
The paper presents a speaker consistency-aware TSE approach that combines a centroid-based speaker consistency loss with conditional loss suppression to address speaker identity confusion in the speaker embeddings.
Findings
Improved TSE accuracy demonstrated in experiments
Speaker consistency loss enhances speaker embedding reliability
Conditional loss suppression stabilizes the training process
Abstract
Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by ensuring speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. The experimental results validate the effectiveness of the proposed methods in advancing TSE performance. A speech demo…
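The abstract does not give the formulation of the centroid-based loss or the suppression condition, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes the centroid is the normalized mean of a speaker's enrollment embeddings, that consistency is measured as one minus cosine similarity between the extracted-speech embedding and that centroid, and that the loss is suppressed (set to zero) once the similarity reaches a threshold. The function names and the `threshold` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def speaker_centroid(enroll_embs):
    """Assumed centroid: normalized mean of normalized enrollment embeddings.

    enroll_embs has shape (n_utterances, dim); the result has shape (dim,).
    """
    return l2_normalize(l2_normalize(enroll_embs).mean(axis=0))


def consistency_loss(extracted_emb, centroid):
    """Assumed consistency loss: 1 - cosine similarity to the centroid."""
    cos = float(np.dot(l2_normalize(extracted_emb), centroid))
    return 1.0 - cos


def suppressed_consistency_loss(extracted_emb, centroid, threshold=0.9):
    """Hypothetical suppression rule: drop the loss term once the extracted
    speech is already close enough to the speaker centroid."""
    cos = float(np.dot(l2_normalize(extracted_emb), centroid))
    return 0.0 if cos >= threshold else 1.0 - cos


# Toy usage with 2-dimensional embeddings (real speaker embeddings are
# much higher-dimensional and come from a speaker encoder).
centroid = speaker_centroid(np.array([[1.0, 0.0], [0.0, 1.0]]))
matched = suppressed_consistency_loss(np.array([1.0, 1.0]), centroid)
mismatched = suppressed_consistency_loss(np.array([1.0, -1.0]), centroid)
```

In a training setup along these lines, such a term would typically be added to the signal reconstruction loss with a weighting factor; the weight and the suppression condition used in the paper are not stated in the abstract.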
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
