Mitigating Non-Target Speaker Bias in Guided Speaker Embedding

Shota Horiguchi; Takanori Ashihara; Marc Delcroix; Atsushi Ando; Naohiro Tawara

arXiv:2506.12500·eess.AS·June 17, 2025

TL;DR

This paper addresses the issue of non-target speaker bias in guided speaker embeddings by proposing a method that leverages target speaker activity clues to improve performance in overlapping speech scenarios.

Contribution

It introduces an extension to global-statistics modules that incorporates target speaker activity, reducing bias and enhancing embedding quality.

Findings

01. Improved speaker verification accuracy across various overlap ratios.

02. Enhanced diarization performance on multiple datasets.

03. Reduction in bias caused by non-target speaker intervals.

Abstract

Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap, at the cost of a small degradation in low-overlap cases. However, since extreme overlaps are rare in natural conversations, this degradation cannot be overlooked. This paper first reveals that the degradation is caused by the global-statistics-based modules, widely used in speaker embedding extractors, being overly sensitive to intervals containing only non-target speakers. As a countermeasure, we propose an extension of such modules that exploits the target speaker activity clues to compute statistics from intervals where the target is active. The proposed method improves speaker verification performance in both…
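The idea of computing statistics only from intervals where the target is active can be illustrated with a small sketch. This is not the authors' implementation: it assumes the global-statistics module is a plain mean-and-standard-deviation pooling layer, that the target activity is given as one weight per frame, and the function names `masked_stats_pooling` and `global_stats_pooling` are hypothetical.

```python
import math


def masked_stats_pooling(frames, target_activity, eps=1e-8):
    """Activity-weighted mean and standard deviation over time.

    frames: list of T frame-level feature vectors (each a list of D floats).
    target_activity: list of T weights in [0, 1]; 1 means the target is active.
    Returns (mean, std), each a list of D floats.
    Assumption: this weighting scheme is illustrative, not taken from the paper.
    """
    if len(frames) != len(target_activity):
        raise ValueError("frames and target_activity must have the same length")
    total = sum(target_activity)
    if total <= 0:
        raise ValueError("the target speaker is never active")
    dim = len(frames[0])
    # Weighted mean: frames where the target is silent contribute nothing.
    mean = [
        sum(w * f[d] for f, w in zip(frames, target_activity)) / total
        for d in range(dim)
    ]
    # Weighted variance around the weighted mean.
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for f, w in zip(frames, target_activity)) / total
        for d in range(dim)
    ]
    std = [math.sqrt(v + eps) for v in var]
    return mean, std


def global_stats_pooling(frames):
    """Conventional pooling: every frame gets the same weight."""
    return masked_stats_pooling(frames, [1.0] * len(frames))


# Toy input: the target speaks in the first two frames only;
# the last two frames contain a non-target speaker.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]
target_activity = [1.0, 1.0, 0.0, 0.0]

masked_mean, masked_std = masked_stats_pooling(frames, target_activity)
global_mean, global_std = global_stats_pooling(frames)
```

In this toy input, the conventional statistics are pulled toward the non-target frames, while the activity-weighted statistics depend only on the frames in which the target speaks. Soft activity values between 0 and 1 work with the same code.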

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis