Guided Speaker Embedding

Shota Horiguchi; Takafumi Moriya; Atsushi Ando; Takanori Ashihara; Hiroshi Sato; Naohiro Tawara; Marc Delcroix

arXiv:2410.12182·eess.AS·January 3, 2025

TL;DR

This paper introduces a guided speaker embedding extraction method that leverages speech activity cues to extract target speaker embeddings from overlapping speech, improving performance in speaker verification and diarization tasks.

Contribution

It proposes using speaker activity information to extract the target speaker's embedding directly from overlapped speech, addressing the limitation of conventional methods that rely only on single-speaker intervals.

Findings

01 Improved speaker verification accuracy.

02 Enhanced speaker diarization performance.

03 Effective extraction from overlapped speech segments.

Abstract

This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the latter purpose. Typical speaker embedding extraction approaches only use single-speaker intervals to avoid corrupting the embeddings with speech from interference speakers. However, this often makes speaker embeddings impossible to extract because sufficiently long non-overlapping intervals are not always available. In this paper, we propose using speaker activities as clues to extract the embedding of the speaker-of-interest directly from overlapping speech. Specifically, we concatenate the…
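
The limitation of single-speaker intervals described in the abstract can be illustrated with a minimal sketch. It assumes speaker activities are given as a binary matrix of shape (speakers, frames); the function names `single_speaker_mask` and `longest_run` and the toy activity pattern are hypothetical and are not taken from the paper. The sketch shows only the conventional interval selection, not the proposed guided extraction.

```python
import numpy as np


def single_speaker_mask(activity: np.ndarray, target: int) -> np.ndarray:
    """Return a boolean mask of frames where only the target speaker is active.

    activity: binary matrix of shape (num_speakers, num_frames) (assumed layout).
    target:   row index of the speaker of interest.
    """
    active = activity.astype(bool)
    # A frame is "interfered" if any non-target speaker is active in it.
    others = np.delete(active, target, axis=0).any(axis=0)
    return active[target] & ~others


def longest_run(mask: np.ndarray) -> int:
    """Length (in frames) of the longest run of consecutive True values."""
    best = run = 0
    for value in mask:
        run = run + 1 if value else 0
        best = max(best, run)
    return best


# Toy example (hypothetical): two speakers, eight frames, heavy overlap.
activity = np.array([
    [1, 1, 1, 1, 1, 1, 0, 0],  # target speaker
    [0, 0, 1, 1, 1, 1, 1, 1],  # interference speaker
])

mask = single_speaker_mask(activity, target=0)
usable_frames = int(mask.sum())
longest = longest_run(mask)
print(usable_frames, longest)
```

In this toy case the target speaker is active for six frames but only two of them are free of overlap, so an extractor restricted to single-speaker intervals would see a third of the target's speech. With a minimum-length requirement longer than two frames, no usable interval would remain at all, which is the situation the abstract describes.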

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need