Guided Speaker Embedding
Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

TL;DR
This paper introduces a guided speaker embedding extraction method that leverages speech activity cues to extract target speaker embeddings from overlapping speech, improving performance in speaker verification and diarization tasks.
Contribution
It proposes using the speech activities of the target and interference speakers as clues to extract the target speaker's embedding directly from overlapped speech, addressing a limitation of conventional approaches that rely only on single-speaker intervals.
Findings
Improved speaker verification accuracy.
Enhanced speaker diarization performance.
Effective extraction from overlapped speech segments.
Abstract
This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the latter purpose. Typical speaker embedding extraction approaches only use single-speaker intervals to avoid corrupting the embeddings with speech from interference speakers. However, this often makes speaker embeddings impossible to extract because sufficiently long non-overlapping intervals are not always available. In this paper, we propose using speaker activities as clues to extract the embedding of the speaker-of-interest directly from overlapping speech. Specifically, we concatenate the…
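To make the limitation of the conventional approach concrete, the following is a minimal sketch of how single-speaker intervals might be selected from frame-level speaker activities. It illustrates only the baseline described in the abstract, not the proposed guided extraction. The function names, the binary activity layout (one list of 0/1 frames per speaker), and the `min_frames` threshold are assumptions made for illustration and do not come from the paper.

```python
from typing import List


def single_speaker_frames(activities: List[List[int]], target: int) -> List[int]:
    """Return indices of frames where only the target speaker is active.

    `activities[s][t]` is 1 if speaker `s` is active at frame `t`, else 0
    (a hypothetical layout chosen for this sketch).
    """
    n_speakers = len(activities)
    n_frames = len(activities[target])
    frames = []
    for t in range(n_frames):
        target_on = activities[target][t] == 1
        others_on = any(
            activities[s][t] == 1 for s in range(n_speakers) if s != target
        )
        if target_on and not others_on:
            frames.append(t)
    return frames


def has_enough_clean_speech(
    activities: List[List[int]], target: int, min_frames: int
) -> bool:
    """Check whether the target has at least `min_frames` non-overlapped frames."""
    return len(single_speaker_frames(activities, target)) >= min_frames


# Toy example (made-up data): 2 speakers, 8 frames, overlap in frames 2-4.
activities = [
    [1, 1, 1, 1, 1, 0, 0, 0],  # speaker 0
    [0, 0, 1, 1, 1, 1, 1, 1],  # speaker 1
]
clean_0 = single_speaker_frames(activities, 0)
clean_1 = single_speaker_frames(activities, 1)
```

In this toy case speaker 0 is active for five frames but only two of them are free of overlap, so a baseline that requires, say, three clean frames could not produce an embedding for that speaker. This is the situation the abstract describes as motivation for extracting embeddings directly from overlapping speech.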
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
