Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization

Jenthe Thienpondt; Kris Demuynck

arXiv:2405.09142·eess.AS·May 16, 2024

Open Access

TL;DR

This paper demonstrates that the attention mechanism within speaker embedding models can serve as an effective weakly supervised voice activity detector, enabling more efficient speaker diarization without external VAD models.

Contribution

The paper introduces a novel speaker diarization pipeline that uses ECAPA2 speaker embeddings for both VAD and embedding extraction, achieving state-of-the-art results.

Findings

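01. The attention system of a speaker embedding extractor acts as a weakly supervised VAD, performing as well as or better than comparable supervised VAD systems.

02. The proposed method outperforms existing diarization systems.

03. It achieves state-of-the-art results on multiple benchmarks, including AMI.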

Abstract

Current speaker diarization systems rely on an external voice activity detection (VAD) model that runs before speaker embeddings are extracted from the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Consequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, removing the need for, and the computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI,…
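
The idea in the abstract, that one attention pass can yield both frame-level VAD logits and a pooled speaker embedding, can be illustrated with a toy sketch. This is not the authors' ECAPA2 implementation: it assumes a single linear scoring layer in place of the real attention module, hand-picked weights, and an illustrative threshold of 0. The function names (`attentive_pool`, `vad_mask`, `mask_to_segments`) and the 10 ms frame shift are hypothetical and chosen only for this example.

```python
import numpy as np


def attentive_pool(frames, w, b=0.0):
    """Score each frame and pool the frames with the resulting attention weights.

    frames: (T, D) frame-level features; w: (D,) scoring vector (an assumption:
    a real extractor would use a learned, channel-dependent attention module).
    Returns the per-frame logits and the attention-weighted mean embedding.
    """
    logits = frames @ w + b
    weights = np.exp(logits - logits.max())  # numerically stable softmax over time
    weights /= weights.sum()
    embedding = weights @ frames
    return logits, embedding


def vad_mask(logits, threshold=0.0):
    """Treat the attention logits as speech scores (illustrative threshold)."""
    return logits > threshold


def mask_to_segments(mask, frame_shift=0.01):
    """Convert a boolean frame mask into (start, end) times in seconds."""
    segments, start = [], None
    for i, active in enumerate(mask):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        segments.append((start * frame_shift, len(mask) * frame_shift))
    return segments


# Toy input: 10 frames, frames 3..6 carry "speech-like" features.
frames = np.zeros((10, 2))
frames[3:7] = [1.0, 2.0]
w = np.array([1.0, 1.0])  # hand-set weights, not learned
b = -1.0

logits, embedding = attentive_pool(frames, w, b)
mask = vad_mask(logits)
segments = mask_to_segments(mask)
print(mask.astype(int), segments, embedding)
```

In this toy setup the same logits drive both outputs: frames with high scores dominate the pooled embedding and are also the frames marked as speech, so no separate VAD pass is needed.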

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing