Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization
Jenthe Thienpondt, Kris Demuynck

TL;DR
This paper demonstrates that the attention mechanism within speaker embedding models can serve as an effective weakly supervised voice activity detector, enabling more efficient speaker diarization without external VAD models.
Contribution
It introduces a novel speaker diarization pipeline that uses ECAPA2 speaker embeddings for both VAD and speaker embedding extraction, achieving state-of-the-art results.
Findings
The attention system of a speaker embedding extractor acts as a weakly supervised VAD.
Proposed method outperforms existing diarization systems.
Achieves state-of-the-art results on multiple benchmarks.
Abstract
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and corresponding speaker embedding simultaneously, eliminating the need for, and the computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI,…
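The idea of reading VAD decisions off frame-level attention can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the input layout (one list of per-channel attention logits per frame), the channel averaging, and the fixed threshold are all assumptions made for the example.

```python
def frame_scores_from_attention(attention_logits):
    """Collapse per-channel attention logits into one score per frame.

    attention_logits: list of frames, each a list of per-channel logits
    (hypothetical layout; a real model would expose its own tensor shape).
    Averaging over channels is an assumption of this sketch.
    """
    return [sum(frame) / len(frame) for frame in attention_logits]


def speech_mask(scores, threshold=0.0):
    """Mark a frame as speech when its score exceeds the (assumed) threshold."""
    return [score > threshold for score in scores]


def mask_to_segments(mask):
    """Convert a boolean frame mask into (start, end) frame-index segments.

    The end index is exclusive.
    """
    segments = []
    start = None
    for index, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = index  # a speech segment begins here
        elif not is_speech and start is not None:
            segments.append((start, index))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(mask)))  # close a segment that runs to the end
    return segments


# Toy input: 5 frames, 2 attention channels each (made-up values).
toy_logits = [[-2.0, -1.0], [1.0, 3.0], [2.0, 2.0], [-3.0, -1.0], [0.5, 1.5]]
toy_scores = frame_scores_from_attention(toy_logits)
toy_segments = mask_to_segments(speech_mask(toy_scores))
```

In a pipeline of the kind the abstract describes, the same forward pass that yields these frame-level logits would also yield the speaker embedding, so the detected segments come at no extra model cost. How the logits are actually aggregated and thresholded is not stated in the text above.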
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
