Online Target Speaker Voice Activity Detection for Speaker Diarization
Weiqing Wang, Qingjian Lin, Ming Li

TL;DR
This paper introduces an online target speaker voice activity detection system for speaker diarization. It updates the target speaker embeddings block by block as the signal arrives, so it does not need embeddings obtained in advance from a clustering-based diarization system.
Contribution
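The proposed system performs online speaker diarization without target speaker embeddings supplied by a prior clustering-based system. A ResNet-based front-end extracts frame-level speaker embeddings for each incoming block, and the target speaker embeddings are updated iteratively from the per-block predictions.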
Findings
Outperforms offline clustering-based diarization on the AliMeeting dataset
Operates in real time without prior knowledge of the target speaker embeddings
Updates the target speaker embeddings iteratively, block by block, to improve accuracy
Abstract
This paper proposes an online target speaker voice activity detection system for speaker diarization tasks, which does not require a priori knowledge from the clustering-based diarization system to obtain the target speaker embeddings. First, we employ a ResNet-based front-end model to extract the frame-level speaker embeddings for each coming block of a signal. Next, we predict the detection state of each speaker based on these frame-level speaker embeddings and the previously estimated target speaker embedding. Then, the target speaker embeddings are updated by aggregating these frame-level speaker embeddings according to the predictions in the current block. We iteratively extract the results for each block and update the target speaker embedding until reaching the end of the signal. Experimental results show that the proposed method is better than the offline clustering-based…
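The block-wise loop described in the abstract (extract frame-level embeddings, predict each speaker's state, update the target embeddings, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the frame-level embeddings are assumed to be already extracted (the ResNet-based front-end is not modeled), a cosine-similarity threshold stands in for the paper's detection model, and the rule that enrolls a new speaker from unmatched frames is hypothetical. The names `online_ts_vad`, `max_speakers`, and `threshold` are likewise illustrative.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def online_ts_vad(blocks, max_speakers=2, threshold=0.5):
    """Block-wise speaker activity sketch.

    blocks: iterable of (num_frames, dim) arrays of frame-level speaker
            embeddings, one array per incoming block.
    Returns one boolean array per block, shaped
    (num_frames, speakers_known_so_far), marking which speakers are active.
    """
    sums = []      # running sum of assigned frame embeddings, one per speaker
    results = []
    for frames in blocks:
        frames = l2_normalize(np.asarray(frames, dtype=float))

        # Predict each speaker's state from the current block's frames and
        # the previously estimated target embeddings.
        # ASSUMPTION: cosine scoring replaces the paper's learned detector.
        if sums:
            targets = l2_normalize(np.stack(sums))
            active = (frames @ targets.T) > threshold
        else:
            active = np.zeros((len(frames), 0), dtype=bool)

        # HYPOTHETICAL enrollment rule: frames that match no known speaker
        # seed a new target slot, up to max_speakers.
        unclaimed = ~active.any(axis=1)
        if unclaimed.any() and len(sums) < max_speakers:
            sums.append(np.zeros(frames.shape[1]))
            active = np.column_stack([active, unclaimed])

        # Update each target embedding by aggregating the frames that were
        # predicted active for that speaker in this block.
        for s in range(active.shape[1]):
            sums[s] += frames[active[:, s]].sum(axis=0)

        results.append(active)
    return results
```

Because each speaker is thresholded independently, a frame can be marked active for more than one speaker, which is how overlapping speech would show up in this sketch. Keeping a running sum rather than a stored mean makes the per-block update a single addition.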
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
