Online Target Speaker Voice Activity Detection for Speaker Diarization

Weiqing Wang; Qingjian Lin; Ming Li

arXiv:2207.05920·eess.AS·July 14, 2022

TL;DR

This paper introduces an online target speaker voice activity detection (TS-VAD) system for speaker diarization. It updates the target speaker embeddings block by block as the signal arrives, so it does not need a clustering-based diarization pass to supply them in advance.

Contribution

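The proposed system performs online speaker diarization without target speaker embeddings obtained beforehand from a clustering-based system. A ResNet-based front-end extracts frame-level speaker embeddings for each incoming block, and the target speaker embeddings are updated iteratively from the per-block predictions.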

Findings

01

Outperforms offline clustering-based diarization on the AliMeeting dataset

02

Operates online, block by block, without prior knowledge of the target speaker embeddings

03

Uses iterative embedding updates for improved accuracy

Abstract

This paper proposes an online target speaker voice activity detection system for speaker diarization tasks, which does not require a priori knowledge from the clustering-based diarization system to obtain the target speaker embeddings. First, we employ a ResNet-based front-end model to extract the frame-level speaker embeddings for each incoming block of a signal. Next, we predict the detection state of each speaker based on these frame-level speaker embeddings and the previously estimated target speaker embedding. Then, the target speaker embeddings are updated by aggregating these frame-level speaker embeddings according to the predictions in the current block. We iteratively extract the results for each block and update the target speaker embedding until reaching the end of the signal. Experimental results show that the proposed method is better than the offline clustering-based…
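The block-wise loop described in the abstract (predict against the previously estimated target embeddings, then update them from the current block's predictions) can be illustrated with a toy sketch. This is not the authors' implementation. Several pieces are assumptions made only for illustration: the function names `cosine` and `online_ts_vad` are hypothetical, a cosine-similarity threshold stands in for the paper's detection model, hand-written 2-D vectors stand in for ResNet frame-level embeddings, and the rule that enrolls a new speaker when no existing target matches is invented here because the abstract does not say how new speakers are handled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def online_ts_vad(blocks, threshold=0.8):
    """Toy online loop: per block, predict speaker activity, then update targets.

    blocks: list of blocks; each block is a list of frame-level embeddings.
    Returns (results, targets), where results[b][f][s] is True if speaker s
    is predicted active in frame f of block b.
    """
    targets = []  # each target: running-mean embedding and frame count
    results = []
    for block in blocks:
        # ASSUMPTION (not from the paper): enroll a new target when a frame
        # matches no existing target. count=0 marks the mean as a seed only.
        for frame in block:
            if all(cosine(frame, t["mean"]) < threshold for t in targets):
                targets.append({"mean": list(frame), "count": 0})

        # Predict each speaker's state per frame from the current targets.
        # A similarity threshold replaces the paper's detection model here.
        preds = [
            [cosine(frame, t["mean"]) >= threshold for t in targets]
            for frame in block
        ]

        # Update targets by aggregating the frames predicted as active,
        # using an incremental running mean.
        for frame, row in zip(block, preds):
            for t, active in zip(targets, row):
                if active:
                    t["count"] += 1
                    t["mean"] = [
                        m + (f - m) / t["count"]
                        for m, f in zip(t["mean"], frame)
                    ]
        results.append(preds)
    return results, targets


# Two blocks of toy 2-D "embeddings": speaker A near [1, 0], speaker B near [0, 1].
blocks = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.05]],
]
results, targets = online_ts_vad(blocks)
print(results)
print([t["count"] for t in targets])
```

In this sketch the predictions for a block are fixed before any target is updated, which mirrors the predict-then-update order in the abstract. The number of columns in each prediction row grows as new targets are enrolled, so earlier blocks have fewer speaker columns than later ones.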

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing