Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization
Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

TL;DR
This paper introduces a unified online clustering framework for speaker diarization that integrates training and searching processes, significantly improving real-time clustering accuracy.
Contribution
It proposes a novel interactive framework combining clustering-guided recurrent training and truncated beam searching clustering for online speaker diarization.
Findings
Achieved a 14.48% diarization error rate (DER) on AISHELL-4 with low latency
Outperformed offline agglomerative hierarchical clustering
Introduced cluster-aware training for embedding extractors
Abstract
For online speaker diarization, samples arrive incrementally, and the overall distribution of the samples is invisible. Moreover, in most existing clustering-based methods, the training objective of the embedding extractor is not specifically designed for clustering. To improve online speaker diarization performance, we propose a unified online clustering framework in which embedding extractors and clustering algorithms interact. Specifically, the framework consists of two highly coupled parts: clustering-guided recurrent training (CGRT) and truncated beam searching clustering (TBSC). CGRT introduces the clustering algorithm into the training process of embedding extractors, which provides not only cluster-aware information for the embedding extractor but also crucial parameters for the subsequent clustering process. And with these parameters, which…
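To make the online setting in the abstract more concrete, here is a minimal Python sketch of clustering embeddings one at a time with a beam of competing assignment hypotheses. It is not the paper's TBSC: the cosine-similarity score, the fixed `new_cluster_score`, the `beam_width` value, and the function names are all assumptions chosen for illustration, and the sketch does not use any parameters learned by CGRT.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def beam_search_cluster(embeddings, beam_width=4, new_cluster_score=0.5):
    """Assign each incoming embedding to a cluster, keeping several hypotheses.

    Hypothetical scoring (an assumption, not taken from the paper):
    joining an existing cluster adds the cosine similarity to that
    cluster's summed embedding; opening a new cluster adds a fixed score.
    """
    # Each hypothesis is (labels, cluster_sums, score).
    beams = [([], [], 0.0)]
    for emb in embeddings:
        candidates = []
        for labels, sums, score in beams:
            # Option 1: assign the embedding to each existing cluster.
            for k, cluster_sum in enumerate(sums):
                sim = cosine(emb, cluster_sum)
                new_sums = [list(s) for s in sums]
                new_sums[k] = [a + b for a, b in zip(cluster_sum, emb)]
                candidates.append((labels + [k], new_sums, score + sim))
            # Option 2: open a new cluster for this embedding.
            candidates.append(
                (labels + [len(sums)],
                 [list(s) for s in sums] + [list(emb)],
                 score + new_cluster_score)
            )
        # Truncate: keep only the best-scoring hypotheses.
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


if __name__ == "__main__":
    stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
    print(beam_search_cluster(stream))
```

Keeping more than one hypothesis lets an early, uncertain assignment be revised implicitly when later samples favour a different partition, which a purely greedy online assignment cannot do. The beam width trades that flexibility against per-sample cost.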
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
