Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization

Yifan Chen; Yifan Guo; Qingxuan Li; Gaofeng Cheng; Pengyuan Zhang; Yonghong Yan

arXiv:2206.13760·eess.AS·June 29, 2022

TL;DR

This paper introduces a unified online clustering framework for speaker diarization that integrates training and searching processes, significantly improving real-time clustering accuracy.

Contribution

It proposes a novel interactive framework combining clustering-guided recurrent training and truncated beam searching clustering for online speaker diarization.

Findings

01

Achieved a 14.48% diarization error rate (DER) on AISHELL-4 with low latency

02

Outperformed offline agglomerative hierarchical clustering

03

Introduced cluster-aware training for embedding extractors
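
For readers unfamiliar with the metric in the first finding, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time divided by the total reference speech time. The snippet below is a minimal sketch of that standard definition; the function name and the durations in the usage line are hypothetical and are not taken from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: all error durations over total reference speech duration.

    All arguments are durations in the same unit (e.g. seconds).
    The function name and signature are illustrative, not from the paper.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, for illustration only (not results from the paper).
der = diarization_error_rate(missed=5.0, false_alarm=3.0,
                             confusion=2.0, total_speech=100.0)
print(f"DER = {der:.2%}")
```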

Abstract

For online speaker diarization, samples arrive incrementally, and the overall distribution of the samples is invisible. Moreover, in most existing clustering-based methods, the training objective of the embedding extractor is not designed specifically for clustering. To improve online speaker diarization performance, we propose a unified online clustering framework in which embedding extractors and clustering algorithms interact. Specifically, the framework consists of two highly coupled parts: clustering-guided recurrent training (CGRT) and truncated beam searching clustering (TBSC). CGRT introduces the clustering algorithm into the training process of the embedding extractor, which provides not only cluster-aware information for the embedding extractor but also crucial parameters for the subsequent clustering process. And with these parameters, which…
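
The abstract is cut off before it describes how TBSC works, so the following is only a generic illustration of beam search applied to online clustering, not the authors' algorithm. It assumes embeddings arrive one at a time, scores each assignment by cosine similarity to a running cluster sum, and gives a fixed bonus for opening a new cluster. The names `beam_search_cluster`, `beam_width`, and `new_cluster_score` are hypothetical, and the "truncated" aspect and the CGRT-derived parameters mentioned in the abstract are not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def beam_search_cluster(embeddings, beam_width=4, new_cluster_score=0.5):
    """Assign each incoming embedding to a cluster, keeping several hypotheses.

    Illustrative sketch only. The scoring rule (cosine similarity plus a
    fixed new-cluster bonus) is an assumption, not the paper's TBSC.
    """
    # A hypothesis is (score, labels, sums); sums[k] is the running sum of
    # the embeddings assigned to cluster k (same direction as its centroid).
    beam = [(0.0, [], [])]
    for emb in embeddings:
        candidates = []
        for score, labels, sums in beam:
            # Option 1: join an existing cluster.
            for k, s in enumerate(sums):
                new_sums = list(sums)
                new_sums[k] = [u + v for u, v in zip(s, emb)]
                candidates.append(
                    (score + cosine(s, emb), labels + [k], new_sums))
            # Option 2: open a new cluster.
            candidates.append(
                (score + new_cluster_score,
                 labels + [len(sums)],
                 sums + [list(emb)]))
        # Keep only the best-scoring hypotheses.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


# Toy 2-D "embeddings" from two speakers (hypothetical data).
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
print(beam_search_cluster(stream))
```

Keeping several hypotheses lets an early assignment be revised implicitly when later samples favor a different partition, which a purely greedy online clusterer cannot do. Setting `beam_width=1` reduces the sketch to that greedy case.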

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing