Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model
Keisuke Kinoshita, Marc Delcroix, Tomoharu Iwata

TL;DR
This paper presents a novel deep-unfolded infinite Gaussian mixture model that tightly integrates neural and clustering-based speaker diarization, significantly improving accuracy in overlapping speech scenarios.
Contribution
It introduces a trainable clustering method via deep unfolding of iGMM, optimizing speaker embeddings for better clustering in diarization tasks.
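To illustrate what "deep unfolding" of a clustering algorithm means, here is a minimal sketch under simplifying assumptions. It is not the paper's method: it replaces the iGMM with a finite spherical Gaussian mixture, and the function name, the fixed variance, and the initialization scheme are all hypothetical. The idea it shows is that the iterative updates are run for a fixed number of steps, so the whole clustering becomes a fixed-depth computation. In a differentiable framework, gradients could then flow back to the embeddings; this NumPy version shows only the forward pass.

```python
import numpy as np


def unfolded_gmm_clustering(embeddings, num_components=4, num_iterations=5, variance=0.1):
    """Run a fixed number of EM updates for a spherical GMM.

    Assumption: a finite mixture with a shared, fixed variance stands in
    for the infinite GMM described in the paper.
    """
    n, _ = embeddings.shape
    # Deterministic initialization: take evenly spaced samples as initial means.
    init_idx = np.linspace(0, n - 1, num_components).astype(int)
    means = embeddings[init_idx].copy()
    weights = np.full(num_components, 1.0 / num_components)

    # Each loop iteration plays the role of one "layer" of the unfolded network.
    for _ in range(num_iterations):
        # E-step: soft assignment of every embedding to every component.
        sq_dist = ((embeddings[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_resp = np.log(weights + 1e-12)[None, :] - 0.5 * sq_dist / variance
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate means and mixture weights from the soft assignments.
        counts = resp.sum(axis=0)
        means = (resp.T @ embeddings) / (counts[:, None] + 1e-12)
        weights = counts / n

    return resp, means, weights
```

Because the number of iterations is fixed in advance, the output responsibilities are a deterministic function of the input embeddings, which is the property that makes end-to-end training of the embedding extractor possible in principle.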
Findings
Outperforms conventional methods in diarization error rate (DER) on CALLHOME data
Reduces speaker confusion errors significantly
Demonstrates effective integration of neural and clustering models
Abstract
Speaker diarization has been investigated extensively as a central task for meeting analysis. A recent trend shows that the integration of end-to-end neural (EEND)- and clustering-based diarization is a promising approach for handling realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and it has achieved state-of-the-art results on various tasks. However, the approaches proposed so far have not yet realized tight integration, because the clustering employed therein was not optimal in any sense for clustering the speaker embeddings estimated by the EEND module. To address this problem, this paper introduces a trainable clustering algorithm into the integration framework by deep-unfolding a non-parametric Bayesian model called the infinite Gaussian mixture model (iGMM). Specifically, the speaker embeddings are optimized during…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: End-to-End Neural Diarization
