Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors
Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yawen Xue, Yuki Takashima, Yohei Kawaguchi

TL;DR
This paper embeds an unsupervised clustering step in attractor-based end-to-end diarization, so the number of speakers it can handle is no longer capped by the maximum seen during training. The method is reported to surpass previous end-to-end methods on multiple datasets.
Contribution
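It splits the frame-wise embeddings into short subsequences, runs attractor-based diarization on each one, and matches speakers across subsequences by unsupervised clustering of vectors computed from the attractors. This removes the dependence on supervised speaker counting, which limits earlier attractor-based models to the speaker counts seen in training.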
Findings
Achieved accurate diarization for recordings with more speakers than were observed during training.
Outperformed conventional end-to-end diarization methods on multiple datasets.
Demonstrated robustness in challenging diarization scenarios.
Abstract
Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of…
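The pipeline described in the abstract (split, diarize each subsequence, then match speakers across subsequences) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The per-subsequence attractor model is not reproduced: its outputs (local attractors and frame-wise speaker activities) are assumed to be given. The clustering here is a simple greedy cosine-similarity stand-in applied directly to the attractors, whereas the paper clusters vectors computed from them. The function names and the `threshold` parameter are hypothetical.

```python
import numpy as np


def split_into_subsequences(embeddings, length):
    """Split a (T, D) array of frame-wise embeddings into chunks of at most `length` frames."""
    return [embeddings[i:i + length] for i in range(0, len(embeddings), length)]


def cluster_attractors(local_attractors, threshold=0.5):
    """Greedy cosine-similarity clustering of attractors across subsequences.

    This is an illustrative stand-in, not the clustering used in the paper.
    local_attractors: list of (S_i, D) arrays, one per subsequence (S_i >= 1,
    non-zero rows assumed).
    Returns one integer array per subsequence mapping each local attractor
    to a global speaker id. Attractors from the same subsequence are never
    merged, since they are assumed to represent different speakers.
    """
    centroids = []  # running sum of unit vectors for each global speaker
    labels = []
    for attractors in local_attractors:
        unit = attractors / np.linalg.norm(attractors, axis=1, keepdims=True)
        used = set()  # global ids already taken within this subsequence
        ids = []
        for vec in unit:
            best_id, best_sim = -1, threshold
            for k, centroid in enumerate(centroids):
                if k in used:
                    continue
                sim = float(vec @ centroid / np.linalg.norm(centroid))
                if sim > best_sim:
                    best_id, best_sim = k, sim
            if best_id < 0:
                # No sufficiently similar speaker yet: open a new cluster.
                centroids.append(vec.copy())
                best_id = len(centroids) - 1
            else:
                centroids[best_id] += vec
            used.add(best_id)
            ids.append(best_id)
        labels.append(np.array(ids, dtype=int))
    return labels


def stitch(local_activities, labels):
    """Place each subsequence's (T_i, S_i) activities into a global (T, K) matrix."""
    n_speakers = max(int(ids.max()) for ids in labels) + 1
    total_frames = sum(len(act) for act in local_activities)
    out = np.zeros((total_frames, n_speakers), dtype=local_activities[0].dtype)
    start = 0
    for act, ids in zip(local_activities, labels):
        out[start:start + len(act), ids] = act
        start += len(act)
    return out
```

Because the number of clusters grows as new attractors fail to match existing ones, the total speaker count in this sketch is not bounded by the number of attractors any single subsequence can produce, which is the property the abstract emphasizes.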
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
