Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

Shota Horiguchi; Shinji Watanabe; Paola Garcia; Yawen Xue; Yuki Takashima; Yohei Kawaguchi

arXiv:2107.01545 · eess.AS · September 24, 2021 · 1 citation

TL;DR

This paper embeds an unsupervised clustering step in attractor-based end-to-end diarization, so the model can handle more speakers than it observed during training. The approach is reported to outperform previous end-to-end methods on multiple datasets.

Contribution

The paper integrates unsupervised clustering into attractor-based diarization, so the number of speakers it can handle is no longer capped by the number observed during training.

Findings

01 Achieved accurate diarization for an unseen number of speakers.

02 Outperformed conventional end-to-end diarization methods on multiple datasets.

03 Demonstrated robustness in challenging diarization scenarios.

Abstract

Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of…
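
The pipeline the abstract outlines (split the embeddings, diarize each subsequence, then match speakers across subsequences by clustering attractor-derived vectors) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the greedy cosine-similarity clustering, the `threshold` value, and the rule that two speakers in the same subsequence never share a global identity are all assumptions made for illustration. The per-subsequence attractor vectors are taken as given rather than produced by a trained model.

```python
import numpy as np


def split_into_subsequences(embeddings, length):
    """Split a (T, D) array of frame-wise embeddings into chunks of at most `length` frames."""
    return [embeddings[i:i + length] for i in range(0, len(embeddings), length)]


def link_local_speakers(attractors_per_chunk, threshold=0.5):
    """Map each local attractor vector to a global speaker id.

    Hypothetical stand-in for the unsupervised clustering step: a greedy
    pass that compares each vector with one representative per global
    speaker using cosine similarity.
    """
    centroids = []  # one unit-norm representative per global speaker
    labels = []     # labels[c][k] = global id of local speaker k in chunk c
    for attractors in attractors_per_chunk:
        chunk_labels = []
        for vec in attractors:
            unit = vec / np.linalg.norm(vec)
            # Assumption: speakers inside one chunk are distinct, so global
            # ids already used in this chunk are not candidates.
            candidates = [
                (float(unit @ c), k)
                for k, c in enumerate(centroids)
                if k not in chunk_labels and float(unit @ c) >= threshold
            ]
            if candidates:
                chunk_labels.append(max(candidates)[1])
            else:
                # No sufficiently similar speaker yet: open a new global id.
                centroids.append(unit)
                chunk_labels.append(len(centroids) - 1)
        labels.append(chunk_labels)
    return labels


# Toy example: three subsequences with hand-written 2-D "attractor" vectors.
toy_attractors = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0], [0.9, 0.1]]),
    np.array([[-1.0, 0.0]]),
]
global_labels = link_local_speakers(toy_attractors)
chunks = split_into_subsequences(np.zeros((10, 4)), length=4)
```

Because global identities are created on demand, the total number of speakers in this sketch is not tied to the number of attractors any single subsequence yields, which is the property the abstract is after.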

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing