Utterance-by-utterance overlap-aware neural diarization with Graph-PIT
Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

TL;DR
This paper introduces a segmentation-free neural diarization framework that performs utterance-by-utterance clustering using Graph-PIT, effectively handling overlapping speech and many speakers in full meetings.
Contribution
It proposes a novel segmentation-free diarization approach utilizing Graph-PIT, enabling end-to-end processing of entire meetings with overlapping speech.
Findings
Outperforms conventional segmentation-based methods on simulated and real datasets.
Effectively handles overlapping speech and a large number of speakers.
Demonstrates superior performance in meeting-like scenarios.
Abstract
Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level results via clustering to form a final global diarization result. The segmentation is done to limit the number of speakers in each segment since the current EEND cannot handle a large number of speakers. In this paper, we argue that such an approach involving the segmentation has several issues; for example, it inevitably faces a dilemma that larger segment sizes increase both the context available for enhancing the performance and the number of speakers for the local EEND module to handle. To…
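The segment-size dilemma described in the abstract can be illustrated with a small sketch. This is not the authors' code: the toy annotations, the `speakers_per_segment` helper, and the segment lengths are all hypothetical, chosen only to show that longer fixed-length segments tend to contain more distinct speakers for a local diarization module to handle.

```python
from typing import List, Tuple

# (speaker, start_sec, end_sec): hypothetical toy annotations, not real data
Utterance = Tuple[str, float, float]


def speakers_per_segment(
    utterances: List[Utterance], total_sec: float, segment_sec: float
) -> List[int]:
    """Count distinct speakers active in each fixed-length segment."""
    counts = []
    start = 0.0
    while start < total_sec:
        end = min(start + segment_sec, total_sec)
        # A speaker is active in [start, end) if any utterance overlaps it.
        active = {spk for spk, s, e in utterances if s < end and e > start}
        counts.append(len(active))
        start += segment_sec
    return counts


# Hypothetical 20-second meeting with four speakers and brief overlaps.
toy_meeting: List[Utterance] = [
    ("A", 0.0, 4.0),
    ("B", 3.0, 8.0),
    ("C", 9.0, 14.0),
    ("D", 13.0, 19.0),
    ("A", 18.0, 20.0),
]

for segment_sec in (5.0, 10.0, 20.0):
    counts = speakers_per_segment(toy_meeting, 20.0, segment_sec)
    print(f"segment={segment_sec:>4}s  speakers per segment={counts}")
```

In this toy case, 5-second segments each contain two speakers, 10-second segments three, and a single 20-second segment all four, so the extra context comes with a larger local speaker count.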
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: End-to-End Neural Diarization
