Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

Keisuke Kinoshita; Thilo von Neumann; Marc Delcroix; Christoph Boeddeker; Reinhold Haeb-Umbach

arXiv:2207.13888·eess.AS·July 29, 2022

TL;DR

This paper introduces a segmentation-free neural diarization framework that performs utterance-by-utterance clustering using Graph-PIT, effectively handling overlapping speech and many speakers in full meetings.

Contribution

It proposes a novel segmentation-free diarization approach utilizing Graph-PIT, enabling end-to-end processing of entire meetings with overlapping speech.
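
As rough intuition for the Graph-PIT idea named here, and only as a sketch under assumptions: Graph-PIT is generally described as treating utterances as graph vertices, connecting those that overlap in time, and assigning utterances to a small number of output channels so that connected utterances never share a channel (a graph coloring). The Python below is not the fgnt/graph_pit API. The function names and the toy utterances are hypothetical, and a simple greedy coloring stands in for whatever search the real implementation performs.

```python
from typing import Dict, List, Tuple

# (start_sec, end_sec); hypothetical toy utterances, not data from the paper.
Utterance = Tuple[float, float]


def overlap_graph(utts: List[Utterance]) -> Dict[int, List[int]]:
    """Connect every pair of utterances that overlap in time."""
    graph: Dict[int, List[int]] = {i: [] for i in range(len(utts))}
    for i, (s1, e1) in enumerate(utts):
        for j in range(i + 1, len(utts)):
            s2, e2 = utts[j]
            if s1 < e2 and s2 < e1:  # the two intervals intersect
                graph[i].append(j)
                graph[j].append(i)
    return graph


def greedy_channel_assignment(utts: List[Utterance],
                              num_channels: int) -> List[int]:
    """Assign each utterance a channel so overlapping ones never share it."""
    graph = overlap_graph(utts)
    channel: Dict[int, int] = {}
    # Visit utterances in order of start time and take the lowest free channel.
    for i in sorted(range(len(utts)), key=lambda k: utts[k][0]):
        used = {channel[j] for j in graph[i] if j in channel}
        free = [c for c in range(num_channels) if c not in used]
        if not free:
            raise ValueError("more simultaneous utterances than output channels")
        channel[i] = free[0]
    return [channel[i] for i in range(len(utts))]


toy_utterances = [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0), (9.0, 12.0)]
print(greedy_channel_assignment(toy_utterances, num_channels=2))
```

In this toy case four utterances fit on two channels because no more than two are ever active at once. That is the intuition behind handling many speakers with a fixed number of outputs.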

Findings

01 Outperforms conventional segmentation-based methods on simulated and real datasets.

02 Effectively handles overlapping speech and a large number of speakers.

03 Demonstrates superior performance in meeting-like scenarios.

Abstract

Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level results via clustering to form a final global diarization result. The segmentation is done to limit the number of speakers in each segment since the current EEND cannot handle a large number of speakers. In this paper, we argue that such an approach involving the segmentation has several issues; for example, it inevitably faces a dilemma that larger segment sizes increase both the context available for enhancing the performance and the number of speakers for the local EEND module to handle. To…
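
The segment-size dilemma described in the abstract can be illustrated with a small sketch. The Python below is not the paper's code: the toy meeting timeline, the function name `max_speakers_per_segment`, and the segment lengths are all hypothetical. It cuts a recording into fixed-length segments and reports the largest number of distinct speakers found in any single segment, which is the quantity a local EEND module would have to handle.

```python
from typing import List, Tuple

# (speaker, start_sec, end_sec); a made-up meeting, not data from the paper.
Utterance = Tuple[str, float, float]


def max_speakers_per_segment(utterances: List[Utterance],
                             total_len: float,
                             segment_len: float) -> int:
    """Largest number of distinct speakers active in any fixed-length segment."""
    worst = 0
    start = 0.0
    while start < total_len:
        end = min(start + segment_len, total_len)
        # A speaker counts if any of their utterances intersects [start, end).
        active = {spk for spk, s, e in utterances if s < end and e > start}
        worst = max(worst, len(active))
        start += segment_len
    return worst


toy_meeting = [
    ("A", 0.0, 8.0),
    ("B", 6.0, 14.0),
    ("C", 15.0, 24.0),
    ("D", 22.0, 30.0),
    ("A", 31.0, 40.0),
]

for seg in (5.0, 10.0, 20.0, 40.0):
    print(seg, max_speakers_per_segment(toy_meeting, 40.0, seg))
```

On this toy timeline, 5-second segments never contain more than two speakers, while a single 40-second segment contains all four. Longer segments give more context but also more speakers per segment.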

Code & Models

Repositories

fgnt/graph_pit
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: End-to-End Neural Diarization