DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding

Neil Zeghidour; Olivier Teboul; David Grangier

arXiv:2105.13802·cs.SD·May 31, 2021

TL;DR

DIVE is an end-to-end neural speaker diarization system that iteratively builds a representation for each speaker and predicts each speaker's voice activity from those representations. It needs no pretrained speaker embeddings and reports state-of-the-art results on the CALLHOME benchmark.

Contribution

DIVE introduces an iterative speaker embedding procedure that resolves the speaker-ordering ambiguity without a permutation invariant training loss, and it trains without pretrained speaker models.

Findings

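01. Achieves 6.7% DER on CALLHOME, which the authors report as a new state of the art on this benchmark.

02. Does not require pretrained speaker representations.

03. Optimizes all parameters with a multi-speaker voice activity loss that excludes unreliable speaker turn boundaries from training.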
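
For readers unfamiliar with the metric in the first finding, the block below gives the standard definition of Diarization Error Rate. The symbol names are chosen here for illustration and do not come from the paper. Under collar-based scoring, a short window around every reference speaker turn boundary is left out of all four durations before the ratio is computed.

```latex
\mathrm{DER} =
  \frac{T_{\mathrm{false\ alarm}} + T_{\mathrm{missed}} + T_{\mathrm{confusion}}}
       {T_{\mathrm{reference\ speech}}}
```

Here $T_{\mathrm{false\ alarm}}$ is the duration of non-speech labeled as speech, $T_{\mathrm{missed}}$ the duration of speech labeled as non-speech, $T_{\mathrm{confusion}}$ the duration of speech attributed to the wrong speaker, and $T_{\mathrm{reference\ speech}}$ the total speaker time in the reference annotation.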

Abstract

We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice activity of each speaker conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and optimizes all parameters of the system with a multi-speaker voice activity loss. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the standard CALLHOME benchmark,…
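
The abstract describes a multi-speaker voice activity loss that leaves unreliable speaker turn boundaries out of training. The following is a minimal NumPy sketch of that idea, not the authors' implementation: it assumes a per-frame binary cross-entropy, a collar measured in frames, and the hypothetical helper names `boundary_mask` and `masked_vad_loss`. The paper's exact loss and collar width are not given in this excerpt.

```python
import numpy as np


def boundary_mask(labels: np.ndarray, collar: int) -> np.ndarray:
    """Return a boolean mask over frames, False near speaker turn boundaries.

    labels: (num_speakers, num_frames) binary voice activity targets.
    collar: number of frames to drop on each side of a boundary (assumption).
    """
    num_frames = labels.shape[1]
    # A boundary lies between frames t-1 and t when any speaker's activity changes.
    change = np.any(labels[:, 1:] != labels[:, :-1], axis=0)
    keep = np.ones(num_frames, dtype=bool)
    for t in np.flatnonzero(change) + 1:  # t is the first frame after the change
        keep[max(0, t - collar):min(num_frames, t + collar)] = False
    return keep


def masked_vad_loss(probs: np.ndarray, labels: np.ndarray,
                    collar: int, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over speakers and frames outside the collar."""
    keep = boundary_mask(labels, collar)
    if not keep.any():
        return 0.0  # every frame sits inside a collar; nothing to train on
    p = np.clip(probs, eps, 1.0 - eps)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1.0 - p))
    return float(bce[:, keep].mean())


# Two speakers, eight frames: speaker 0 stops after frame 3, speaker 1 starts at frame 6.
labels = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0, 1, 1]])
keep = boundary_mask(labels, collar=1)
```

With `collar=1`, the mask drops the frames on either side of each turn change, so prediction errors confined to those frames do not contribute to the loss. This mirrors how collar-based DER scoring ignores the same regions at evaluation time.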
