End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

Yusuke Fujita; Shinji Watanabe; Shota Horiguchi; Yawen Xue; Kenji Nagamatsu

arXiv:2003.02966·eess.AS·March 9, 2020·43 cites

Open Access · 1 Repo

TL;DR

This paper introduces End-to-End Neural Diarization (EEND), a neural network approach that directly outputs speaker diarization results from multi-speaker recordings, effectively handling overlaps and outperforming traditional clustering methods.

Contribution

The paper proposes a novel end-to-end neural network model formulated as a multi-label classification problem with a permutation-free objective, improving speaker diarization accuracy and overlap handling.
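To make the idea of a permutation-free objective concrete, here is a minimal sketch in plain Python. It assumes the network emits one probability per speaker per frame and that the loss is the binary cross-entropy averaged over frames and speakers, minimized over every assignment of output channels to reference speakers. The function names (`bce`, `permutation_free_loss`), the clipping constant, and the normalization are illustrative assumptions, not the authors' code.

```python
import itertools
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def permutation_free_loss(probs, labels):
    """Return (loss, perm) minimizing mean BCE over speaker permutations.

    probs, labels: lists of T frames, each a list of S per-speaker values.
    perm[s] is the reference speaker matched to output channel s.
    Hypothetical helper for illustration only.
    """
    num_frames = len(probs)
    num_speakers = len(probs[0])
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(num_speakers)):
        total = 0.0
        for t in range(num_frames):
            for s in range(num_speakers):
                total += bce(probs[t][s], labels[t][perm[s]])
        loss = total / (num_frames * num_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Reference labels: frame 1 has both speakers active (overlap).
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
# Outputs that are accurate but list the speakers in swapped order.
probs = [[0.1, 0.9], [0.8, 0.9], [0.9, 0.2], [0.1, 0.1]]
loss, perm = permutation_free_loss(probs, labels)
```

Because the minimum is taken over all orderings, the swapped outputs above are not penalized for labeling the speakers in a different order than the reference. Enumerating permutations grows factorially with the number of speakers, so this brute-force form is only practical for a small, fixed speaker count.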

Findings

01. EEND outperforms state-of-the-art clustering-based methods.

02. Self-attention neural networks effectively capture global speaker characteristics.

03. The model adapts easily to real conversations with speaker overlaps.

Abstract

The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems: (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and…
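As an illustration of why a multi-label formulation can represent overlapping speech, the sketch below turns frame-wise, per-speaker probabilities into speaker segments. Each speaker is thresholded independently, so two speakers can be active in the same frame. The function name `decode_segments`, the 0.5 threshold, and the use of frame indices rather than seconds are assumptions made for this example, not details taken from the paper.

```python
def decode_segments(probs, threshold=0.5):
    """Convert frame-wise speaker probabilities into segments.

    probs: list of T frames, each a list of S per-speaker probabilities.
    Returns (speaker, start_frame, end_frame) tuples, end exclusive.
    Hypothetical helper for illustration only.
    """
    num_frames = len(probs)
    num_speakers = len(probs[0])
    segments = []
    for s in range(num_speakers):
        start = None  # frame where the current active run began
        for t in range(num_frames):
            active = probs[t][s] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((s, start, t))
                start = None
        if start is not None:  # run extends to the end of the recording
            segments.append((s, start, num_frames))
    return segments


# Frame 1 has both speakers above the threshold, i.e. overlapping speech.
probs = [[0.9, 0.1], [0.9, 0.8], [0.2, 0.9], [0.1, 0.1]]
segments = decode_segments(probs)
```

Here speaker 0 is active for frames 0 to 1 and speaker 1 for frames 1 to 2, so frame 1 is attributed to both. A clustering pipeline that assigns each segment to exactly one cluster has no equivalent way to express that frame.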

Code & Models

Repositories

Xflick/EEND_PyTorch
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: End-to-End Neural Diarization