From Modular to End-to-End Speaker Diarization
Federico Landini

TL;DR
This thesis reviews the evolution from modular to end-to-end speaker diarization and introduces new models and data generation techniques that improve the handling of overlapped speech and of recordings with multiple speakers.
Contribution
It presents a new EEND-based model called DiaPer, compares it with VBx, and introduces a synthetic data generation method for training neural diarization models.
Findings
DiaPer outperforms EEND-EDA on recordings with many speakers and overlapped speech.
Synthetic data improves neural diarization training.
VBx, a clustering-based approach, remains effective.
Abstract
Speaker diarization is usually referred to as the task that determines "who spoke when" in a recording. Until a few years ago, all competitive approaches were modular. Systems based on this framework reached state-of-the-art performance in most scenarios but had major difficulties dealing with overlapped speech. More recently, the advent of end-to-end models, capable of dealing with all aspects of speaker diarization with a single model and better performing regarding overlapped speech, has brought high levels of attention. This thesis is framed during a period of co-existence of these two trends. We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx, which has shown remarkable performance on different datasets and challenges. We comment on its advantages and limitations and evaluate…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: End-to-End Neural Diarization
