O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization
Elio Gruttadauria (IP Paris, LTCI, IDS, S2A), Mathieu Fontaine (LTCI, IP Paris), Jonathan Le Roux, Slim Essid (IDS, S2A, LTCI)

TL;DR
O-EENC-SD is a novel online speaker diarization system that offers a hyperparameter-free, efficient, and competitive solution for two-speaker conversations, balancing accuracy and computational complexity.
Contribution
The paper introduces a new online end-to-end neural clustering method with a centroid refinement decoder and an RNN-based stitching mechanism, improving efficiency and simplicity over existing approaches.
Findings
Achieves a competitive diarization error rate (DER) on the two-speaker CallHome dataset
Offers a hyperparameter-free alternative to unsupervised clustering methods
Keeps computational cost low, even when processing independent, non-overlapping chunks
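For readers unfamiliar with the metric named in the findings, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time, after choosing the best mapping between reference and predicted speaker labels. The sketch below is a simplified frame-level version for illustration only. It is not the scoring tool used in the paper, it ignores forgiveness collars, and the function name `frame_level_der` and the input format (one tuple of 0/1 activities per frame) are assumptions.

```python
from itertools import permutations


def frame_level_der(ref, hyp):
    """Simplified frame-level DER (illustrative, not the paper's scorer).

    ref, hyp: lists of equal length; each item is a tuple of 0/1 speaker
    activities for one frame, with the same number of speakers in both.
    """
    n_speakers = len(ref[0])
    total_ref = sum(sum(frame) for frame in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    best_errors = None
    # Speaker labels are arbitrary, so try every mapping of hypothesis
    # speakers onto reference speakers and keep the lowest error count.
    for perm in permutations(range(n_speakers)):
        errors = 0
        for r, h in zip(ref, hyp):
            h_perm = [h[p] for p in perm]
            n_ref, n_hyp = sum(r), sum(h_perm)
            n_correct = sum(a & b for a, b in zip(r, h_perm))
            missed = max(0, n_ref - n_hyp)
            false_alarm = max(0, n_hyp - n_ref)
            confusion = min(n_ref, n_hyp) - n_correct
            errors += missed + false_alarm + confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref
```

Because the mapping is searched exhaustively, this is only practical for a small number of speakers, which matches the two-speaker setting discussed here.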
Abstract
We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.
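To illustrate why a stitching step is needed when chunks are processed independently: each chunk yields its speakers in an arbitrary local order, so local speakers must be mapped to consistent global identities. The sketch below is a deliberately simple stand-in that matches per-chunk speaker embeddings to running-mean centroids by cosine similarity. It is not the RNN-based stitching mechanism or the centroid refinement decoder from the paper, whose details are not given here. The names `cosine` and `stitch_chunks`, the embedding format, and the assumption that every chunk contains the same number of speakers as the first are all hypothetical.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def stitch_chunks(chunks):
    """Map local speaker indices to global ones, chunk by chunk.

    chunks: list of chunks; each chunk is a list of speaker embeddings
    (one per local speaker). Assumes every chunk has the same number of
    speakers as the first chunk.
    Returns one tuple per chunk: position i holds the global index
    assigned to local speaker i.
    """
    # The first chunk defines the global speaker order.
    centroids = [list(e) for e in chunks[0]]
    counts = [1] * len(centroids)
    mappings = [tuple(range(len(centroids)))]

    for local in chunks[1:]:
        # Pick the assignment with the highest total similarity.
        best_perm = max(
            permutations(range(len(centroids))),
            key=lambda perm: sum(
                cosine(local[i], centroids[g]) for i, g in enumerate(perm)
            ),
        )
        # Update each matched centroid as a running mean.
        for i, g in enumerate(best_perm):
            counts[g] += 1
            centroids[g] = [
                c + (x - c) / counts[g] for c, x in zip(centroids[g], local[i])
            ]
        mappings.append(best_perm)
    return mappings
```

Only the centroids and counts are carried from one chunk to the next, which is the property that makes this kind of online processing inexpensive compared with re-clustering the whole recording.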
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Authorship Attribution and Profiling
