O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization

Elio Gruttadauria (IP Paris; LTCI; IDS; S2A); Mathieu Fontaine (LTCI; IP Paris); Jonathan Le Roux; Slim Essid (IDS; S2A; LTCI)

arXiv:2512.15229 · cs.LG · December 18, 2025

Open Access

TL;DR

O-EENC-SD is a novel online speaker diarization system that offers a hyperparameter-free, efficient, and competitive solution for two-speaker conversations, balancing accuracy and computational complexity.

Contribution

The paper introduces a new online end-to-end neural clustering method with a centroid refinement decoder and an RNN-based stitching mechanism, improving efficiency and simplicity over existing approaches.

Findings

01 Achieves competitive DER on the CallHome dataset

02 Offers a hyperparameter-free alternative to unsupervised clustering methods

03 Provides an efficient solution with low computational cost

Abstract

We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.
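The abstract reports results as DER (diarization error rate), the sum of missed speech, false alarm, and speaker confusion divided by the total reference speech. As a rough illustration only, the sketch below computes a simplified frame-level DER for a fixed number of speakers, trying every speaker permutation because diarization labels are arbitrary. The function name `frame_der` and the input layout are assumptions made for this example; the paper's actual scoring setup (collar, overlap handling, tooling) is not described here and may differ.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (hypothetical helper, not the paper's scorer).

    reference, hypothesis: equal-length lists of frames; each frame is a
    tuple of 0/1 activity flags, one per speaker.
    """
    n_spk = len(reference[0])
    total_ref = sum(sum(frame) for frame in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    best_errors = None
    # Speaker labels are arbitrary, so score every mapping and keep the best.
    for perm in permutations(range(n_spk)):
        miss = false_alarm = confusion = 0
        for ref, hyp in zip(reference, hypothesis):
            hyp_p = tuple(hyp[p] for p in perm)
            n_ref, n_hyp = sum(ref), sum(hyp_p)
            n_correct = sum(r & h for r, h in zip(ref, hyp_p))
            miss += max(0, n_ref - n_hyp)          # reference speech not detected
            false_alarm += max(0, n_hyp - n_ref)   # speech predicted but absent
            confusion += min(n_ref, n_hyp) - n_correct  # wrong speaker
        errors = miss + false_alarm + confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref


# Toy two-speaker example: labels are swapped and the last frame is missed.
reference = [(1, 0), (1, 0), (0, 1), (0, 1)]
hypothesis = [(0, 1), (0, 1), (1, 0), (0, 0)]
print(frame_der(reference, hypothesis))  # 0.25
```

In the toy example the swapped labels cost nothing once the best permutation is chosen, so the only error is one missed frame out of four reference speech frames.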

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Authorship Attribution and Profiling