Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency
Roman Aperdannier, Sigurd Schacht, Alexander Piazza

TL;DR
This paper systematically evaluates the latency of various online speaker diarization systems on identical hardware and data, highlighting the performance of DIART and FS-EEND systems.
Contribution
It provides the first comparative analysis of online diarization systems focusing on latency, using standardized hardware and datasets.
Findings
DIART pipeline with the pyannote/embedding and pyannote/segmentation models achieves the lowest latency
FS-EEND system demonstrates comparably low latency
No prior research compares online diarization systems based on latency
Abstract
In this paper, different online speaker diarization systems are evaluated on the same hardware and with the same test data with regard to their latency. The latency is the time span from audio input to the output of the corresponding speaker label. The evaluation compares various model combinations within the DIART framework, a diarization system based on the online clustering algorithm UIS-RNN-SML, and the end-to-end online diarization system FS-EEND. The lowest latency is achieved by the DIART pipeline with the embedding model pyannote/embedding and the segmentation model pyannote/segmentation. The FS-EEND system shows a similarly low latency. There is currently no published research that compares several online diarization systems in terms of their latency, which makes this work all the more relevant.
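The latency definition in the abstract (time from audio input to the output of the corresponding speaker label) can be illustrated with a minimal timing sketch. This is not the authors' benchmark code: `measure_latency`, `summarize`, and the `process_chunk` callable are hypothetical names, and the sketch only times the processing of each chunk. Any buffering delay introduced by the chunk length itself would have to be added separately.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable, List


def measure_latency(
    chunks: Iterable[bytes],
    process_chunk: Callable[[bytes], str],
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Return per-chunk latency in seconds.

    `process_chunk` is a hypothetical stand-in for any online diarization
    system: it receives one audio chunk and returns a speaker label.
    """
    latencies = []
    for chunk in chunks:
        start = clock()            # audio chunk is handed to the system
        process_chunk(chunk)       # system emits the speaker label
        latencies.append(clock() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Aggregate per-chunk latencies into mean and worst case."""
    return {"mean": mean(latencies), "max": max(latencies)}
```

Running the same chunks through each system on the same machine and comparing the `summarize` results mirrors the like-for-like setup the abstract describes.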
Taxonomy
Topics: Speech Recognition and Synthesis
