Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation
Juan M. Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset

TL;DR
This paper introduces an online speaker diarization method that combines incremental clustering with local diarization applied to a rolling buffer updated every 500ms. It builds on an end-to-end overlap-aware segmentation model, and its latency can be adjusted between 500ms and 5s.
Contribution
It presents an online diarization pipeline that integrates overlap-aware segmentation with a modified statistics pooling layer and cannot-link constraints, enabling low-latency operation and improved accuracy.
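The cannot-link idea can be illustrated with a minimal sketch: two speakers detected in the same buffer must never be mapped to the same global speaker. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the greedy matching strategy, the cosine distance, and the `threshold` value are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign_local_speakers(centroids, local_embeddings, threshold=0.5):
    """Map each local speaker to a global speaker index.

    Cannot-link constraint: local speakers come from the same buffer,
    so no two of them may share a global speaker. A local speaker that
    cannot be matched starts a new centroid (appended to `centroids`).
    """
    # All (distance, local index, global index) pairs, closest first.
    pairs = sorted(
        (cosine_distance(emb, cen), i, j)
        for i, emb in enumerate(local_embeddings)
        for j, cen in enumerate(centroids)
    )
    assignment = {}
    used_globals = set()
    for dist, i, j in pairs:
        if dist > threshold:
            break  # remaining pairs are too far apart to be matched
        if i in assignment or j in used_globals:
            continue  # enforce the one-to-one (cannot-link) mapping
        assignment[i] = j
        used_globals.add(j)
    # Unmatched local speakers become new global speakers.
    for i, emb in enumerate(local_embeddings):
        if i not in assignment:
            centroids.append(list(emb))
            assignment[i] = len(centroids) - 1
    return [assignment[i] for i in range(len(local_embeddings))]
```

In this sketch, two local speakers that are both close to the same existing centroid are not merged: the closer one keeps the centroid and the other starts a new one.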
Findings
Effective overlap-aware segmentation improves diarization accuracy.
Latency can be adjusted between 500ms and 5s, and its effect on performance is analyzed systematically.
The method outperforms baseline approaches on the AMI, DIHARD, and VoxConverse datasets.
Abstract
We propose to address online speaker diarization as a combination of incremental clustering and local diarization applied to a rolling buffer updated every 500ms. Every single step of the proposed pipeline is designed to take full advantage of the strong ability of a recently proposed end-to-end overlap-aware segmentation to detect and separate overlapping speakers. In particular, we propose a modified version of the statistics pooling layer (initially introduced in the x-vector architecture) to give less weight to frames where the segmentation model predicts simultaneous speakers. Furthermore, we derive cannot-link constraints from the initial segmentation step to prevent two local speakers from being wrongfully merged during the incremental clustering step. Finally, we show how the latency of the proposed approach can be adjusted between 500ms and 5s to match the requirements of a…
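The modified statistics pooling described in the abstract can be sketched as a weighted mean and standard deviation over frames, where frames with predicted overlap receive less weight. The weighting formula below (target activity times one minus the strongest competing activity) is an assumption chosen for illustration and is not taken from the paper; the function name and argument layout are likewise hypothetical.

```python
import numpy as np


def overlap_aware_stats_pooling(frames, speaker_probs, target):
    """Weighted statistics pooling for one local speaker.

    frames:        (T, D) frame-level features
    speaker_probs: (T, K) per-frame activity of K local speakers, in [0, 1]
    target:        index of the speaker to embed
    Returns a (2 * D,) vector: weighted mean followed by weighted std.
    """
    frames = np.asarray(frames, dtype=float)
    probs = np.asarray(speaker_probs, dtype=float)

    # Strongest activity among the other speakers at each frame.
    others = np.delete(probs, target, axis=1)
    if others.shape[1] > 0:
        max_other = others.max(axis=1)
    else:
        max_other = np.zeros(probs.shape[0])

    # Assumed weighting: high when the target speaks alone,
    # low when another speaker is predicted to be active too.
    weights = probs[:, target] * (1.0 - max_other)
    total = weights.sum()
    if total <= 0:
        raise ValueError("target speaker has no usable frames")
    weights = weights / total

    mean = weights @ frames
    var = weights @ (frames - mean) ** 2
    std = np.sqrt(np.maximum(var, 0.0))
    return np.concatenate([mean, std])
```

With uniform weights this reduces to standard statistics pooling as used in the x-vector architecture; a frame where a second speaker is fully active contributes nothing to the statistics.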
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Video Analysis and Summarization
