SCDiar: a streaming diarization system based on speaker change detection and speech recognition
Naijun Zheng, Xucheng Wan, Kai Liu, Zhou Huan

TL;DR
SCDiar is a real-time speaker diarization system that improves accuracy in long meetings by detecting speaker changes at the token level and selecting optimal speech segments, outperforming previous methods significantly.
Contribution
The paper introduces SCDiar, a novel streaming diarization system that leverages speaker change detection at the token level and segment selection enhancements for improved accuracy.
Findings
Outperforms previous systems by up to 53.6% in accuracy on real-world meeting data involving more than ten participants.
Reduces the performance gap between online and offline diarization systems.
Demonstrates significant gains across various benchmark datasets.
Abstract
In hours-long meeting scenarios, accurate speaker diarization of a real-time speech stream is often difficult, commonly leading to speaker identification and speaker count errors. To address this challenge, we propose SCDiar, a system that operates on speech segments, split at the token level by a speaker change detection (SCD) module. Building on these segments, we introduce several enhancements to efficiently select the best available segment for each speaker. These improvements lead to significant gains across various benchmarks. Notably, on real-world meeting data involving more than ten participants, SCDiar outperforms previous systems by up to 53.6% in accuracy, substantially narrowing the performance gap between online and offline systems.
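The two ideas in the abstract, splitting a recognized token stream at detected speaker changes and keeping one "best" segment per speaker, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: the `Token` and `Segment` types, the boolean `change` flag standing in for the SCD module's output, and the use of segment duration as the selection criterion (the abstract does not say how the best segment is chosen).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    text: str
    start: float  # token start time in seconds
    end: float    # token end time in seconds
    change: bool  # hypothetical SCD flag: True if a new speaker starts here


@dataclass
class Segment:
    tokens: List[Token]

    @property
    def duration(self) -> float:
        return self.tokens[-1].end - self.tokens[0].start

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)


def split_at_speaker_changes(tokens: List[Token]) -> List[Segment]:
    """Cut the token stream into segments at every flagged speaker change."""
    segments: List[Segment] = []
    current: List[Token] = []
    for tok in tokens:
        # Close the running segment when a change is flagged on this token.
        if tok.change and current:
            segments.append(Segment(current))
            current = []
        current.append(tok)
    if current:
        segments.append(Segment(current))
    return segments


def keep_best_segment(best: Dict[str, Segment], speaker: str, seg: Segment) -> None:
    """Keep one segment per speaker; 'longer is better' is a placeholder criterion."""
    if speaker not in best or seg.duration > best[speaker].duration:
        best[speaker] = seg
```

In a streaming setting, `split_at_speaker_changes` would run on each newly recognized chunk of tokens, and `keep_best_segment` would be called once a segment has been assigned a speaker label, so each speaker's stored segment can only improve over time under the chosen criterion.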
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
