SCDiar: a streaming diarization system based on speaker change detection and speech recognition

Naijun Zheng; Xucheng Wan; Kai Liu; Zhou Huan

arXiv:2501.16641 · eess.AS · January 29, 2025

TL;DR

SCDiar is a real-time speaker diarization system for long meetings. It splits speech into segments at speaker changes detected at the token level and selects the best available segment for each speaker, outperforming previous systems by up to 53.6% in accuracy on real-world meeting data.

Contribution

The paper introduces SCDiar, a streaming diarization system that uses a speaker change detection (SCD) module to split speech into segments at the token level, then applies several enhancements to select the best available segment for each speaker, improving diarization accuracy.
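The token-level splitting can be illustrated with a minimal Python sketch. It assumes that an upstream speech recognizer emits timestamped tokens and that the SCD module flags each token after which the speaker changes. The `Token` and `Segment` classes, the `change_after` flag, and `split_at_speaker_changes` are hypothetical names chosen for illustration, not the paper's interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: float        # token start time in seconds
    end: float          # token end time in seconds
    change_after: bool  # hypothetical SCD output: speaker changes after this token


@dataclass
class Segment:
    tokens: List[Token]

    @property
    def start(self) -> float:
        return self.tokens[0].start

    @property
    def end(self) -> float:
        return self.tokens[-1].end

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)


def split_at_speaker_changes(tokens: List[Token]) -> List[Segment]:
    """Close the current segment at every token the (hypothetical) SCD flag marks."""
    segments: List[Segment] = []
    current: List[Token] = []
    for tok in tokens:
        current.append(tok)
        if tok.change_after:
            segments.append(Segment(current))
            current = []
    if current:  # flush trailing tokens that have no change after them yet
        segments.append(Segment(current))
    return segments


# Usage: two speakers, with a change detected after "morning".
toks = [
    Token("good", 0.0, 0.3, False),
    Token("morning", 0.3, 0.8, True),
    Token("hi", 1.0, 1.2, False),
    Token("there", 1.2, 1.5, False),
]
segs = split_at_speaker_changes(toks)
print([s.text for s in segs])  # ['good morning', 'hi there']
```

Because boundaries fall between tokens rather than at fixed time windows, every segment ends on a word boundary. In a true streaming setting the trailing segment would stay open until more tokens arrive; this sketch simply flushes it.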

Findings

1. Outperforms previous systems by up to 53.6% in accuracy on real-world meeting data with more than ten participants.
2. Reduces the performance gap between online and offline diarization systems.
3. Demonstrates significant gains across various benchmark datasets.

Abstract

In hours-long meeting scenarios, accurate speaker diarization of a real-time speech stream is difficult to achieve and commonly leads to speaker identification and speaker count errors. To address this challenge, we propose SCDiar, a system that operates on speech segments split at the token level by a speaker change detection (SCD) module. Building on these segments, we introduce several enhancements to efficiently select the best available segment for each speaker. These improvements lead to significant gains across various benchmarks. Notably, on real-world meeting data involving more than ten participants, SCDiar outperforms previous systems by up to 53.6% in accuracy, substantially narrowing the performance gap between online and offline systems.
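Selecting the best available segment for each speaker can be sketched as an incremental update, which suits a streaming pipeline because each new segment is compared only against the current best for its speaker. The paper's actual selection criteria are not described here, so this Python sketch assumes that speaker labels are assigned upstream and uses segment duration as a stand-in quality score. `LabeledSegment`, `segment_score`, and `BestSegmentPerSpeaker` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LabeledSegment:
    speaker: str  # speaker label assigned upstream (hypothetical)
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def segment_score(seg: LabeledSegment) -> float:
    """Stand-in quality score (assumption): longer segments rank higher."""
    return seg.duration


class BestSegmentPerSpeaker:
    """Keep, for each speaker, the highest-scoring segment seen so far."""

    def __init__(self) -> None:
        self.best: Dict[str, LabeledSegment] = {}

    def update(self, seg: LabeledSegment) -> bool:
        """Offer a new segment; return True if it replaced the speaker's current best."""
        current = self.best.get(seg.speaker)
        if current is None or segment_score(seg) > segment_score(current):
            self.best[seg.speaker] = seg
            return True
        return False

    def update_all(self, segs: Iterable[LabeledSegment]) -> None:
        for s in segs:
            self.update(s)


# Usage: segments arrive in stream order; each is compared only to its speaker's best.
sel = BestSegmentPerSpeaker()
sel.update_all([
    LabeledSegment("A", 0.0, 1.5),
    LabeledSegment("B", 1.5, 4.0),
    LabeledSegment("A", 4.0, 7.0),
    LabeledSegment("B", 7.0, 7.5),
])
print({spk: (s.start, s.end) for spk, s in sel.best.items()})
# {'A': (4.0, 7.0), 'B': (1.5, 4.0)}
```

Replacing `segment_score` changes the selection policy without touching the update logic, and each update costs a single dictionary lookup regardless of how long the meeting runs.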

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques