Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings
Ruoyu Wang, Shutong Niu, Gaobin Yang, Jun Du, Shuangqing Qian, Tian Gao, Jia Pan

TL;DR
This paper introduces a three-stage modular speaker diarization system that uses spatial cues from multi-channel recordings to improve accuracy and robustness in multi-party meetings, placing first in the CHiME-8 NOTSOFAR-1 challenge.
Contribution
The paper presents a three-stage modular approach that uses spatial cues from multi-channel speech to give each neural speaker diarization (NSD) decoding pass a more accurate initialization.
Findings
Achieved first place in the CHiME-8 NOTSOFAR-1 challenge.
Demonstrated improved speaker error rates with spatial cue integration.
Validated effectiveness of the multi-stage system through extensive evaluation.
Abstract
Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system to enhance single-channel neural speaker diarization systems and recognition performance by utilizing spatial cues from multi-channel speech to provide more accurate initialization for each stage of neural speaker diarization (NSD) decoding: (1) Overlap detection and continuous speech separation (CSS) on multi-channel speech are used to obtain cleaner single speaker speech segments for clustering, followed by the first NSD decoding pass. (2) The results from the first pass initialize a complex…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
