Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

Ruoyu Wang; Shutong Niu; Gaobin Yang; Jun Du; Shuangqing Qian; Tian Gao; Jia Pan

arXiv:2409.16803·eess.AS·September 26, 2024

TL;DR

This paper introduces a three-stage modular speaker diarization system that uses spatial cues from multi-channel recordings to improve accuracy and robustness in multi-party meetings, and it took first place in the CHiME-8 NOTSOFAR-1 challenge.

Contribution

The paper presents a novel multi-stage modular approach incorporating spatial cues for enhanced neural speaker diarization in multi-channel recordings.

Findings

01. Achieved first place in the CHiME-8 NOTSOFAR-1 challenge.

02. Demonstrated improved speaker error rates with spatial cue integration.

03. Validated effectiveness of the multi-stage system through extensive evaluation.

Abstract

Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system to enhance single-channel neural speaker diarization systems and recognition performance by utilizing spatial cues from multi-channel speech to provide more accurate initialization for each stage of neural speaker diarization (NSD) decoding: (1) Overlap detection and continuous speech separation (CSS) on multi-channel speech are used to obtain cleaner single speaker speech segments for clustering, followed by the first NSD decoding pass. (2) The results from the first pass initialize a complex…
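
Stage (1) of the abstract, discarding overlapped speech before clustering so that each cluster is initialized from single-speaker segments, can be illustrated with a small sketch. This is not the authors' implementation: the paper relies on neural overlap detection, CSS, and speaker embeddings, whereas the sketch below assumes that overlap regions and per-segment embeddings are already given, and it swaps in a simple greedy cosine clustering as a stand-in. All function names, the `max_ratio` and `threshold` parameters, and the toy data are hypothetical.

```python
from math import sqrt


def overlap_ratio(seg, overlaps):
    """Fraction of segment (start, end) covered by overlap regions.

    Assumes end > start and that the overlap regions do not
    intersect each other.
    """
    start, end = seg
    covered = sum(
        max(0.0, min(end, o_end) - max(start, o_start))
        for o_start, o_end in overlaps
    )
    return covered / (end - start)


def select_clean_segments(segments, overlaps, max_ratio=0.0):
    """Indices of segments whose overlapped fraction is at most max_ratio."""
    return [
        i for i, seg in enumerate(segments)
        if overlap_ratio(seg, overlaps) <= max_ratio
    ]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_cluster(embeddings, threshold=0.7):
    """Stand-in clustering (hypothetical, not the paper's method).

    Each embedding joins the most similar existing cluster if the
    cosine similarity reaches the threshold; otherwise it opens a
    new cluster. Clusters are represented by the running sum of
    their members, which has the same direction as their mean.
    """
    sums, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for k, centroid in enumerate(sums):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            sums.append(list(emb))
            labels.append(len(sums) - 1)
        else:
            sums[best] = [x + y for x, y in zip(sums[best], emb)]
            labels.append(best)
    return labels


# Toy data (made up for illustration): four 2-second segments,
# one detected overlap region, and 2-dimensional embeddings.
segments = [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 8.0)]
overlaps = [(2.5, 3.0)]
embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.9, 0.1]]

clean = select_clean_segments(segments, overlaps)
labels = greedy_cluster([embeddings[i] for i in clean])
print(clean, labels)
```

In this toy run the second segment is dropped because it intersects the overlap region, so its mixed embedding never influences the cluster assignments that would seed a subsequent decoding pass.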

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems