The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023
Ming Cheng, Weiqing Wang, Xiaoyi Qin, Yuke Lin, Ning Jiang, Guoqing Zhao, Ming Li

TL;DR
This paper presents the DKU-MSXF diarization system for track 4 of the VoxCeleb Speaker Recognition Challenge 2023. It fuses multiple clustering-based and TSVAD-based systems with DOVER-Lap and reaches a 4.30% diarization error rate (DER), ranking first on the track 4 leaderboard.
Contribution
The paper describes a multi-stage, multi-model fusion pipeline for speaker diarization that took first place on track 4 of the challenge.
Findings
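Achieved 4.30% DER, ranking first on track 4 of the challenge leaderboard.
Fused clustering-based and TSVAD-based diarization systems using DOVER-Lap.
Pipeline comprises voice activity detection (VAD), clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection (TSVAD), each with a fused output from 3 sub-models.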
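As a rough illustration of what fusing several sub-model outputs at a single stage (for example VAD) can look like, the sketch below applies a per-frame majority vote to binary speech/non-speech decisions. The summary does not say how the sub-models are actually combined, so majority voting is an assumption made purely for illustration, and the function name `majority_vote` is hypothetical. This is not DOVER-Lap, which works on complete diarization outputs rather than binary frame decisions.

```python
def majority_vote(decisions):
    """Fuse per-frame binary decisions from several models by majority vote.

    decisions: list of equal-length lists of 0/1 flags, one list per model.
    Returns one list of 0/1 flags. Assumed fusion rule, not the paper's.
    """
    if not decisions:
        raise ValueError("need at least one model output")
    n_frames = len(decisions[0])
    if any(len(d) != n_frames for d in decisions):
        raise ValueError("all model outputs must cover the same frames")
    # A frame counts as speech when more than half of the models say so.
    threshold = len(decisions) / 2
    return [1 if sum(frame) > threshold else 0 for frame in zip(*decisions)]


if __name__ == "__main__":
    model_a = [1, 1, 0, 0]
    model_b = [1, 0, 0, 1]
    model_c = [1, 1, 0, 0]
    print(majority_vote([model_a, model_b, model_c]))
```

With an odd number of models, such as three, the vote never ties, which is one practical reason to fuse an odd count.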
Abstract
This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection, where each procedure has a fused output from 3 sub-models. Finally, we fuse different clustering-based and TSVAD-based diarization systems using DOVER-Lap and achieve a 4.30% diarization error rate (DER), which ranks first on track 4 of the challenge leaderboard.
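To make the reported metric concrete, here is a minimal frame-level sketch of DER: the sum of missed speech, false alarm, and speaker confusion, divided by the total reference speech. The function name `frame_der` and the set-per-frame input format are assumptions for illustration. The sketch ignores forgiveness collars and any other scoring conventions the challenge may apply, and it finds the speaker mapping by brute force, so it only suits a handful of speakers.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level diarization error rate.

    reference, hypothesis: lists of sets, one set of active speaker
    labels per frame. Hypothesis labels are mapped one-to-one onto
    reference labels so that the number of correct frames is maximal.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    total_ref = sum(len(r) for r in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(reference, hypothesis))
    miss = sum(max(0, len(r) - len(h)) for r, h in pairs)
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in pairs)

    ref_labels = sorted(set().union(*reference))
    hyp_labels = sorted(set().union(*hypothesis))
    # Pad with None so that surplus hypothesis speakers can stay unmatched.
    n = max(len(ref_labels), len(hyp_labels))
    padded_ref = ref_labels + [None] * (n - len(ref_labels))

    best_correct = 0
    for perm in permutations(padded_ref, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(len({mapping[s] for s in h} & r) for r, h in pairs)
        best_correct = max(best_correct, correct)

    confusion = sum(min(len(r), len(h)) for r, h in pairs) - best_correct
    return (miss + false_alarm + confusion) / total_ref


if __name__ == "__main__":
    ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set(), {"B"}]
    hyp = [{"x"}, {"x"}, {"x"}, {"x"}, {"y"}, {"y"}]
    print(f"DER = {frame_der(ref, hyp):.2%}")
```

In the toy example, one overlapped frame is missed, one silent frame is a false alarm, and one frame is attributed to the wrong speaker, out of six reference speaker-frames.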
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
