The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Ming Cheng; Weiqing Wang; Xiaoyi Qin; Yuke Lin; Ning Jiang; Guoqing Zhao; Ming Li

arXiv:2308.07595·eess.AS·August 21, 2023

TL;DR

This paper presents the DKU-MSXF diarization system for track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). It fuses several clustering-based and TSVAD-based diarization systems with DOVER-Lap and reaches a 4.30% diarization error rate (DER), ranking first on the track 4 leaderboard.

Contribution

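The paper describes a multi-stage diarization pipeline in which each stage produces a fused output from three sub-models, followed by a final DOVER-Lap fusion of the resulting clustering-based and TSVAD-based systems. The submission ranked first on track 4 of the challenge.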

Findings

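01. Achieved a 4.30% DER, ranking first on track 4 of the challenge leaderboard.

02. Fused clustering-based and TSVAD-based diarization systems using DOVER-Lap.

03. The pipeline comprises voice activity detection (VAD), clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection (TSVAD).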
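
To make the idea of fusing several diarization outputs concrete, here is a minimal frame-level voting sketch in Python. It is not the DOVER-Lap algorithm and not the authors' implementation: it assumes every system's speaker labels have already been mapped to a shared label set (a step DOVER-Lap handles itself), weights all systems equally, and uses a hypothetical function name, `fuse_frames`. For each frame it estimates the number of active speakers as the rounded mean of the per-system counts, then keeps the speakers with the most votes.

```python
from collections import Counter
from typing import List, Set

Frames = List[Set[str]]  # one set of active speaker labels per frame


def fuse_frames(systems: List[Frames]) -> Frames:
    """Fuse frame-level diarization outputs by overlap-aware voting.

    Assumption: labels are already aligned across systems, and all
    systems cover the same number of frames.
    """
    if not systems:
        raise ValueError("need at least one system")
    n_frames = len(systems[0])
    if any(len(s) != n_frames for s in systems):
        raise ValueError("all systems must have the same number of frames")

    fused: Frames = []
    for t in range(n_frames):
        # Count how many systems marked each speaker active in frame t.
        votes = Counter()
        for s in systems:
            votes.update(s[t])
        # Estimated speaker count: mean across systems, rounded half up.
        mean_count = sum(len(s[t]) for s in systems) / len(systems)
        k = int(mean_count + 0.5)
        # Keep the k most-voted speakers; ties break alphabetically.
        ranked = sorted(votes, key=lambda spk: (-votes[spk], spk))
        fused.append(set(ranked[:k]))
    return fused


if __name__ == "__main__":
    sys1 = [{"A"}, {"A", "B"}, set()]
    sys2 = [{"A"}, {"A"}, {"B"}]
    sys3 = [{"B"}, {"A", "B"}, set()]
    print(fuse_frames([sys1, sys2, sys3]))
```

Because the speaker count is estimated per frame, the fused output can keep two speakers in an overlapped frame instead of forcing a single winner, which is the property that matters when overlapped speech is scored.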

Abstract

This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection (TSVAD), where each procedure has a fused output from three sub-models. Finally, we fuse different clustering-based and TSVAD-based diarization systems using DOVER-Lap and achieve a 4.30% diarization error rate (DER), which ranks first on track 4 of the challenge leaderboard.
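
The DER quoted above is conventionally defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The Python sketch below computes it at the frame level under simplifying assumptions: hypothesis labels are already mapped to reference labels, no forgiveness collar is applied, and all frames have equal duration. The function name `frame_der` is hypothetical, and this is not the challenge's official scoring tool.

```python
from typing import List, Set

Frames = List[Set[str]]  # one set of active speaker labels per frame


def frame_der(ref: Frames, hyp: Frames) -> float:
    """Frame-level diarization error rate.

    Assumption: hyp labels are already mapped onto ref labels, so a
    speaker counts as correct only if the same label appears in both.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")

    miss = false_alarm = confusion = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = len(r), len(h)
        n_correct = len(r & h)
        # Reference speakers with no hypothesis speaker to match them.
        miss += max(0, n_ref - n_hyp)
        # Hypothesis speakers beyond the reference speaker count.
        false_alarm += max(0, n_hyp - n_ref)
        # Matched slots that carry the wrong speaker label.
        confusion += min(n_ref, n_hyp) - n_correct
        total += n_ref

    if total == 0:
        raise ValueError("reference contains no speech")
    return (miss + false_alarm + confusion) / total


if __name__ == "__main__":
    ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
    hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
    # One confusion, one miss, one false alarm over five reference
    # speaker-frames.
    print(frame_der(ref, hyp))  # 0.6
```

Note that overlapped frames contribute more than one unit to the denominator, so a system that ignores overlap accumulates missed speech even when its single-speaker decisions are correct.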

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing