Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020
Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, Gang Liu, Yu Wu, Jian Wu, Shujie Liu, Jinyu Li, Yifan Gong

TL;DR
This paper presents Microsoft's speaker diarization system for multi-talker recordings, which ranked 1st in the diarization track of the VoxCeleb Speaker Recognition Challenge 2020 by integrating components such as Res2Net-based speaker embeddings and conformer-based speech separation.
Contribution
The paper introduces a novel diarization system combining Res2Net embeddings, conformer-based separation, and a modified DOVER method, tailored for real-world multi-talker audio in challenging conditions.
Findings
Achieved 3.71% DER on development set
Achieved 6.23% DER on evaluation set
Ranked 1st in the diarization track of the challenge
Abstract
This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain our system design, which addresses issues in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems on the data set provided by the VoxSRC 2020 challenge, which contains real-life multi-talker audio collected from YouTube. Our best system achieves diarization error rates (DER) of 3.71% and 6.23% on the development set and evaluation set, respectively, being ranked the 1st at the diarization…
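For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a diarization error rate can be computed from frame-level labels: missed speech, false alarms, and speaker confusion are summed and divided by the total reference speech. It is an illustration, not the challenge's scoring tool. The function name `frame_der`, the frame-level representation, and the brute-force speaker mapping are all assumptions made for this example, and scoring details such as collars around segment boundaries are not modelled.

```python
from itertools import permutations


def frame_der(ref, hyp):
    """Frame-level DER sketch (hypothetical helper, not the official scorer).

    ref, hyp: equal-length lists of sets; each set holds the speaker labels
    active in that frame (an empty set means silence).
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same number of frames")

    ref_spk = sorted(set().union(*ref))
    hyp_spk = sorted(set().union(*hyp))
    total_ref = sum(len(r) for r in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    # Try every one-to-one mapping of hypothesis speakers onto reference
    # speakers (None = left unmatched). Brute force: small inputs only.
    candidates = ref_spk + [None] * len(hyp_spk)
    best_errors = None
    for perm in permutations(candidates, len(hyp_spk)):
        mapping = {
            h: (r if r is not None else ("unmatched", h))
            for h, r in zip(hyp_spk, perm)
        }
        errors = 0
        for r, h in zip(ref, hyp):
            mapped = {mapping[s] for s in h}
            correct = len(r & mapped)
            # Covers miss, false alarm, and confusion in a single term.
            errors += max(len(r), len(h)) - correct
        if best_errors is None or errors < best_errors:
            best_errors = errors

    return best_errors / total_ref


# Toy example: one confused frame and one false-alarm frame out of
# four frames of reference speech.
ref = [{"A"}, {"A"}, {"B"}, {"B"}, set()]
hyp = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}]
print(frame_der(ref, hyp))  # 0.5
```

In the toy example the best mapping pairs `x` with `A` and `y` with `B`, leaving two erroneous frames against four frames of reference speech.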
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
