Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Xiong Xiao; Naoyuki Kanda; Zhuo Chen; Tianyan Zhou; Takuya Yoshioka; Sanyuan Chen; Yong Zhao; Gang Liu; Yu Wu; Jian Wu; Shujie Liu; Jinyu Li; Yifan Gong

arXiv:2010.11458·eess.AS·October 26, 2020·6 cites

Open Access

TL;DR

This paper presents Microsoft's speaker diarization system for monaural multi-talker recordings, which ranked 1st in the VoxCeleb Speaker Recognition Challenge 2020. The system combines a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation, and a modified DOVER method for system fusion.

Contribution

The paper describes a diarization system that combines a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER method for system fusion, designed for real-life monaural multi-talker audio collected from YouTube.

Findings

01

Achieved 3.71% DER on development set

02

Achieved 6.23% DER on evaluation set

03

Ranked 1st in the challenge

Abstract

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain our system design, which addresses issues in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems on the data set provided by the VoxSRC challenge 2020, which contains real-life multi-talker audio collected from YouTube. Our best system achieves diarization error rates (DER) of 3.71% and 6.23% on the development set and evaluation set, respectively, being ranked the 1st at the diarization…
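For readers unfamiliar with the metric quoted in the abstract, DER is commonly defined as the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below illustrates only that standard definition. The function name and the example durations are hypothetical and are not taken from the paper, and the challenge's exact scoring settings (for example, any forgiveness collar or overlap handling) are not covered here.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference_speech):
    """Return DER as a fraction, given error durations in seconds.

    missed: reference speech that the system labeled as non-speech
    false_alarm: non-speech that the system labeled as speech
    confusion: speech attributed to the wrong speaker
    total_reference_speech: total speaker time in the reference annotation
    """
    if total_reference_speech <= 0:
        raise ValueError("total_reference_speech must be positive")
    return (missed + false_alarm + confusion) / total_reference_speech


# Hypothetical durations in seconds, chosen only for illustration.
der = diarization_error_rate(
    missed=20.0,
    false_alarm=10.0,
    confusion=30.0,
    total_reference_speech=1200.0,
)
print(f"DER = {100 * der:.2f}%")
```

Because false alarms are counted in the numerator but not in the denominator, DER computed this way can exceed 100% for a system that hallucinates a lot of speech.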

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing