VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge
Andrew Brown, Jaesung Huh, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

TL;DR
The VoxSRC 2021 challenge evaluated current speaker recognition and diarisation methods on unconstrained YouTube data, emphasizing multi-lingual capabilities and providing standardized datasets, baselines, and evaluation protocols.
Contribution
This paper introduces the third VoxCeleb Speaker Recognition Challenge, including a new multi-lingual focus, standardised datasets, and baseline systems for benchmarking speaker recognition and diarisation in the wild.
Findings
Baseline systems achieved competitive performance.
Multi-lingual data posed new challenges for recognition.
Progress since previous editions shows improved robustness.
Abstract
The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or 'in the wild' data. The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2021. This paper outlines the challenge, and describes the baselines, methods and results. We conclude with a discussion on the new multi-lingual focus of VoxSRC 2021, and on the progression of the challenge since the previous two editions.
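The abstract mentions standardised evaluation software but does not describe the metrics it computes. Speaker verification is commonly scored with the equal error rate (EER): the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a simplified illustration of that metric only, not the challenge's official scoring code. It assumes higher scores mean "same speaker", labels of 1 for target (same-speaker) trials and 0 for non-target trials, and a hypothetical function name `equal_error_rate`.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate the EER by sweeping a decision threshold over observed scores.

    Hypothetical helper for illustration. labels: 1 = target trial, 0 = non-target.
    A trial is accepted when its score >= threshold.
    """
    pairs = list(zip(scores, labels))
    n_target = sum(1 for _, l in pairs if l == 1)
    n_nontarget = len(pairs) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need both target and non-target trials")

    best_gap, eer = float("inf"), 1.0
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    for t in sorted(set(scores)) + [float("inf")]:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(1 for s, l in pairs if l == 1 and s < t) / n_target
        # False acceptance rate: non-target trials scored at or above the threshold.
        far = sum(1 for s, l in pairs if l == 0 and s >= t) / n_nontarget
        gap = abs(frr - far)
        # Keep the threshold where the two error rates are closest; report their mean.
        if gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer
```

For example, `equal_error_rate([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])` gives 0.0 because the two trial types are perfectly separated. Real scoring tools usually interpolate between thresholds and also report detection-cost metrics, which this sketch omits.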
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
