VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge

Andrew Brown; Jaesung Huh; Joon Son Chung; Arsha Nagrani; Daniel Garcia-Romero; Andrew Zisserman

arXiv:2201.04583 · cs.SD · November 17, 2022 · 34 cites

TL;DR

The VoxSRC 2021 challenge evaluated current speaker recognition and diarisation methods on unconstrained YouTube data, with a new multi-lingual focus, and provided standardised datasets, baselines, and evaluation protocols.

Contribution

This paper introduces the third VoxCeleb Speaker Recognition Challenge, including new multi-lingual focus, standardized datasets, and baseline systems for benchmarking speaker recognition in the wild.

Findings

01. Baseline systems achieved competitive performance.

02. Multi-lingual data posed new challenges for recognition.

03. Progress since previous editions shows improved robustness.

Abstract

The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or 'in the wild' data. The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2021. This paper outlines the challenge, and describes the baselines, methods and results. We conclude with a discussion on the new multi-lingual focus of VoxSRC 2021, and on the progression of the challenge since the previous two editions.

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques