VoxCeleb2: Deep Speaker Recognition
Joon Son Chung, Arsha Nagrani, Andrew Zisserman

TL;DR
This paper introduces VoxCeleb2, a large-scale audio-visual dataset for speaker recognition, and develops CNN models that outperform previous methods in recognizing speakers under noisy, real-world conditions.
Contribution
The paper contributes the VoxCeleb2 dataset, curated with a fully automated pipeline, along with CNN models and training strategies that improve speaker recognition accuracy under noisy, unconstrained conditions.
Findings
VoxCeleb2 contains over a million utterances from 6,000+ speakers.
CNN models trained on VoxCeleb2 surpass previous work on a benchmark dataset by a significant margin.
Models show robustness in noisy, unconstrained conditions.
Abstract
The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.
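The abstract does not say how a trained model decides whether two utterances come from the same speaker. One common approach, assumed here purely for illustration, is to map each utterance to a fixed-length embedding and compare embeddings by cosine similarity against a threshold. The sketch below uses hand-written toy vectors in place of CNN outputs; the function names, the vectors, and the threshold of 0.7 are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the pair as one speaker if similarity reaches the threshold.

    The threshold is an illustrative assumption; in practice it would be
    tuned on held-out verification pairs.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings standing in for the output of a speaker model.
enrolled = [0.9, 0.1, 0.4]
probe_same = [0.8, 0.2, 0.5]     # points in nearly the same direction
probe_other = [-0.3, 0.9, 0.1]   # points in a very different direction

print(same_speaker(enrolled, probe_same))
print(same_speaker(enrolled, probe_other))
```

Cosine similarity ignores vector magnitude, so the decision depends only on the direction of the embeddings, which is one reason it is a popular choice for this kind of comparison.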
