VoxCeleb2: Deep Speaker Recognition

Joon Son Chung; Arsha Nagrani; Andrew Zisserman

arXiv:1806.05622 · cs.SD · November 5, 2020

TL;DR

This paper introduces VoxCeleb2, a large-scale audio-visual dataset for speaker recognition, and develops CNN models that outperform previous methods in recognizing speakers under noisy, real-world conditions.

Contribution

The paper contributes the VoxCeleb2 dataset, together with CNN models and training strategies that improve speaker recognition accuracy under noisy, unconstrained conditions.

Findings

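01. VoxCeleb2 contains over a million utterances from over 6,000 speakers, making it several times larger than any other publicly available speaker recognition dataset.

02. CNN models trained on VoxCeleb2 surpass previous work on a benchmark dataset by a significant margin.

03. The models recognise identities from voice effectively under noisy, unconstrained conditions.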

Abstract

The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.

Code & Models

Datasets

gaunernst/voxceleb2-dev-wds
dataset· 376 dl
