VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark

Yuke Lin; Ming Cheng; Fulin Zhang; Yingying Gao; Shilei Zhang; Ming Li

arXiv:2407.11510 · eess.AS · July 17, 2024 · Interspeech

TL;DR

VoxBlink2 is a large-scale audio-visual speaker recognition dataset with approximately 10 million utterances from 110,000+ speakers. The authors report single-model state-of-the-art verification results on VoxCeleb1-O and establish a new open-set speaker-identification benchmark.

Contribution

The paper presents VoxBlink2, which expands the earlier VoxBlink dataset to approximately 10M utterances from 110K+ speakers, and establishes a new open-set speaker identification benchmark with novel evaluation protocols.

Findings

01

Achieved a new single-model state-of-the-art on the VoxCeleb1-O test set: EER of 0.170% and minDCF of 0.006%.

02

Examined how training strategies, data scale, and model complexity affect speaker verification performance.

03

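Proposed the Open-Set Speaker-Identification task, in which a probe utterance is either matched to a known gallery speaker or classified as unknown, together with a benchmark for it.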
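
The EER (equal error rate) quoted for VoxCeleb1-O is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of that metric only, not the paper's evaluation code: the function name `compute_eer` and the toy scores are hypothetical, and the threshold sweep is a simple approximation that picks the observed score where the two error rates are closest.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate by sweeping over observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    Returns (eer, threshold). Hypothetical helper for illustration only.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up): one target trial falls below one non-target trial.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On real trial lists the sweep would run over many thousands of scores, and evaluation toolkits typically interpolate between thresholds rather than picking the nearest observed score.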

Abstract

In this paper, we provide a large audio-visual speaker recognition dataset, VoxBlink2, which includes approximately 10M utterances with videos from 110K+ speakers in the wild. This dataset represents a significant expansion over the VoxBlink dataset, encompassing a broader diversity of speakers and scenarios thanks to an optimized data collection pipeline. Afterward, we explore the impact of training strategies, data scale, and model complexity on speaker verification and finally establish a new single-model state-of-the-art EER at 0.170% and minDCF at 0.006% on the VoxCeleb1-O test set. Such remarkable results motivate us to explore speaker recognition from a new challenging perspective. We propose the Open-Set Speaker-Identification task, which is designed to either match a probe utterance with a known gallery speaker or categorize it as an unknown query. Associated with this…
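
The open-set decision described in the abstract (match a probe to a known gallery speaker, or reject it as unknown) can be illustrated with a nearest-neighbour search plus a rejection threshold. This is a sketch under stated assumptions, not the paper's protocol or the WeSpeaker API: the function names, the use of cosine similarity, the two-dimensional toy embeddings, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_open_set(probe, gallery, threshold=0.5):
    """Return (speaker_id, score), or (None, score) if the probe is rejected.

    probe: embedding of the query utterance.
    gallery: dict mapping speaker id -> enrollment embedding.
    threshold: assumed rejection threshold; in practice it would be tuned.
    """
    best_id, best_score = None, -1.0
    for speaker_id, embedding in gallery.items():
        score = cosine(probe, embedding)
        if score > best_score:
            best_id, best_score = speaker_id, score
    # Open-set step: even the best match is rejected if it scores too low.
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Toy gallery with made-up embeddings.
gallery = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(identify_open_set([0.9, 0.1], gallery))    # close to "alice"
print(identify_open_set([-1.0, -1.0], gallery))  # far from both, rejected
```

The threshold trades off false alarms on unknown speakers against missed identifications of enrolled ones, which is why an open-set benchmark has to score both kinds of error.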

Code & Models

Repositories

wenet-e2e/wespeaker
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing