VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark
Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li

TL;DR
VoxBlink2 is a large-scale audio-visual speaker recognition dataset with approximately 10 million utterances from more than 110,000 speakers. Models trained on it reach state-of-the-art verification performance on VoxCeleb1-O, and the paper uses it to establish a new open-set speaker identification benchmark.
Contribution
The paper presents VoxBlink2, a significantly expanded speaker recognition dataset, and establishes a new open-set speaker identification benchmark with novel evaluation protocols.
Findings
Achieved a new single-model state-of-the-art EER of 0.170% on the VoxCeleb1-O test set.
Demonstrated the impact of training strategies and data scale on verification performance.
Proposed the open-set speaker identification task and benchmark.
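For readers unfamiliar with the EER figure quoted in the findings, the sketch below shows one common way to approximate an equal error rate from verification scores. It is an illustration only and not the paper's evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy score lists are all assumptions made for this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: scores for same-speaker trials (hypothetical data).
    nontarget_scores: scores for different-speaker trials (hypothetical data).
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.3f}")
```

Production toolkits usually interpolate along the ROC curve rather than picking the nearest observed threshold, so values from this sketch can differ slightly from reported ones.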
Abstract
In this paper, we provide a large audio-visual speaker recognition dataset, VoxBlink2, which includes approximately 10M utterances with videos from 110K+ speakers in the wild. This dataset represents a significant expansion over the VoxBlink dataset, encompassing a broader diversity of speakers and scenarios by the grace of an optimized data collection pipeline. Afterward, we explore the impact of training strategies, data scale, and model complexity on speaker verification and finally establish a new single-model state-of-the-art EER at 0.170% and minDCF at 0.006% on the VoxCeleb1-O test set. Such remarkable results motivate us to explore speaker recognition from a new challenging perspective. We raise the Open-Set Speaker-Identification task, which is designed to either match a probe utterance with a known gallery speaker or categorize it as an unknown query. Associated with this…
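The open-set speaker-identification decision described in the abstract (match a probe to a known gallery speaker, or reject it as unknown) can be sketched as a nearest-neighbor search followed by a threshold. This is a minimal sketch under stated assumptions, not the benchmark's actual protocol: the cosine scoring, the `identify` function, the 0.5 threshold, and the three-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(probe, gallery, threshold=0.5):
    """Return (speaker_id, score); speaker_id is None for an unknown probe.

    gallery maps speaker ids to enrollment embeddings (hypothetical data).
    threshold is an assumed operating point, not a value from the paper.
    """
    best_id, best_score = None, -1.0
    for speaker_id, embedding in gallery.items():
        score = cosine(probe, embedding)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        # No gallery speaker is similar enough: treat the probe as unknown.
        return None, best_score
    return best_id, best_score


# Toy embeddings, invented for illustration only.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
known = identify([0.9, 0.1, 0.0], gallery)
unknown = identify([0.0, 0.0, 1.0], gallery)
print(known[0], unknown[0])
```

The threshold trades off two kinds of error: set too low, unknown speakers are wrongly matched to gallery entries; set too high, enrolled speakers are rejected as unknown.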
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
