Training speaker recognition systems with limited data
Nik Vaessen, David A. van Leeuwen

TL;DR
This paper investigates training speaker recognition neural networks with limited data, showing that a pre-trained wav2vec2 model substantially outperforms the X-vector and ECAPA-TDNN architectures under data constraints.
Contribution
The paper introduces three restricted VoxCeleb2 subsets, each limited to 50k audio files, and compares the X-vector, ECAPA-TDNN, and wav2vec2 architectures on them, highlighting the effectiveness of self-supervised pre-training in low-data scenarios.
Findings
Wav2vec2 with pre-trained weights outperforms other models on limited data.
Reduced datasets still enable effective speaker recognition with proper pre-training.
Self-supervised pre-training is crucial for low-resource speaker recognition.
Abstract
This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work. We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset. These subsets are restricted to 50k audio files (versus over 1M files available), and vary on the axis of number of speakers and session variability. We train three speaker recognition systems on these subsets: the X-vector, ECAPA-TDNN, and wav2vec2 network architectures. We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited. Code and data subsets are available at https://github.com/nikvaessen/w2v2-speaker-few-samples.
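To make the idea of a fixed file budget that varies in speaker count and session variability concrete, here is a minimal sketch in Python. It is not the authors' selection procedure: the function name `sample_subset`, the `(speaker_id, session_id, path)` tuple layout, and the toy corpus are all assumptions made for illustration. The only thing taken from the abstract is the principle that the total number of files stays constant while the number of speakers and sessions changes.

```python
import random
from collections import defaultdict


def sample_subset(files, n_speakers, sessions_per_speaker, files_per_session, seed=0):
    """Pick n_speakers * sessions_per_speaker * files_per_session files.

    `files` is an iterable of (speaker_id, session_id, path) tuples
    (a hypothetical layout, not the format used in the paper's repository).
    """
    rng = random.Random(seed)

    # Group file paths by speaker, then by recording session.
    by_speaker = defaultdict(lambda: defaultdict(list))
    for speaker, session, path in files:
        by_speaker[speaker][session].append(path)

    # A speaker is eligible if enough of its sessions hold enough files.
    eligible = {}
    for speaker, sessions in by_speaker.items():
        usable = sorted(s for s, p in sessions.items() if len(p) >= files_per_session)
        if len(usable) >= sessions_per_speaker:
            eligible[speaker] = usable
    if len(eligible) < n_speakers:
        raise ValueError("not enough speakers satisfy the session/file constraints")

    subset = []
    for speaker in rng.sample(sorted(eligible), n_speakers):
        for session in rng.sample(eligible[speaker], sessions_per_speaker):
            paths = sorted(by_speaker[speaker][session])
            for path in rng.sample(paths, files_per_session):
                subset.append((speaker, session, path))
    return subset


# Toy corpus (illustrative only): 20 speakers x 5 sessions x 4 files = 400 files.
corpus = [
    (f"spk{s:02d}", f"sess{v}", f"spk{s:02d}/sess{v}/{u}.wav")
    for s in range(20)
    for v in range(5)
    for u in range(4)
]

# Two subsets with the same budget of 40 files but different composition.
many_speakers = sample_subset(corpus, n_speakers=20, sessions_per_speaker=1, files_per_session=2)
few_speakers = sample_subset(corpus, n_speakers=5, sessions_per_speaker=4, files_per_session=2)
```

Both subsets contain the same number of files, so any difference in downstream performance can be attributed to how the budget is spread over speakers and sessions rather than to the amount of data. Scaling the three factors so that their product is 50,000 would give a budget matching the one described in the abstract.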
Taxonomy
Topics: Speech Recognition and Synthesis
