Training speaker recognition systems with limited data

Nik Vaessen; David A. van Leeuwen

arXiv:2203.14688·cs.SD·February 28, 2023

TL;DR

This paper investigates training speaker recognition neural networks on limited data (50k audio files drawn from VoxCeleb2) and shows that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance under these constraints.

Contribution

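The paper introduces three VoxCeleb2 subsets, each restricted to 50k audio files and differing in number of speakers and session variability, and compares the X-vector, ECAPA-TDNN, and wav2vec2 architectures trained on them, highlighting the effectiveness of self-supervised pre-training in low-data scenarios.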

Findings

01. Wav2vec2 with pre-trained weights outperforms other models on limited data.

02. Reduced datasets still enable effective speaker recognition with proper pre-training.

03. Self-supervised pre-training is crucial for low-resource speaker recognition.

Abstract

This work considers training neural networks for speaker recognition with a much smaller dataset than is used in contemporary work. We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset. These subsets are restricted to 50k audio files (versus over 1M files available), and vary on the axis of number of speakers and session variability. We train three speaker recognition systems on these subsets: the X-vector, ECAPA-TDNN, and wav2vec2 network architectures. We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited. Code and data subsets are available at https://github.com/nikvaessen/w2v2-speaker-few-samples.
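To make the idea of a fixed file budget that varies along the speaker and session axes concrete, here is a minimal sketch in Python. It is not the authors' selection procedure (the actual subsets are published in the repository linked above). The function name `select_subset`, its parameters, and the `(speaker_id, session_id, path)` tuple layout are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def select_subset(files, num_speakers, sessions_per_speaker, budget, seed=0):
    """Pick roughly `budget` files spread evenly over `num_speakers` speakers.

    `files` is a list of (speaker_id, session_id, path) tuples.
    Hypothetical helper: an illustration, not the paper's method.
    """
    rng = random.Random(seed)

    # Group file paths by speaker, then by session.
    by_speaker = defaultdict(lambda: defaultdict(list))
    for speaker, session, path in files:
        by_speaker[speaker][session].append(path)

    # Draw the speakers, then split the file budget evenly between them.
    speakers = rng.sample(sorted(by_speaker), num_speakers)
    files_per_speaker = budget // num_speakers

    subset = []
    for speaker in speakers:
        sessions = sorted(by_speaker[speaker])
        # Limit session variability by only drawing from a few sessions.
        chosen = rng.sample(sessions, min(sessions_per_speaker, len(sessions)))
        pool = [(speaker, s, p) for s in chosen for p in by_speaker[speaker][s]]
        rng.shuffle(pool)
        subset.extend(pool[:files_per_speaker])
    return subset


# Synthetic example: 10 speakers, 5 sessions each, 4 files per session.
toy_files = [
    (f"spk{i}", f"sess{j}", f"spk{i}/sess{j}/{k}.wav")
    for i in range(10)
    for j in range(5)
    for k in range(4)
]

# Same budget of 20 files, two different trade-offs.
few_speakers = select_subset(toy_files, num_speakers=4, sessions_per_speaker=2, budget=20)
many_speakers = select_subset(toy_files, num_speakers=10, sessions_per_speaker=1, budget=20)
```

With the budget held constant, raising `num_speakers` lowers the number of files per speaker, and lowering `sessions_per_speaker` reduces the variety of recording conditions seen for each speaker. These are the two axes along which the abstract says the subsets differ.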

Code & Models

Repositories

nikvaessen/w2v2-speaker-few-samples (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis