Study on Inter and Intra Speaker Variability in Speaker Recognition
Anton Okhotnikov, Nikita Torgashov, Ivan Yakovlev, Pavel Malov, and Rostislav Makarov

TL;DR
This paper analyzes how inter- and intra-speaker variability in training data affects neural network-based speaker recognition systems, offers guidance for data collection and filtering, and releases per-utterance upload date metadata for the VoxTube dataset.
Contribution
The paper analyzes the dependency between inter- and intra-speaker variability in training data and releases per-utterance upload date metadata for the VoxTube dataset to inform data collection practices.
Findings
Dependency between inter- and intra-speaker variability analyzed
Release of per-utterance upload date metadata for the VoxTube dataset
Guidelines for data collection and filtering
Abstract
Optimizing the trade-off between the number of speakers and their temporal variability (or session diversity) is crucial for developing a speaker recognition system while keeping the data collection process feasible from a time perspective. In this article, we analyze the dependency between inter- and intra-speaker variability in the training data of a modern neural network-based speaker recognition system, using the VoxTube dataset for the text-independent speaker recognition task. An auxiliary contribution of this work is the release of per-utterance upload date metadata for the VoxTube dataset. We hope this article contributes to guidelines and best practices for collecting and filtering data from media hosting platforms, and thereby facilitates researchers' efforts in developing speaker recognition systems.
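As an illustration only, the following minimal sketch shows one way per-utterance upload dates could serve as a rough proxy for a speaker's temporal variability when filtering collected data. It is not the authors' method: the record layout (speaker ID, upload date), the function names `upload_span_days` and `filter_speakers`, the threshold, and the toy values are all assumptions made for this example and do not reflect the actual VoxTube metadata format.

```python
from collections import defaultdict
from datetime import date


def upload_span_days(records):
    """Map each speaker to the days between their earliest and latest upload.

    `records` is an iterable of (speaker_id, upload_date) pairs; this layout
    is a hypothetical one chosen for the example.
    """
    dates_by_speaker = defaultdict(list)
    for speaker_id, upload_date in records:
        dates_by_speaker[speaker_id].append(upload_date)
    return {
        speaker: (max(dates) - min(dates)).days
        for speaker, dates in dates_by_speaker.items()
    }


def filter_speakers(records, min_span_days):
    """Return the sorted IDs of speakers whose uploads span at least `min_span_days`."""
    spans = upload_span_days(records)
    return sorted(s for s, span in spans.items() if span >= min_span_days)


# Hypothetical toy records, not taken from VoxTube.
records = [
    ("spk_a", date(2019, 1, 1)),
    ("spk_a", date(2021, 1, 1)),
    ("spk_b", date(2020, 5, 1)),
    ("spk_b", date(2020, 5, 20)),
]

spans = upload_span_days(records)
kept = filter_speakers(records, min_span_days=365)
```

Raising `min_span_days` in this sketch keeps fewer speakers but with longer recording histories, which is one simple way to explore the trade-off between speaker count and session diversity described in the abstract.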
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Algorithms and Applications
