VoxSim: A perceptual voice similarity dataset
Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung

TL;DR
VoxSim is a new dataset of nearly 70,000 perceptual voice similarity ratings, collected through a listening test on about 41,000 utterance pairs drawn from VoxCeleb. It supports the training and benchmarking of speaker similarity prediction models for speech synthesis evaluation.
Contribution
The paper introduces VoxSim, a large-scale perceptual voice similarity dataset with extensive ratings, filling a gap in training data for speaker similarity assessment.
Findings
Baseline speaker similarity prediction models are evaluated on the VoxSim test set.
A model trained on VoxSim generalizes to the out-of-domain VCC2018 dataset.
VoxSim facilitates benchmarking and development of speaker similarity models.
Abstract
This paper introduces VoxSim, a dataset of perceptual voice similarity ratings. Recent efforts to automate the assessment of speech synthesis technologies have primarily focused on predicting mean opinion score of naturalness, leaving speaker voice similarity relatively unexplored due to a lack of extensive training data. To address this, we generate about 41k utterance pairs from the VoxCeleb dataset, a widely utilised speech dataset for speaker recognition, and collect nearly 70k speaker similarity scores through a listening test. VoxSim offers a valuable resource for the development and benchmarking of speaker similarity prediction models. We provide baseline results of speaker similarity prediction models on the VoxSim test set and further demonstrate that the model trained on our dataset generalises to the out-of-domain VCC2018 dataset.
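With nearly 70k scores spread over about 41k pairs, some pairs necessarily carry more than one listener rating, so a per-pair target has to be formed before a predictor can be scored against it. The sketch below illustrates one plausible way to do that: average the scores per pair, then measure the Pearson correlation between predicted and averaged scores. Everything specific here is an assumption made for illustration and is not taken from the paper: the `(utt_a, utt_b, score)` record layout, the file names, the rating values, the choice to treat a pair as unordered, and the use of Pearson correlation as the metric.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(ratings):
    """Average all listener scores collected for the same utterance pair.

    `ratings` is an iterable of (utt_a, utt_b, score) records. This layout
    is hypothetical, not the dataset's actual file format.
    """
    by_pair = defaultdict(list)
    for utt_a, utt_b, score in ratings:
        # Assumption: a pair is unordered, so (a, b) and (b, a) are merged.
        key = tuple(sorted((utt_a, utt_b)))
        by_pair[key].append(score)
    return {pair: mean(scores) for pair, scores in by_pair.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy records with made-up file names and scores.
ratings = [
    ("a.wav", "b.wav", 4),
    ("b.wav", "a.wav", 5),  # second rating of the same pair
    ("a.wav", "c.wav", 1),
    ("c.wav", "d.wav", 3),
]
targets = aggregate_ratings(ratings)

# Hypothetical model outputs, keyed by the same sorted pairs.
predictions = {
    ("a.wav", "b.wav"): 0.9,
    ("a.wav", "c.wav"): 0.2,
    ("c.wav", "d.wav"): 0.6,
}

pairs = sorted(targets)
correlation = pearson(
    [predictions[p] for p in pairs],
    [targets[p] for p in pairs],
)
print(f"pairs: {len(pairs)}, Pearson r: {correlation:.3f}")
```

Averaging keeps one target per pair regardless of how many listeners rated it. A rank-based metric could be swapped in for `pearson` if only the ordering of pairs matters.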
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Sparse Evolutionary Training
