VoxSim: A perceptual voice similarity dataset

Junseok Ahn; Youkyum Kim; Yeunju Choi; Doyeop Kwak; Ji-Hoon Kim; Seongkyu Mun; Joon Son Chung

arXiv:2407.18505 · eess.AS · July 29, 2024 · Interspeech

TL;DR

VoxSim is a new dataset of nearly 70,000 perceptual voice similarity ratings, collected through a listening test on about 41,000 utterance pairs drawn from VoxCeleb. It supports the development and benchmarking of speaker similarity prediction models for speech synthesis evaluation.

Contribution

The paper introduces VoxSim, a large-scale dataset of perceptual voice similarity ratings that addresses the lack of training data for automatic speaker similarity assessment.

Findings

01

Baseline speaker similarity prediction models are trained and evaluated on the VoxSim test set.

02

A model trained on VoxSim generalises to the out-of-domain VCC2018 dataset.

03

VoxSim provides a resource for developing and benchmarking speaker similarity prediction models.

Abstract

This paper introduces VoxSim, a dataset of perceptual voice similarity ratings. Recent efforts to automate the assessment of speech synthesis technologies have primarily focused on predicting mean opinion score of naturalness, leaving speaker voice similarity relatively unexplored due to a lack of extensive training data. To address this, we generate about 41k utterance pairs from the VoxCeleb dataset, a widely utilised speech dataset for speaker recognition, and collect nearly 70k speaker similarity scores through a listening test. VoxSim offers a valuable resource for the development and benchmarking of speaker similarity prediction models. We provide baseline results of speaker similarity prediction models on the VoxSim test set and further demonstrate that the model trained on our dataset generalises to the out-of-domain VCC2018 dataset.
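The abstract implies that some utterance pairs receive more than one listener rating (about 41k pairs against nearly 70k scores). A minimal sketch of how such ratings might be aggregated and compared with a model's predictions is given below. Everything in it is an assumption made for illustration: the `(pair_id, score)` record layout, the toy values, the rating scale, and the choice of Pearson correlation as the metric are not taken from the paper or from the `kaistmm/voxsim_trainer` repository.

```python
from collections import defaultdict
from math import sqrt


def mean_score_per_pair(ratings):
    """Average all listener ratings given to each utterance pair."""
    buckets = defaultdict(list)
    for pair_id, score in ratings:
        buckets[pair_id].append(score)
    return {pair_id: sum(s) / len(s) for pair_id, s in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: (pair_id, listener score). Not real VoxSim ratings.
ratings = [
    ("pair_a", 5), ("pair_a", 4),
    ("pair_b", 2),
    ("pair_c", 1), ("pair_c", 2),
]
# Hypothetical model outputs for the same pairs.
predictions = {"pair_a": 4.2, "pair_b": 2.5, "pair_c": 1.4}

targets = mean_score_per_pair(ratings)
pair_ids = sorted(targets)
r = pearson(
    [targets[p] for p in pair_ids],
    [predictions[p] for p in pair_ids],
)
print(f"Pearson r = {r:.3f}")
```

Averaging per pair before computing the correlation keeps pairs with many ratings from dominating the comparison; a rank-based metric could be swapped in if only the ordering of pairs matters.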

Code & Models

Repositories

kaistmm/voxsim_trainer
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Sparse Evolutionary Training