Automatic Evaluation of Speaker Similarity

Deja Kamil; Sanchez Ariadna; Roth Julian; Cotescu Marius

arXiv:2207.00344 · cs.SD · July 4, 2022

Open Access

TL;DR

This paper presents an automatic, neural network-based method for evaluating speaker similarity in speech synthesis, aligning well with human perceptual scores and reducing the need for costly perceptual tests.

Contribution

The paper introduces an automatic evaluation approach for speaker similarity that builds on speaker verification models and correlates strongly with human perceptual scores.

Findings

01. Achieves 0.96 accuracy in predicting MUSHRA scores

02. Correlates with human scores with up to 0.78 Pearson correlation

03. Reduces reliance on perceptual evaluations for speaker similarity

Abstract

We introduce a new automatic evaluation method for speaker similarity assessment that is consistent with human perceptual scores. Modern neural text-to-speech models require a vast amount of clean training data, which is why many solutions switch from single-speaker models to solutions trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as faster creation of new voices, but also a new problem: speaker leakage, where the speaker identity of a synthesized example might not match that of the target speaker. Currently, the only way to discover this issue is through costly perceptual evaluations. In this work, we propose an automatic method for assessment of speaker similarity. For that purpose, we extend the recent work on speaker verification systems and evaluate how different metrics and speaker embedding models reflect Multiple…
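As a rough illustration of the kind of comparison the abstract describes, the sketch below scores synthesized utterances against a target speaker with cosine similarity between speaker embeddings, then measures how well those scores track listener ratings using the Pearson correlation. This is not the authors' method: cosine similarity is only one possible metric, the embeddings are random stand-ins for the output of a speaker verification model, and the function names, the embedding size of 192, and the listener ratings are hypothetical placeholders, not data from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))


def similarity_scores(target_embedding, synthesized_embeddings):
    """Score each synthesized utterance against the target speaker."""
    return [cosine_similarity(target_embedding, e) for e in synthesized_embeddings]


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Assumption: a real pipeline would obtain these vectors from a
    # speaker verification model; random vectors stand in for them here.
    target = rng.normal(size=192)
    noise_levels = [0.1, 0.5, 1.0, 2.0, 4.0]
    synthesized = [target + s * rng.normal(size=192) for s in noise_levels]

    # Hypothetical listener ratings on a 0-100 scale, invented for this demo.
    placeholder_ratings = [90.0, 75.0, 60.0, 40.0, 20.0]

    scores = similarity_scores(target, synthesized)
    print("cosine scores:", [round(s, 3) for s in scores])
    print("Pearson vs. placeholder ratings:", round(pearson(scores, placeholder_ratings), 3))
```

A high correlation in such a setup would suggest the automatic score ranks utterances similarly to listeners, which is the property the abstract claims for the proposed method; the numbers printed here say nothing about the paper's actual results.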

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing