Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

Michael Chinen; Jan Skoglund; Chandan K A Reddy; Alessandro Ragano; Andrew Hines

arXiv:2209.06358 · cs.SD · September 15, 2022

Open Access

TL;DR

This paper examines how metadata and the distribution of the dataset affect the performance of speech quality models on the VoiceMOS 2022 dataset, showing that metadata explains part of the variance in subjective ratings and that dataset imbalances affect how reliable the evaluation metrics are.

Contribution

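It introduces a speech quality model built on wav2vec 2.0 with additional metadata features (rater groups and system identifiers), reaching a system-level SRCC of 0.934 and an utterance-level SRCC of 0.877, and analyzes how the conditions of the dataset affect how the metrics should be interpreted.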
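To make "wav2vec 2.0 with metadata features" concrete, the snippet below is a minimal sketch, assuming the metadata is one-hot encoded and concatenated with a mean-pooled utterance embedding before a regression head. The encoding, the pooling, and every name here (`one_hot`, `mean_pool`, `build_features`, the group and system labels) are hypothetical illustrations, not the paper's implementation, and the toy frame vectors stand in for real wav2vec 2.0 outputs.

```python
from typing import List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode a categorical metadata value (e.g. a rater group) as a one-hot vector."""
    vec = [0.0] * len(vocabulary)
    vec[list(vocabulary).index(value)] = 1.0
    return vec


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level embeddings over time into a single utterance vector."""
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(len(frames[0]))]


def build_features(
    frames: Sequence[Sequence[float]],
    rater_group: str,
    system_id: str,
    rater_vocab: Sequence[str],
    system_vocab: Sequence[str],
) -> List[float]:
    """Concatenate the pooled audio embedding with one-hot metadata features (assumed scheme)."""
    return (
        mean_pool(frames)
        + one_hot(rater_group, rater_vocab)
        + one_hot(system_id, system_vocab)
    )


# Toy stand-in for wav2vec 2.0 frame outputs: 2 frames x 2 dims (real embeddings are much larger).
frames = [[0.5, 1.0], [1.5, 0.0]]
features = build_features(
    frames,
    "group_B",  # hypothetical rater-group label
    "sys03",    # hypothetical system identifier
    rater_vocab=["group_A", "group_B"],
    system_vocab=["sys01", "sys02", "sys03"],
)
print(features)  # [1.0, 0.5, 0.0, 1.0, 0.0, 0.0, 1.0]
```

A trained head (for example, a linear or small feed-forward layer) would then map this vector to a MOS estimate. Concatenation is just the simplest way to let the predictor condition on who rated the sample and which system produced it.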

Findings

01

Metadata improves speech quality prediction accuracy.

02

System-level metrics are affected by imbalance in the number of utterances per system.

03

Balanced datasets yield more reliable utterance-level metrics.
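One plausible reason unequal utterance counts matter at the system level: a correlation computed over per-system means weights every system equally, yet a mean estimated from a few utterances is much noisier than one estimated from many. The sketch below is a generic statistical illustration under that assumption, not the paper's analysis; `per_system_summary` is a hypothetical helper that reports each system's mean score, utterance count, and standard error of the mean.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def per_system_summary(
    systems: Sequence[str], scores: Sequence[float]
) -> Dict[str, Tuple[float, int, float]]:
    """Return (mean score, utterance count, standard error of the mean) per system.

    The standard error is stdev / sqrt(n). It is reported as infinity when a
    system has fewer than two utterances, since the spread cannot be estimated.
    """
    groups: Dict[str, List[float]] = defaultdict(list)
    for system, score in zip(systems, scores):
        groups[system].append(score)
    summary = {}
    for system, xs in groups.items():
        n = len(xs)
        se = stdev(xs) / sqrt(n) if n >= 2 else float("inf")
        summary[system] = (mean(xs), n, se)
    return summary


# Toy ratings (not from the dataset): same spread of scores, very different counts.
systems = ["A"] * 8 + ["B"] * 2
scores = [3, 4, 3, 4, 3, 4, 3, 4] + [3, 4]
summary = per_system_summary(systems, scores)
# A: mean 3.5, n = 8, SE about 0.19
# B: mean 3.5, n = 2, SE about 0.5
```

Both systems have the same mean, but B's mean is roughly 2.6 times less certain, and a system-level SRCC would still give it the same weight as A.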

Abstract

Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study examines how much of the variance in subjective ratings of speech quality can be explained by metadata and by the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features, including rater groups and system identifiers, and obtained competitive metrics: a Spearman rank correlation coefficient (SRCC) of 0.934 and an MSE of 0.088 at the system level, and 0.877 and 0.198 at the utterance level. Using data and metadata that were restricted or blinded in the test phase further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level…
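The abstract reports SRCC and MSE at both the utterance level and the system level. The sketch below shows one common way to compute them, assuming system-level scores are the per-system means of utterance-level predictions and labels (as is usual in MOS prediction benchmarks); the helper names and the toy data are hypothetical, and Spearman's coefficient is computed as the Pearson correlation of tie-averaged ranks.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(pred: Sequence[float], mos: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, mos)) / len(pred)


def system_level(
    systems: Sequence[str], pred: Sequence[float], mos: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Average utterance-level predictions and labels within each system."""
    groups: Dict[str, Tuple[List[float], List[float]]] = defaultdict(lambda: ([], []))
    for s, p, t in zip(systems, pred, mos):
        groups[s][0].append(p)
        groups[s][1].append(t)
    keys = sorted(groups)
    sys_pred = [sum(groups[k][0]) / len(groups[k][0]) for k in keys]
    sys_mos = [sum(groups[k][1]) / len(groups[k][1]) for k in keys]
    return sys_pred, sys_mos


# Toy data (not from the paper): 5 utterances from 3 systems.
systems = ["A", "A", "B", "B", "C"]
pred = [3.0, 3.5, 2.0, 2.5, 4.0]
mos = [3.6, 3.4, 1.8, 2.6, 4.2]

utt_srcc, utt_mse = srcc(pred, mos), mse(pred, mos)  # 0.9, about 0.092
sp, sm = system_level(systems, pred, mos)
sys_srcc, sys_mse = srcc(sp, sm), mse(sp, sm)        # 1.0, about 0.035
```

In this toy case the system-level SRCC is higher than the utterance-level one because averaging hides the within-system ranking error for system A, which is one reason the two levels of metrics can tell different stories.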

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders

Methods: Test