A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality

Alessandro Ragano; Emmanouil Benetos; Michael Chinen; Helard B. Martinez; Chandan K.A. Reddy; Jan Skoglund; Andrew Hines

arXiv:2204.02249 · eess.AS · November 27, 2023

Open Access

TL;DR

This paper compares various deep learning MOS predictors for speech synthesis quality, highlighting the importance of data and model features, and emphasizing that statistical metrics alone may not fully capture model effectiveness.

Contribution

It evaluates SSL-based and supervised MOS predictors, revealing SSL models' superior correlation and error metrics, and discusses the limitations of benchmarking solely on statistical performance.

Findings

01. SSL models outperform supervised models in correlation and error metrics

02. Data quality and cross-domain features significantly impact predictor performance

03. Benchmarking metrics may be biased by hidden data issues

Abstract

Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors, but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models achieve higher correlation and lower mean squared error than supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models, since potential issues hidden in the data could bias the evaluated performance.
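The two statistics named in the abstract, correlation and mean squared error, can be computed as in the minimal sketch below. It is only an illustration of the metrics, not the paper's evaluation code: the MOS labels, the two "predictors", and the function names are all hypothetical, and Pearson correlation is assumed as the correlation measure. The toy data also shows that the two metrics need not rank predictors the same way.

```python
import math


def mean_squared_error(pred, true):
    """Average squared difference between predicted and reference MOS."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson_correlation(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical listener MOS labels (1-5 scale); not taken from any dataset.
true_mos = [1.5, 2.0, 3.0, 4.0, 4.5]

# Hypothetical predictor A: tracks the labels exactly but with a constant +1 bias.
pred_a = [t + 1.0 for t in true_mos]

# Hypothetical predictor B: small, unsystematic errors around the labels.
pred_b = [1.8, 1.7, 3.2, 3.8, 4.6]

for name, pred in (("A", pred_a), ("B", pred_b)):
    corr = pearson_correlation(pred, true_mos)
    mse = mean_squared_error(pred, true_mos)
    print(f"predictor {name}: correlation={corr:.3f}, MSE={mse:.3f}")
```

In this toy case predictor A has the higher correlation while predictor B has the lower error, so a ranking depends on which metric is read. This is separate from the data-related biases the paper discusses, which no single summary statistic would reveal.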

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing