A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality
Alessandro Ragano, Emmanouil Benetos, Michael Chinen, Helard B. Martinez, Chandan K.A. Reddy, Jan Skoglund, Andrew Hines

TL;DR
This paper compares deep learning MOS predictors for speech synthesis quality, examining the role of the training data, the system type, and cross-domain features, and argues that statistical metrics alone are not sufficient to rank models.
Contribution
It evaluates SSL-based (wav2vec 2.0) and supervised (NISQA) MOS predictors on the VoiceMOS challenge dataset, finds that the SSL-based models achieve higher correlation and lower mean squared error, and discusses the limitations of benchmarking on statistical performance alone.
Findings
SSL models outperform supervised models in correlation and error metrics
Data quality and cross-domain features significantly impact predictor performance
Benchmarking metrics may be biased by hidden data issues
Abstract
Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors, but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models achieve the highest correlation and lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models since potential issues hidden in the data could bias the evaluated performances.
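The two statistics named in the abstract, correlation and mean squared error, can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that scores are compared both per utterance and per system, with system scores obtained by averaging utterance scores. The toy data and the helper names `mse`, `pearson`, and `system_level` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def mse(pred, true):
    """Mean squared error between predicted and ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson(x, y):
    """Pearson linear correlation coefficient, computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def system_level(system_ids, scores):
    """Average utterance scores per system; means are ordered by system id."""
    buckets = defaultdict(list)
    for sid, s in zip(system_ids, scores):
        buckets[sid].append(s)
    return [sum(buckets[k]) / len(buckets[k]) for k in sorted(buckets)]


# Hypothetical toy data: one system id, ground-truth MOS, and predicted MOS
# per utterance. These numbers are invented for illustration only.
systems = ["A", "A", "B", "B", "C", "C"]
true_mos = [4.0, 4.5, 3.0, 3.5, 2.0, 2.5]
pred_mos = [4.2, 4.1, 3.4, 3.0, 2.6, 2.4]

# Utterance-level metrics compare every prediction with its own label.
utt_mse = mse(pred_mos, true_mos)
utt_pcc = pearson(pred_mos, true_mos)

# System-level metrics compare per-system averages instead.
sys_true = system_level(systems, true_mos)
sys_pred = system_level(systems, pred_mos)
sys_mse = mse(sys_pred, sys_true)
sys_pcc = pearson(sys_pred, sys_true)

print(f"utterance-level: MSE={utt_mse:.3f}, PCC={utt_pcc:.3f}")
print(f"system-level:    MSE={sys_mse:.3f}, PCC={sys_pcc:.3f}")
```

On this toy data the system-level figures look better than the utterance-level ones because averaging cancels per-utterance errors. Summary statistics of this kind depend on how the data is grouped, which is one reason they may not be enough on their own to rank models.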
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
