Generalization Ability of MOS Prediction Networks
Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

TL;DR
This paper investigates how well mean opinion score (MOS) prediction networks generalize across listening test contexts. It shows that wav2vec2 models fine-tuned for MOS prediction can predict ratings of out-of-domain speech well, even at the utterance level, which is the most challenging case.
Contribution
The study evaluates several MOS prediction networks, including MOSNet and self-supervised speech models such as wav2vec2, on data from different listening tests in both zero-shot and fine-tuned settings, and highlights the benefits of fine-tuning.
Findings
Wav2vec2 models fine-tuned for MOS prediction generalize well to out-of-domain data.
Fine-tuning on in-domain data can further improve predictions.
Unseen speech synthesis systems pose challenges for MOS prediction models.
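The TL;DR singles out utterance-level prediction as the hard case. One common way to see why is to compare correlation computed over individual utterances with correlation computed over per-system averages, where averaging smooths out per-utterance noise. The sketch below illustrates that comparison only; it is not the paper's evaluation code. The record fields (system, true_mos, pred_mos), the helper names, and the toy scores are all made up for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_means(records, key):
    """Average one score field over all utterances of each system."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["system"]].append(rec[key])
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


def evaluate(records):
    """Return utterance-level and system-level correlation."""
    utt_corr = pearson(
        [rec["true_mos"] for rec in records],
        [rec["pred_mos"] for rec in records],
    )
    true_sys = system_means(records, "true_mos")
    pred_sys = system_means(records, "pred_mos")
    systems = sorted(true_sys)
    sys_corr = pearson(
        [true_sys[s] for s in systems],
        [pred_sys[s] for s in systems],
    )
    return {"utterance": utt_corr, "system": sys_corr}


# Hypothetical toy data: two utterances from each of three systems.
records = [
    {"system": "A", "true_mos": 4.2, "pred_mos": 3.9},
    {"system": "A", "true_mos": 3.8, "pred_mos": 4.1},
    {"system": "B", "true_mos": 2.9, "pred_mos": 3.2},
    {"system": "B", "true_mos": 3.3, "pred_mos": 2.8},
    {"system": "C", "true_mos": 1.8, "pred_mos": 2.1},
    {"system": "C", "true_mos": 2.2, "pred_mos": 1.9},
]

scores = evaluate(records)
print(scores)
```

In this toy data the predictor ranks the systems almost perfectly but gets the ordering of utterances within each system wrong, so the system-level correlation comes out higher than the utterance-level one.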
Abstract
Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction including MOSNet and self-supervised speech models such as wav2vec2, we investigate their performance on data from different listening tests in both zero-shot and fine-tuned settings. We find that wav2vec2 models fine-tuned for MOS prediction have good generalization capability to out-of-domain data even for the most challenging case of utterance-level predictions in the zero-shot…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing
Methods: Test
