Generalization Ability of MOS Prediction Networks

Erica Cooper; Wen-Chin Huang; Tomoki Toda; Junichi Yamagishi

arXiv:2110.02635 · eess.AS · February 15, 2022 · ICASSP

TL;DR

This paper investigates how well MOS prediction networks generalize across listening test contexts. It finds that wav2vec2 models fine-tuned for MOS prediction generalize well to out-of-domain speech ratings, even in the most challenging case of utterance-level prediction.

Contribution

The study evaluates several MOS prediction networks, including MOSNet and self-supervised speech models such as wav2vec2, on data from different listening tests in both zero-shot and fine-tuned settings, and highlights the benefits of fine-tuning.

Findings

1. Wav2vec2 models fine-tuned for MOS prediction generalize well to out-of-domain data.
2. Fine-tuning improves prediction accuracy on in-domain data.
3. Unseen speech synthesis systems pose challenges for MOS prediction models.
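
To make the utterance-level case concrete, here is a minimal Python sketch that scores the same predictions two ways: once per utterance, and once after averaging per synthesis system. The toy ratings, the `system_means` helper, and the choice of Pearson correlation are illustrative assumptions, not data or metrics taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_means(records, key):
    """Average a per-utterance score over all utterances of each system."""
    by_system = defaultdict(list)
    for rec in records:
        by_system[rec["system"]].append(rec[key])
    return {name: sum(v) / len(v) for name, v in sorted(by_system.items())}


# Hypothetical toy data: one record per utterance (not from the paper).
records = [
    {"system": "A", "true": 4.0, "pred": 3.5},
    {"system": "A", "true": 3.0, "pred": 3.5},
    {"system": "B", "true": 2.0, "pred": 2.5},
    {"system": "B", "true": 3.0, "pred": 2.5},
]

# Utterance level: compare every predicted score with its own rating.
utt_corr = pearson([r["true"] for r in records], [r["pred"] for r in records])

# System level: average per system first, then compare the averages.
true_sys = system_means(records, "true")
pred_sys = system_means(records, "pred")
sys_corr = pearson(list(true_sys.values()), list(pred_sys.values()))

print(f"utterance-level correlation: {utt_corr:.3f}")
print(f"system-level correlation:    {sys_corr:.3f}")
```

In this toy case the predictor ranks the two systems correctly but misses the variation between utterances of the same system, so the utterance-level correlation comes out lower than the system-level one. That is one way to see why utterance-level prediction is the harder setting.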

Abstract

Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction including MOSNet and self-supervised speech models such as wav2vec2, we investigate their performance on data from different listening tests in both zero-shot and fine-tuned settings. We find that wav2vec2 models fine-tuned for MOS prediction have good generalization capability to out-of-domain data even for the most challenging case of utterance-level predictions in the zero-shot…

Code & Models

Repositories

nii-yamagishilab/mos-finetune-ssl (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing

Methods: Test