The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

Erica Cooper; Wen-Chin Huang; Yu Tsao; Hsin-Min Wang; Tomoki Toda; Junichi Yamagishi

arXiv:2310.02640 · eess.AS · October 10, 2023 · ASRU · 2 cites

TL;DR

The VoiceMOS Challenge 2023 focused on advancing zero-shot, out-of-domain speech quality prediction across multiple voice synthesis scenarios, highlighting the effectiveness of diverse datasets and listener data.

Contribution

This paper presents the second VoiceMOS Challenge, which emphasizes real-world, zero-shot speech quality prediction across three evaluation tracks, and summarizes the diverse approaches taken by the participating teams.

Findings

1. Large differences in predictability between the two French TTS sub-tracks
2. Singing voice-converted samples were easier to predict than expected
3. Using diverse datasets and listener information during training appeared to improve prediction accuracy

Abstract

We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches.
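
The abstract does not spell out how MOS is computed or how a predictor is scored, so the sketch below illustrates one common setup under stated assumptions: an utterance's MOS is the mean of its listener ratings, utterance scores are averaged into per-system scores, and predicted and true system-level scores are compared with Spearman's rank correlation. All function names and the toy ratings are hypothetical, and this is not the challenge's official scoring script.

```python
from collections import defaultdict
from statistics import mean


def mos(ratings):
    """Mean opinion score: the arithmetic mean of listener ratings (e.g. 1-5)."""
    return mean(ratings)


def system_level(scores_by_utt, system_of):
    """Average utterance-level scores within each system (hypothetical helper)."""
    groups = defaultdict(list)
    for utt, score in scores_by_utt.items():
        groups[system_of[utt]].append(score)
    return {system: mean(scores) for system, scores in groups.items()}


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy example with made-up listener ratings and made-up predictor outputs.
ratings = {"a1": [4, 5, 4], "a2": [3, 4], "b1": [2, 3, 2], "b2": [3, 3], "c1": [5, 5]}
system_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
pred_utt = {"a1": 3.9, "a2": 3.6, "b1": 2.5, "b2": 2.8, "c1": 4.4}

true_sys = system_level({u: mos(r) for u, r in ratings.items()}, system_of)
pred_sys = system_level(pred_utt, system_of)
systems = sorted(true_sys)
rho = spearman([true_sys[s] for s in systems], [pred_sys[s] for s in systems])
print(rho)  # 1.0: the predictor ranks systems B < A < C, matching the listeners
```

A rank correlation rewards a predictor for ordering systems correctly even if its absolute scores are offset from the listener scale, which suits comparisons where the ranking of systems matters most; error measures such as mean squared error capture absolute accuracy instead.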

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Natural Language Processing Techniques