The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

TL;DR
The VoiceMOS Challenge 2023 targeted zero-shot, out-of-domain prediction of speech quality (MOS) in three tracks covering different voice evaluation scenarios. Training on diverse datasets and using listener information appeared to be successful approaches.
Contribution
This paper presents the second VoiceMOS Challenge, which emphasizes real-world, zero-shot, out-of-domain speech quality prediction across three evaluation tracks and summarizes the approaches taken by the ten participating teams.
Findings
The two French TTS sub-tracks differed greatly in predictability
Singing voice-converted samples were not as difficult to predict as expected
Using diverse datasets and listener information during training appeared to be successful
Abstract
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches.
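For readers unfamiliar with the target quantity, the following is a minimal sketch of how a mean opinion score is typically aggregated from listener ratings. The rating tuples, the 1 to 5 scale, and the function names `utterance_mos` and `system_mos` are illustrative assumptions, not data or code from the challenge; averaging utterance-level MOS to obtain a system-level score is likewise one common convention assumed here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings (not challenge data):
# (system_id, utterance_id, listener_id, score on an assumed 1-5 scale)
ratings = [
    ("sysA", "utt1", "l1", 4),
    ("sysA", "utt1", "l2", 5),
    ("sysA", "utt2", "l1", 3),
    ("sysB", "utt3", "l1", 2),
    ("sysB", "utt3", "l2", 3),
]


def utterance_mos(ratings):
    """Average all listener scores given to each (system, utterance) pair."""
    by_utt = defaultdict(list)
    for system, utt, _listener, score in ratings:
        by_utt[(system, utt)].append(score)
    return {key: mean(scores) for key, scores in by_utt.items()}


def system_mos(ratings):
    """Average the utterance-level MOS values belonging to each system."""
    by_sys = defaultdict(list)
    for (system, _utt), mos in utterance_mos(ratings).items():
        by_sys[system].append(mos)
    return {system: mean(values) for system, values in by_sys.items()}


if __name__ == "__main__":
    print(utterance_mos(ratings))
    print(system_mos(ratings))
```

An automatic MOS predictor aims to estimate these values from the audio alone, without collecting new listener ratings.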
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Natural Language Processing Techniques
