The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction
Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao

TL;DR
The VoiceMOS Challenge 2024 ran three tracks on automatic prediction of human speech ratings: zoomed-in high-quality synthetic speech, singing voice synthesis and conversion, and semi-supervised quality prediction for noisy, clean, and enhanced speech. Many of the eight participating teams outperformed the baseline systems.
Contribution
This is the third edition of the VoiceMOS Challenge. It introduces three new tracks and reports that retrieval-based methods and non-self-supervised representations, such as spectrograms and pitch histograms, were among the successful techniques for speech quality prediction.
Findings
Many teams outperformed baseline systems.
Retrieval-based methods were effective.
Non-self-supervised representations improved predictions.
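To make the phrase "retrieval-based" concrete, here is a minimal sketch of one way such a predictor could work: look up the labeled samples whose features are closest to a query and average their ratings. This is an illustration only, not the method of any challenge team. The function names, the cosine-similarity weighting, and the toy feature vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_and_predict(query, datastore, k=2):
    """Predict a rating for `query` from its k most similar labeled samples.

    `datastore` is a list of (feature_vector, rating) pairs. Both the
    structure and the weighting scheme are hypothetical choices.
    """
    if not datastore:
        raise ValueError("datastore must contain at least one labeled sample")
    # Rank labeled samples by similarity to the query, most similar first.
    ranked = sorted(
        ((cosine_similarity(query, feat), rating) for feat, rating in datastore),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = ranked[:k]
    # Negative similarities are clipped so they cannot produce negative weights.
    weights = [max(sim, 0.0) for sim, _ in top]
    total = sum(weights)
    if total == 0.0:
        # No informative neighbor: fall back to a plain average.
        return sum(rating for _, rating in top) / len(top)
    return sum(w * rating for w, (_, rating) in zip(weights, top)) / total


# Toy labeled set: (feature vector, listener rating). Values are made up.
toy_datastore = [([1.0, 0.0], 4.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
prediction = retrieve_and_predict([1.0, 0.0], toy_datastore, k=1)
```

In practice the feature vectors would come from some audio representation, and the number of neighbors and the weighting would be tuned on held-out data. The sketch leaves those choices open.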
Abstract
We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
