The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

Wen-Chin Huang; Szu-Wei Fu; Erica Cooper; Ryandhimas E. Zezario; Tomoki Toda; Hsin-Min Wang; Junichi Yamagishi; Yu Tsao

arXiv:2409.07001·cs.SD·September 12, 2024

Open Access

TL;DR

The VoiceMOS Challenge 2024 ran three tracks on automatic prediction of human speech ratings. Many of the eight participating teams outperformed the baseline systems, with retrieval-based methods and non-self-supervised representations among the successful techniques.

Contribution

This is the third edition of the VoiceMOS Challenge. It introduces new tracks and shows that retrieval-based methods and non-self-supervised representations, such as spectrograms and pitch histograms, are effective for speech quality prediction.

Findings

01. Many teams outperformed baseline systems.

02. Retrieval-based methods were effective.

03. Non-self-supervised representations improved predictions.

Abstract

We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing