Selecting N-lowest scores for training MOS prediction models

Yuto Kondo; Hirokazu Kameoka; Kou Tanaka; Takuhiro Kaneko

arXiv:2506.18326 · cs.SD · June 24, 2025

TL;DR

This paper proposes training speech quality prediction models on N_low-MOS, the mean of the N-lowest opinion scores, instead of the regular MOS. The authors argue that this target better reflects how listeners focus on poor-quality segments, and they report that it improves the correlation between model predictions and subjective ratings.

Contribution

The paper introduces N_low-MOS as a new, more reliable training target for MOS prediction models. The target emphasizes low-quality speech segments, with the aim of improving prediction accuracy.
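As a minimal sketch of the quantity described above, the following Python computes the mean of the N-lowest opinion scores for one speech sample and compares it with the regular MOS. The function name `n_low_mos`, the handling of samples with fewer than N ratings, and the example ratings are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def n_low_mos(scores, n):
    """Return the mean of the n lowest opinion scores for one sample.

    Hypothetical helper: if the sample has fewer than n ratings,
    all of them are used (an assumption, not the paper's rule).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not scores:
        raise ValueError("scores must not be empty")
    lowest = sorted(scores)[:n]  # the n smallest ratings
    return mean(lowest)


# Illustrative ratings from four listeners for a single sample.
ratings = [2, 4, 3, 5]

regular_mos = mean(ratings)      # mean of all ratings
low_mos = n_low_mos(ratings, 2)  # mean of the two lowest ratings (2 and 3)

print(regular_mos, low_mos)  # 3.5 2.5
```

When N equals the number of ratings, this value coincides with the regular MOS, so N controls how strongly the target leans toward the most critical listeners.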

Findings

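1. Training on N_low-MOS improves the linear correlation coefficient (LCC) and Spearman's rank correlation coefficient (SRCC) compared with training on regular MOS.

2. N_low-MOS yields a more intrinsic measure of speech quality.

3. The approach enhances MOSNet's ability to evaluate voice conversion models.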
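The two metrics named in the findings, LCC and SRCC, can be sketched in plain Python as follows. This is a generic illustration of the Pearson linear correlation and Spearman's rank correlation (with average ranks for ties), not the paper's evaluation code; the function names and the example numbers are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Linear correlation coefficient (LCC) between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: target scores and model predictions for four samples.
targets = [1.5, 2.0, 3.5, 4.0]
predictions = [1.0, 2.5, 3.0, 4.5]

lcc = pearson(targets, predictions)
srcc = spearman(targets, predictions)
print(lcc, srcc)
```

In this invented example the predictions preserve the ordering of the targets exactly, so SRCC is 1 while LCC is slightly lower, which shows why the two metrics are usually reported together.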

Abstract

The automatic speech quality assessment (SQA) has been extensively studied to predict the speech quality without time-consuming questionnaires. Recently, neural-based SQA models have been actively developed for speech samples produced by text-to-speech or voice conversion, with a primary focus on training mean opinion score (MOS) prediction models. The quality of each speech sample may not be consistent across the entire duration, and it remains unclear which segments of the speech receive the primary focus from humans when assigning subjective evaluation for MOS calculation. We hypothesize that when humans rate speech, they tend to assign more weight to low-quality speech segments, and the variance in ratings for each sample is mainly due to accidental assignment of higher scores when overlooking the poor quality speech segments. Motivated by the hypothesis, we analyze the VCC2018 and…
