Selecting N-lowest scores for training MOS prediction models
Yuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko

TL;DR
This paper proposes training speech quality prediction models on the mean of the N lowest opinion scores (N_low-MOS) instead of the regular MOS. The motivation is the hypothesis that listeners weight poor-quality segments most heavily when rating a sample. Models trained this way correlate better with subjective ratings.
Contribution
It introduces N_low-MOS as a new, more reliable target for training MOS prediction models, emphasizing low-quality speech segments to enhance prediction accuracy.
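As a minimal sketch of the definition above, N_low-MOS for one speech sample can be computed by sorting that sample's opinion scores and averaging the N smallest. The function name `n_low_mos`, the example ratings, and the behavior when fewer than N ratings exist are illustrative assumptions, not details taken from the paper.

```python
def n_low_mos(scores, n):
    """Mean of the n lowest opinion scores for one speech sample.

    Hypothetical helper: the name and edge-case handling are assumptions.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not scores:
        raise ValueError("scores must not be empty")
    # If fewer than n ratings exist, all of them are used (an assumption).
    lowest = sorted(scores)[:n]
    return sum(lowest) / len(lowest)


# Hypothetical listener ratings for a single sample.
ratings = [4, 2, 5, 3]
regular_mos = sum(ratings) / len(ratings)  # mean of all four ratings
low_mos = n_low_mos(ratings, 2)            # mean of the two lowest: 2 and 3
```

When N equals the number of ratings, this reduces to the regular MOS, so N controls how strongly the target leans toward the lowest scores.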
Findings
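Training with N_low-MOS improves the linear correlation coefficient (LCC) and Spearman's rank correlation coefficient (SRCC) compared with training on regular MOS.
The results suggest that N_low-MOS is a more intrinsic representative value of subjective speech quality.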
The approach enhances MOSNet's ability to evaluate voice conversion models.
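The two metrics named in the findings can be sketched in plain Python as follows. LCC is the Pearson correlation between predicted and target scores, and SRCC is the same correlation applied to their ranks, with ties given averaged ranks. The function names and the example numbers are hypothetical, and this is not the paper's evaluation code.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC) between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted scores and training targets for five samples.
predicted = [2.1, 3.4, 3.0, 4.2, 1.8]
target = [2.0, 3.5, 3.5, 4.5, 1.5]
lcc = pearson(predicted, target)
srcc = spearman(predicted, target)
```

LCC is sensitive to how far the predictions sit from a linear fit, while SRCC only checks whether samples are ordered the same way, which is the property that matters when ranking systems against each other.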
Abstract
Automatic speech quality assessment (SQA) has been extensively studied as a way to predict speech quality without time-consuming questionnaires. Recently, neural-based SQA models have been actively developed for speech samples produced by text-to-speech or voice conversion, with a primary focus on training mean opinion score (MOS) prediction models. The quality of a speech sample may not be consistent across its entire duration, and it remains unclear which segments humans focus on most when assigning the subjective ratings used to calculate MOS. We hypothesize that when humans rate speech, they tend to assign more weight to low-quality speech segments, and that the variance in ratings for each sample is mainly due to higher scores being assigned accidentally when poor-quality segments are overlooked. Motivated by the hypothesis, we analyze the VCC2018 and…
