UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment

Wei Wang; Wangyou Zhang; Chenda Li; Jiahe Wang; Samuele Cornell; Marvin Sach; Kohei Saijo; Yihui Fu; Zhaoheng Ni; Bing Han; Xun Gong; Mengxiao Bi; Tim Fingscheidt; Shinji Watanabe; Yanmin Qian

arXiv:2601.18438 · cs.SD · January 27, 2026

TL;DR

UrgentMOS is a unified speech quality assessment framework that learns jointly from diverse objective and perceptual quality metrics and from pairwise preferences, improving robustness when trained across heterogeneous speech datasets.

Contribution

The paper introduces a multi-metric and preference learning approach that handles heterogeneous supervision and partial annotations, so that data labeled with only a subset of quality metrics can still be used for training.

Findings

01. Achieves state-of-the-art results in absolute speech quality prediction.

02. Effectively models pairwise preferences for comparative evaluation.

03. Demonstrates robustness across diverse speech datasets and distortions.
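As a rough illustration of what pairwise preference modeling (finding 02) can look like, the sketch below scores two utterances and turns the score difference into a preference probability. This summary does not state which preference objective UrgentMOS uses, so the Bradley-Terry style logistic loss here is an assumption, and the function names `preference_probability` and `preference_loss` are hypothetical.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that utterance A is preferred over B.

    Assumption: a Bradley-Terry style model, i.e. a logistic function of the
    difference between two predicted quality scores. Not taken from the paper.
    """
    diff = score_a - score_b
    # Numerically stable logistic function for either sign of diff.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def _softplus(x: float) -> float:
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(score_a: float, score_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of the observed preference label."""
    # Margin is positive when the model agrees with the listener's choice.
    margin = score_a - score_b if a_preferred else score_b - score_a
    # -log(sigmoid(margin)) == softplus(-margin)
    return _softplus(-margin)


if __name__ == "__main__":
    # Hypothetical predicted scores for two utterances; A was preferred.
    print(preference_probability(4.1, 3.2))
    print(preference_loss(4.1, 3.2, a_preferred=True))
```

The loss shrinks as the predicted score of the preferred utterance rises above the other, which is the behavior a comparative evaluation objective needs regardless of the exact form the authors chose.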

Abstract

Automatic speech quality assessment has become increasingly important as modern speech generation systems continue to advance, while human listening tests remain costly, time-consuming, and difficult to scale. Most existing learning-based assessment models rely primarily on scarce human-annotated mean opinion score (MOS) data, which limits robustness and generalization, especially when training across heterogeneous datasets. In this work, we propose UrgentMOS, a unified speech quality assessment framework that jointly learns from diverse objective and perceptual quality metrics, while explicitly tolerating the absence of arbitrary subsets of metrics during training. By leveraging complementary quality facets under heterogeneous supervision, UrgentMOS enables effective utilization of partially annotated data and improves robustness when trained on large-scale, multi-source datasets.…
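One common way to tolerate missing metrics during training is to mask the loss so that only labeled (utterance, metric) pairs contribute. The sketch below shows that idea with a plain mean squared error; it is a minimal illustration under that assumption, not the loss used in the paper, and the function name `masked_multi_metric_mse` and the example metric layout are hypothetical.

```python
from typing import Optional, Sequence


def masked_multi_metric_mse(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[Optional[float]]],
) -> float:
    """Mean squared error over labeled (utterance, metric) pairs only.

    Each row is one utterance and each column is one quality metric.
    A target of None marks a metric that was not annotated for that utterance.
    Assumption: squared error with uniform weighting across metrics.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if target is None:
                # Missing label: this entry adds nothing to the loss.
                continue
            total += (pred - target) ** 2
            count += 1
    # If no labels are present at all, report zero loss instead of dividing by zero.
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical two-metric layout: column 0 is a MOS label, column 1 an objective metric.
    preds = [[3.0, 2.0], [4.0, 1.0]]
    labels = [[4.0, None], [None, 1.0]]
    print(masked_multi_metric_mse(preds, labels))
```

Because unlabeled entries are skipped rather than imputed, datasets annotated with different metric subsets can be pooled in one batch, which matches the partially annotated setting the abstract describes.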

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Image and Video Quality Assessment