Uni-VERSA: Versatile Speech Assessment with a Unified Network
Jiatong Shi, Hye-Jin Shim, Shinji Watanabe

TL;DR
Uni-VERSA is a unified neural network that predicts multiple speech quality metrics simultaneously, offering a comprehensive, scalable, and human-aligned alternative to traditional subjective listening tests.
Contribution
Uni-VERSA introduces a multi-task framework for speech assessment that covers multiple quality aspects in a single model, improving efficiency and consistency.
Findings
Offers a viable alternative to single-aspect evaluation methods on a benchmark based on the URGENT24 challenge
Aligns closely with human perception of speech quality
Demonstrates versatility across speech enhancement and synthesis tasks
Abstract
Subjective listening tests remain the gold standard for speech quality assessment, but are costly, variable, and difficult to scale. In contrast, existing objective metrics, such as PESQ, F0 correlation, and DNSMOS, typically capture only specific aspects of speech quality. To address these limitations, we introduce Uni-VERSA, a unified network that simultaneously predicts various objective metrics, encompassing naturalness, intelligibility, speaker characteristics, prosody, and noise, for a comprehensive evaluation of speech signals. We formalize its framework, evaluation protocol, and applications in speech enhancement, synthesis, and quality control. A benchmark based on the URGENT24 challenge, along with a baseline leveraging self-supervised representations, demonstrates that Uni-VERSA provides a viable alternative to single-aspect evaluation methods. Moreover, it aligns closely…
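To make the "unified network" idea in the abstract concrete, here is a minimal sketch of multi-metric prediction from shared features. It is an illustration under stated assumptions, not the authors' architecture: the abstract does not specify the pooling, the head design, or the loss, so mean pooling, one linear head per metric, and a masked squared-error loss are all assumed here. The names `METRICS`, `mean_pool`, `MultiMetricHead`, and `multitask_loss` are hypothetical, and the metric keys simply mirror the aspect names listed in the abstract.

```python
import random
from statistics import fmean

# Hypothetical output names, mirroring the aspects listed in the abstract.
METRICS = ["naturalness", "intelligibility", "speaker_characteristics",
           "prosody", "noise"]


def mean_pool(frames):
    """Average frame-level feature vectors over time into one utterance vector."""
    return [fmean(column) for column in zip(*frames)]


class MultiMetricHead:
    """One linear regression head per metric on a shared pooled embedding.

    Assumption: the frame features would come from some upstream encoder
    (for example a self-supervised speech model); here they are plain lists.
    """

    def __init__(self, dim, metrics, seed=0):
        rng = random.Random(seed)
        # Small random weights stand in for trained parameters.
        self.weights = {m: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for m in metrics}
        self.bias = {m: 0.0 for m in metrics}

    def predict(self, frames):
        """Return one scalar score per metric for a single utterance."""
        pooled = mean_pool(frames)
        return {m: sum(w * x for w, x in zip(self.weights[m], pooled))
                + self.bias[m]
                for m in self.weights}


def multitask_loss(pred, target):
    """Mean squared error over the metrics that have a label in `target`.

    Metrics without a label are skipped; at least one label is required.
    """
    labelled = [m for m in pred if m in target]
    return fmean((pred[m] - target[m]) ** 2 for m in labelled)


if __name__ == "__main__":
    frames = [[0.2, -0.1, 0.5], [0.4, 0.0, 0.3]]  # 2 frames, 3-dim features
    head = MultiMetricHead(dim=3, metrics=METRICS)
    scores = head.predict(frames)
    print(scores)
    print(multitask_loss(scores, {"naturalness": 3.5, "noise": 0.1}))
```

A single forward pass yields every score at once, which is the efficiency argument for a unified model over running one predictor per metric. Skipping unlabelled metrics in the loss is one simple way to train on data where not every metric is available for every utterance.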
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
