Speech Quality Assessment through MOS using Non-Matching References
Pranay Manocha, Anurag Kumar

TL;DR
This paper introduces NORESQA-MOS, a neural network framework that conditions MOS estimation on non-matching references to improve robustness and generalization in speech quality assessment, outperforming prior methods such as DNSMOS and NISQA.
Contribution
The paper presents a new MOS estimation framework using non-matching references, enhancing robustness and generalization over prior deep learning approaches.
Findings
NORESQA-MOS generalizes better and gives more robust MOS estimates than DNSMOS and NISQA.
It achieves this while using a smaller training set.
The framework can be combined with other learning methods such as self-supervised learning.
Abstract
Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals. However, several recent attempts to automatically estimate MOS using deep learning approaches lack robustness and generalization capabilities, limiting their use in real-world applications. In this work, we present a novel framework, NORESQA-MOS, for estimating the MOS of a speech signal. Unlike prior works, our approach uses non-matching references as a form of conditioning to ground the MOS estimation by neural networks. We show that NORESQA-MOS provides better generalization and more robust MOS estimation than previous state-of-the-art methods such as DNSMOS and NISQA, even though we use a smaller training set. Moreover, we also show that our generic framework can be combined with other learning methods such as self-supervised learning and can further…
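As a rough illustration of the two ideas in the abstract, the sketch below first computes a MOS as the mean of listener ratings, then shows one plausible way references with known quality could ground an estimate. The summary above does not specify the mechanism, so `estimate_mos_with_references`, the `predict_difference` callable, and the 1 to 5 rating range are all assumptions made for this example, not the authors' implementation.

```python
from typing import Any, Callable, Sequence, Tuple


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """MOS is the arithmetic mean of individual listener ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


def estimate_mos_with_references(
    test_clip: Any,
    references: Sequence[Tuple[Any, float]],
    predict_difference: Callable[[Any, Any], float],
    lo: float = 1.0,
    hi: float = 5.0,
) -> float:
    """Hypothetical grounding scheme (an assumption, not from the paper summary).

    Each reference is a (clip, known_mos) pair whose content need not match
    the test clip. `predict_difference(test, ref)` stands in for a trained
    model that outputs how much better or worse the test clip sounds than
    the reference. The per-reference estimates are averaged and clipped to
    the assumed rating range [lo, hi].
    """
    if not references:
        raise ValueError("need at least one reference")
    estimates = [
        ref_mos + predict_difference(test_clip, ref_clip)
        for ref_clip, ref_mos in references
    ]
    return min(hi, max(lo, sum(estimates) / len(estimates)))


# Toy usage: clips are replaced by plain numbers standing in for quality,
# and the "model" is a simple subtraction. Real inputs would be waveforms.
toy_model = lambda test, ref: test - ref
toy_references = [(4.5, 4.5), (3.0, 3.0)]
toy_estimate = estimate_mos_with_references(2.0, toy_references, toy_model)
```

Averaging over several references is one simple way to reduce the influence of any single reference choice; whether NORESQA-MOS aggregates in this way is not stated in the text above.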
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
