Speech MOS multi-task learning and rater bias correction
Haleh Akrami, Hannes Gamper

TL;DR
This paper introduces a multi-task learning framework for blind speech MOS estimation that uses additional labels such as reverberation time (T60) and clarity (C50), and takes first steps toward correcting individual rater bias to improve speech quality assessment.
Contribution
It proposes a novel multi-task and semi-supervised learning approach to enhance blind MOS estimation and begins to tackle individual rater bias correction.
Findings
Joint estimation of MOS, T60, and C50 improves performance.
Combining different MOS datasets enhances model robustness.
Preliminary bias correction shows potential benefits.
Abstract
Perceptual speech quality is an important performance metric for teleconferencing applications. The mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample. Recently, there has been increasing research interest in developing models for estimating MOS blindly. Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model. Experimental results indicate that the proposed model can be trained to jointly estimate MOS, reverberation time (T60), and clarity (C50) by combining two disjoint data sets in training, one containing only MOS labels and the other containing only T60 and C50 labels. Furthermore, we use a semi-supervised framework to combine two MOS data sets in training, one containing only MOS…
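One common way to train a single model on disjoint label sets, as the abstract describes, is to mask out the loss terms for labels a sample does not have. The sketch below illustrates that general idea only; it is not the authors' implementation, and the loss actually used in the paper is not given in this summary. The function name `masked_multitask_mse`, the `weights` parameter, and the convention of marking missing labels with NaN are all assumptions made for illustration.

```python
import numpy as np

def masked_multitask_mse(pred, target, weights=None):
    """Per-task mean squared error that skips missing labels.

    pred, target: arrays of shape (batch, n_tasks), with hypothetical
    columns ordered as [MOS, T60, C50]. A NaN in `target` means the
    sample has no label for that task (assumed convention).
    Returns (weighted total loss, per-task losses).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = ~np.isnan(target)                      # True where a label exists
    safe_target = np.where(mask, target, 0.0)     # avoid NaN arithmetic
    sq_err = np.where(mask, (pred - safe_target) ** 2, 0.0)
    n_labeled = mask.sum(axis=0)                  # labeled samples per task
    per_task = sq_err.sum(axis=0) / np.maximum(n_labeled, 1)
    if weights is None:
        weights = np.ones(pred.shape[1])          # equal task weights (assumption)
    return float(np.dot(weights, per_task)), per_task

# A mixed batch: row 0 comes from a MOS-only set, row 1 from a T60/C50-only set.
nan = float("nan")
targets = [[4.0, nan, nan],
           [nan, 0.6, 12.0]]
preds = [[3.0, 0.5, 10.0],
         [2.0, 0.4, 10.0]]
total, per_task = masked_multitask_mse(preds, targets)
```

With this masking, each sample contributes to the loss only through the tasks it is labeled for, so batches can freely mix samples from both data sets. How the task terms should be weighted against each other is a design choice left open here.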
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques
