SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction
Saurabh Agrawal, Raj Gohil, Gopal Kumar Agrawal, Vikram C M, Kushal Verma

TL;DR
This paper introduces SALF-MOS, a scalable and generalized deep learning model that predicts speech quality scores, reducing reliance on manual subjective evaluations and improving efficiency in voice synthesis assessment.
Contribution
The paper presents a novel, end-to-end, speaker-agnostic model for MOS prediction that outperforms existing metrics in accuracy and scalability.
Findings
Achieved state-of-the-art results in MOS prediction metrics.
Demonstrated high generalization across different speakers and speech samples.
Reduced manual effort in speech quality assessment.
Abstract
Speech quality assessment is a critical step in selecting text-to-speech (TTS) synthesis or voice conversion models. Voice synthesis can be evaluated using objective or subjective metrics. Although there are many objective metrics, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is feasible for selecting the best model. Subjective metrics such as the Mean Opinion Score (MOS), on the other hand, are highly reliable but require considerable manual effort and are time-consuming. To counter these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), a small, end-to-end, highly generalized, and scalable model for predicting MOS on a scale of 5. We use the sequences of…
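For readers unfamiliar with the target quantity, the following is a minimal sketch of how a human MOS is conventionally obtained (the arithmetic mean of listener ratings) and how a predicted MOS might be compared against it. It is not the SALF-MOS model itself. The 1 to 5 rating range, the utterance names, the rating values, the predicted scores, and the choice of mean squared error as the comparison are all assumptions made for illustration; the abstract only states that scores are on a scale of 5.

```python
def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings (assumed 1-5 scale)."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are assumed to lie in [1, 5]")
    return sum(ratings) / len(ratings)


def mean_squared_error(predicted, reference):
    """Average squared gap between predicted and human MOS values."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical listener ratings for two synthesized utterances.
ratings_per_utterance = {
    "utt_a": [4, 5, 4, 3],
    "utt_b": [2, 3, 2, 1],
}
human_mos = {name: mean_opinion_score(r) for name, r in ratings_per_utterance.items()}

# Hypothetical outputs of some MOS predictor (placeholder values, not paper results).
predicted_mos = {"utt_a": 3.5, "utt_b": 2.5}

names = sorted(human_mos)
mse = mean_squared_error(
    [predicted_mos[n] for n in names],
    [human_mos[n] for n in names],
)
```

Collecting the ratings is the costly manual step that an automatic predictor aims to replace; the comparison shows one simple way to check how closely predictions track the human scores.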
