SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction

Saurabh Agrawal; Raj Gohil; Gopal Kumar Agrawal; Vikram C M; Kushal Verma

arXiv:2506.02082 · cs.SD · June 4, 2025

TL;DR

This paper introduces SALF-MOS, a scalable and generalized deep learning model that predicts speech quality scores, reducing reliance on manual subjective evaluations and improving efficiency in voice synthesis assessment.

Contribution

The paper presents a novel, end-to-end, speaker-agnostic model for MOS prediction that outperforms existing metrics in accuracy and scalability.

Findings

01. Achieved state-of-the-art results in MOS prediction metrics.

02. Demonstrated high generalization across different speakers and speech samples.

03. Reduced manual effort in speech quality assessment.

Abstract

Speech quality assessment is a critical step in selecting text-to-speech (TTS) synthesis or voice conversion models. Voice synthesis can be evaluated using objective or subjective metrics. Although there are many objective metrics, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is feasible for selecting the best model. Subjective metrics such as the Mean Opinion Score (MOS), on the other hand, are highly reliable but time-consuming and require considerable manual effort. To counter these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small, end-to-end, highly generalized, and scalable model for predicting MOS on a scale of 5. We use the sequences of…
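For readers unfamiliar with the metric, a MOS is conventionally the arithmetic mean of the ratings that human listeners give to a speech sample. The sketch below only illustrates that manual baseline, not the SALF-MOS model itself. The function name, the example ratings, and the 1-to-5 rating range are assumptions made for illustration; the abstract states only "a scale of 5".

```python
def mean_opinion_score(ratings):
    """Return the arithmetic mean of listener ratings for one speech sample.

    Assumption: ratings lie on a 1-to-5 scale, as in common listening tests.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the assumed 1-5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from five listeners for a single synthesized utterance.
example_ratings = [4, 5, 3, 4, 4]
print(mean_opinion_score(example_ratings))
```

Every sample needs its own panel of listeners, which is the manual cost that a learned MOS predictor aims to avoid.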
