Evaluation of Speech Representations for MOS prediction
Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Lucas R. S. Gris, Anderson S. Soares, and Arlindo R. Galvão Filho

TL;DR
This study evaluates various speech feature extraction models, especially Whisper, for predicting speech quality (MOS), demonstrating that Whisper models perform best across datasets and are suitable for real-time applications.
Contribution
The paper introduces a comprehensive comparison of supervised, self-supervised, and speaker verification models for MOS prediction, highlighting Whisper's effectiveness and proposing a new evaluation dataset.
Findings
Whisper models outperform others in MOS prediction across datasets.
Whisper-Small achieves a linear correlation of 0.6980 on BRSpeechMOS.
SpeakerNet, a speaker verification model with 5M parameters, is suitable for real-time applications.
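
The linear correlation figures quoted in these findings compare predicted MOS values against human ratings. A minimal sketch of that computation follows, assuming the standard Pearson coefficient; the function name pearson_lcc and the sample scores are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def pearson_lcc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Pearson linear correlation between predicted and reference MOS scores."""
    if len(predicted) != len(reference):
        raise ValueError("both sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two score pairs are required")

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Hypothetical scores on the 1-5 MOS scale, for illustration only.
    model_scores = [3.1, 4.2, 2.5, 3.8, 4.6]
    human_scores = [3.0, 4.5, 2.0, 3.5, 4.8]
    print(f"LCC = {pearson_lcc(model_scores, human_scores):.4f}")
```

A value near 1 means the predictions rise and fall with the human ratings; the coefficient does not penalize a constant offset or scale difference between the two.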
Abstract
In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models to predict the MOS metric. Our experiments were performed on the VCC2018 dataset and a Brazilian-Portuguese dataset called BRSpeechMOS, which was created for this work. The results show that the Whisper model is appropriate in all scenarios, with both the VCC2018 and BRSpeechMOS datasets. Among the supervised and self-supervised learning models using BRSpeechMOS, Whisper-Small achieved the best linear correlation of 0.6980, and the speaker verification model, SpeakerNet, had a linear correlation of 0.6963. Using VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
