Evaluation of Speech Representations for MOS prediction

Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Lucas R. S. Gris, Anderson S. Soares, and Arlindo R. Galvão Filho

arXiv:2306.09979 · cs.SD · June 19, 2023 · 1 citation

TL;DR

This study evaluates speech feature extraction models, especially Whisper, for predicting speech quality (MOS). Whisper models perform well on both VCC2018 and BRSpeechMOS, while the speaker verification model SpeakerNet, with 5M parameters, is suitable for real-time applications.

Contribution

The paper compares embeddings from supervised, self-supervised, and speaker verification models for MOS prediction, highlights Whisper's effectiveness, and introduces BRSpeechMOS, a Brazilian-Portuguese dataset created for this work.

Findings

01. Whisper models outperform others in MOS prediction across datasets.

02. Whisper-Small achieves a linear correlation of 0.6980 on BRSpeechMOS.

03. SpeakerNet, a speaker verification model with 5M parameters, is suitable for real-time applications.

Abstract

In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models to predict the MOS metric. Our experiments were performed on the VCC2018 dataset and a Brazilian-Portuguese dataset called BRSpeechMOS, which was created for this work. The results show that the Whisper model is appropriate in all scenarios, with both the VCC2018 and BRSpeechMOS datasets. Among the supervised and self-supervised learning models using BRSpeechMOS, Whisper-Small achieved the best linear correlation of 0.6980, and the speaker verification model, SpeakerNet, had a linear correlation of 0.6963. Using VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best…
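The linear correlation figures quoted in the abstract compare predicted MOS values against human-rated MOS values. As a minimal sketch, assuming "linear correlation" refers to the Pearson correlation coefficient (the usual convention in MOS prediction work, though this page does not spell it out), the metric could be computed as follows. The function name `pearson_lcc` and the sample scores are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Sequence


def pearson_lcc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Pearson linear correlation between predicted and reference MOS values."""
    if len(predicted) != len(reference):
        raise ValueError("both sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two utterances")

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n

    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores, made up for illustration only (not results from the paper).
reference_mos = [4.5, 3.0, 2.0, 3.5, 1.5]
predicted_mos = [4.1, 3.2, 2.6, 3.0, 2.0]

lcc = pearson_lcc(predicted_mos, reference_mos)
print(f"LCC = {lcc:.4f}")
```

A value close to 1 means the predictor ranks and scales utterances much as human listeners do. It says nothing about absolute error, so a predictor with a constant offset can still score highly.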

Code & Models

Repositories

freds0/bspeech-mos-prediction (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders