SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features

Yu-Fei Shi; Yang Ai; Ye-Xin Lu; Hui-Peng Du; Zhen-Hua Ling

arXiv:2411.11232·cs.SD·November 19, 2024

TL;DR

SAMOS is a MOS prediction model that combines semantic representations from a pretrained wav2vec2 with acoustic features from the feature extractor of a pretrained BiVocoder, with the goal of assessing speech naturalness more accurately than existing models.

Contribution

The paper introduces SAMOS, which uses both semantic and acoustic information for MOS prediction. The authors argue that prior models drew on only limited aspects of speech information, which restricted their prediction accuracy.

Findings

01. Outperforms state-of-the-art models on the BVCC dataset

02. Achieves comparable performance on the BC2019 dataset

03. Uses pretrained models for feature extraction

Abstract

Assessing the naturalness of speech using mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum of speech as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations from speech for MOS prediction. These methods utilized limited aspects of speech information for MOS prediction, resulting in restricted prediction accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages both Semantic and Acoustic information of speech to be assessed. Specifically, the proposed SAMOS leverages a pretrained wav2vec2 to extract semantic representations and uses the feature extractor of a pretrained BiVocoder to extract acoustic features. These two types of…
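For readers unfamiliar with the target quantity, a mean opinion score is the arithmetic mean of listener ratings. The following minimal sketch shows only that definition, not the SAMOS model itself, which predicts the score from the speech signal. The 1-to-5 rating scale, the utterance IDs, the rating values, and the function names are all illustrative assumptions and do not come from the paper.

```python
# Minimal sketch of the quantity a MOS prediction model tries to estimate.
# All data and names below are hypothetical; the 1-5 scale is an assumption.

def mean_opinion_score(ratings):
    """Return the arithmetic mean of the listener ratings for one utterance."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


def system_level_mos(utterance_ratings):
    """Average the utterance-level MOS values over all utterances of a system."""
    scores = [mean_opinion_score(r) for r in utterance_ratings.values()]
    return sum(scores) / len(scores)


# Hypothetical listener ratings for two utterances from one synthesis system.
ratings = {
    "utt_001": [4, 5, 4, 3],
    "utt_002": [2, 3, 3, 2],
}

utt_mos = {utt: mean_opinion_score(r) for utt, r in ratings.items()}
sys_mos = system_level_mos(ratings)
print(utt_mos, sys_mos)
```

A prediction model replaces the human ratings with an estimate computed from the audio, so its output can be compared against scores obtained this way.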

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing