Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

Alexandra Vioni; Georgia Maniati; Nikolaos Ellinas; June Sig Sung; Inchul Hwang; Aimilios Chalamandaris; Pirros Tsiakoulis

arXiv:2211.00342 · cs.SD · May 9, 2023

Open Access

TL;DR

This paper enhances neural TTS MOS prediction by integrating prosodic and linguistic features, leading to improved correlation with human evaluations in high-quality speech synthesis.

Contribution

It introduces a novel approach that incorporates prosodic and linguistic features into MOS prediction models, improving their accuracy over spectral-only methods.

Findings

01. Adding prosodic and linguistic features improves the correlation between predicted MOS and human ratings

02. Prosodic and linguistic features benefit both utterance-level and system-level predictions

03. The enhanced models outperform spectral-only baselines
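The findings distinguish utterance-level from system-level correlation. A minimal sketch of how the two are commonly computed is shown here, assuming Pearson correlation as the metric (the page only says "correlation") and assuming each utterance carries a system identifier. The function names and the toy scores are hypothetical and are not taken from the paper or from SOMOS.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level_scores(system_ids, scores):
    """Average utterance scores per system; returns {system_id: mean score}."""
    grouped = defaultdict(list)
    for sys_id, score in zip(system_ids, scores):
        grouped[sys_id].append(score)
    return {sys_id: sum(v) / len(v) for sys_id, v in grouped.items()}


def mos_correlations(system_ids, true_mos, pred_mos):
    """Return (utterance-level, system-level) Pearson correlations."""
    # Utterance level: correlate the raw per-utterance scores.
    utt_corr = pearson(true_mos, pred_mos)
    # System level: average per system first, then correlate the means.
    true_sys = system_level_scores(system_ids, true_mos)
    pred_sys = system_level_scores(system_ids, pred_mos)
    systems = sorted(true_sys)
    sys_corr = pearson(
        [true_sys[s] for s in systems],
        [pred_sys[s] for s in systems],
    )
    return utt_corr, sys_corr


# Toy data, invented for illustration only.
system_ids = ["A", "A", "B", "B", "C", "C"]
true_mos = [3.0, 3.4, 3.8, 4.0, 4.4, 4.6]
pred_mos = [3.3, 3.1, 3.7, 4.1, 4.2, 4.5]
utt_corr, sys_corr = mos_correlations(system_ids, true_mos, pred_mos)
print(f"utterance-level: {utt_corr:.3f}, system-level: {sys_corr:.3f}")
```

Averaging per system smooths out per-utterance noise, which is why system-level correlation is often higher than utterance-level correlation on the same predictions, as it is on this toy data.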

Abstract

Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet, which use spectral features as input, and SSL-MOS, which relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with…
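To make "phoneme-level F0 and duration features" concrete, here is a rough sketch of one way such features could be derived from a frame-level F0 contour and a phoneme alignment. The abstract does not describe the authors' extraction pipeline, so everything in this example is an assumption: the function name `phoneme_prosody`, the 10 ms frame shift, the convention that unvoiced frames have F0 equal to 0, and the toy alignment.

```python
def phoneme_prosody(f0_frames, boundaries, frame_shift_s=0.01):
    """Aggregate frame-level F0 into per-phoneme (mean F0, duration) pairs.

    f0_frames: F0 value per frame in Hz, with 0 marking unvoiced frames
        (an assumed convention).
    boundaries: list of (phoneme, start_frame, end_frame), end exclusive.
    frame_shift_s: assumed hop between frames, in seconds.
    """
    features = []
    for phoneme, start, end in boundaries:
        # Mean F0 over voiced frames only; 0.0 if the phoneme is unvoiced.
        voiced = [f for f in f0_frames[start:end] if f > 0]
        mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0
        # Duration follows from the number of frames the phoneme spans.
        duration_s = (end - start) * frame_shift_s
        features.append((phoneme, mean_f0, duration_s))
    return features


# Toy contour and alignment, invented for illustration only.
f0 = [0, 0, 120, 124, 128, 0, 0, 110, 114]
alignment = [("s", 0, 2), ("a", 2, 5), ("t", 5, 7), ("o", 7, 9)]
prosody = phoneme_prosody(f0, alignment)
for phoneme, mean_f0, duration_s in prosody:
    print(phoneme, mean_f0, duration_s)
```

In a predictor along the lines the abstract describes, per-phoneme values like these would be supplied as additional inputs alongside the spectral or self-supervised representation. How the authors actually combine them is not stated in this excerpt.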

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Sigmoid Activation · Linear Layer · Max Pooling · Highway Layer · Highway Network · Tanh Activation · Convolution