Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features
Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

TL;DR
This paper enhances neural TTS MOS prediction by integrating prosodic and linguistic features, leading to improved correlation with human evaluations in high-quality speech synthesis.
Contribution
The paper incorporates prosodic and linguistic features into MOS prediction models, improving their accuracy over spectral-only methods.
Findings
Additional features improve MOS prediction correlation
Prosodic and linguistic features benefit both utterance and system-level predictions
Enhanced models outperform spectral-only baselines
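The distinction between utterance-level and system-level correlation can be illustrated with a minimal sketch. It assumes Pearson correlation as the metric and uses invented toy scores; the function names, system IDs, and numbers are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def system_level(records):
    """Average human and predicted scores per system; return the two lists of means."""
    by_system = defaultdict(list)
    for system_id, true_mos, pred_mos in records:
        by_system[system_id].append((true_mos, pred_mos))
    true_means = [mean(t for t, _ in v) for v in by_system.values()]
    pred_means = [mean(p for _, p in v) for v in by_system.values()]
    return true_means, pred_means


# Hypothetical toy data: (system_id, human MOS, predicted MOS) per utterance.
records = [
    ("sysA", 4.0, 3.8), ("sysA", 3.0, 3.6),
    ("sysB", 2.0, 2.5), ("sysB", 3.0, 2.3),
    ("sysC", 5.0, 4.4), ("sysC", 4.0, 4.6),
]

# Utterance level: correlate scores one utterance at a time.
utt_r = pearson([r[1] for r in records], [r[2] for r in records])
# System level: correlate per-system averages.
sys_r = pearson(*system_level(records))
print(f"utterance-level r = {utt_r:.3f}, system-level r = {sys_r:.3f}")
```

In this toy data, averaging over each system's utterances smooths out per-utterance errors, so the system-level correlation comes out higher than the utterance-level one.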
Abstract
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with…
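As an illustration of what phoneme-level F0 and duration features might look like, the sketch below pools a frame-level F0 contour over phoneme boundaries. The abstract only names these features; the pooling scheme (mean over voiced frames), the convention that 0.0 marks unvoiced frames, and the function name are assumptions made here for illustration.

```python
def phoneme_level_prosody(f0_frames, durations):
    """Return (mean voiced F0 in Hz, duration in frames) for each phoneme.

    f0_frames: per-frame F0 values, with 0.0 marking unvoiced frames (assumed convention).
    durations: number of frames aligned to each phoneme (assumed to come from an aligner).
    """
    if sum(durations) != len(f0_frames):
        raise ValueError("durations must sum to the number of F0 frames")
    features, start = [], 0
    for dur in durations:
        # Keep only voiced frames inside this phoneme's span.
        voiced = [f for f in f0_frames[start:start + dur] if f > 0.0]
        mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0
        features.append((mean_f0, dur))
        start += dur
    return features


# Hypothetical contour: 8 frames aligned to 3 phonemes.
f0 = [0.0, 0.0, 120.0, 130.0, 140.0, 0.0, 200.0, 210.0]
durs = [2, 3, 3]
feats = phoneme_level_prosody(f0, durs)
print(feats)
```

A fully unvoiced phoneme gets an F0 of 0.0 here; a real pipeline might interpolate or use a separate voicing flag instead.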
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Sigmoid Activation · Linear Layer · Max Pooling · Highway Layer · Highway Network · Tanh Activation · Convolution
