A Comparative Analysis of Pretrained Language Models for Text-to-Speech
Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

TL;DR
This study systematically compares various pretrained language models for text-to-speech tasks, revealing how model size and type influence prosody and pause prediction quality, and establishing correlations with language understanding benchmarks.
Contribution
The paper presents the first comprehensive analysis of how different pretrained language models (PLMs) affect TTS performance, focusing on two tasks: prosody prediction and pause prediction.
Findings
Logarithmic relationship between model size and prosody prediction quality
Significant differences in prosody performance between neutral and expressive speech
Pause prediction is less sensitive to the use of small models, and the empirical results correlate strongly with GLUE scores
Abstract
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for…
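One way to make the reported logarithmic relationship concrete is to fit a curve of the form score = a + b * ln(parameter count) to a set of (model size, quality score) pairs. The sketch below is illustrative only: the function name fit_log_trend, the model sizes, and the scores are all assumptions made for this example, and the data are synthetic placeholders rather than results from the paper. It is not the authors' evaluation procedure.

```python
import numpy as np


def fit_log_trend(param_counts, scores):
    """Least-squares fit of score = a + b * ln(param_count).

    Returns the intercept a and slope b. Hypothetical helper for
    illustration; not taken from the paper.
    """
    x = np.log(np.asarray(param_counts, dtype=float))
    y = np.asarray(scores, dtype=float)
    # A degree-1 polynomial fit in ln(size) is a logarithmic fit in size.
    b, a = np.polyfit(x, y, deg=1)
    return a, b


# Synthetic placeholder data, NOT measurements from the paper:
# scores are generated from an exact log law so the fit can be checked.
sizes = np.array([4e6, 1.4e7, 6.6e7, 1.1e8, 3.4e8])
scores = 0.5 + 0.1 * np.log(sizes)

a, b = fit_log_trend(sizes, scores)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

Under such a fit, each doubling of model size adds a constant b * ln(2) to the predicted score, which is what diminishing returns from larger models looks like in practice.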
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
