A Comparative Analysis of Pretrained Language Models for Text-to-Speech

Marcel Granero-Moya; Penny Karanasou; Sri Karlapati; Bastian Schnell; Nicole Peinelt; Alexis Moinet; Thomas Drugman

arXiv:2309.01576·cs.CL·September 6, 2023

Open Access

TL;DR

This study systematically compares various pretrained language models for text-to-speech tasks, revealing how model size and type influence prosody and pause prediction quality, and establishing correlations with language understanding benchmarks.

Contribution

The paper offers the first comprehensive analysis of how different pretrained language models (PLMs) affect TTS performance, focusing on prosody prediction and pause prediction.

Findings

01 Logarithmic relationship between model size and prosody prediction quality

02 Significant differences in prosody performance between neutral and expressive speech

03 Pause prediction is less sensitive to small models, and its results correlate with GLUE scores
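
To make the first finding concrete, here is a minimal sketch of what a logarithmic relationship between model size and quality looks like when fitted. It assumes a model of the form score = a + b * ln(size); the function name `fit_log_curve`, the parameter counts, and the scores are all hypothetical placeholders chosen for illustration and are not taken from the paper.

```python
import math


def fit_log_curve(sizes, scores):
    """Ordinary least-squares fit of score = a + b * ln(size)."""
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic placeholder data, NOT results from the paper:
# parameter counts and scores generated from a known log curve.
sizes = [4e6, 1.1e7, 2.9e7, 1.1e8, 3.4e8]
scores = [2.0 + 0.5 * math.log(s) for s in sizes]

a, b = fit_log_curve(sizes, scores)

# Under a log relationship, every doubling of model size
# adds the same fixed amount, b * ln(2), to the score.
gain_per_doubling = b * math.log(2)
```

Under this assumed form, going from a tiny model to a small one buys as much as going from a large model to one twice its size, which is one way to read diminishing returns from scaling.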

Abstract

State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis