PRESENT: Zero-Shot Text-to-Prosody Control

Perry Lam; Huayun Zhang; Nancy F. Chen; Berrak Sisman; Dorien Herremans

arXiv:2408.06827 · eess.AS · January 8, 2025

TL;DR

PRESENT is a zero-shot prosody control method for TTS that modifies the inference process of a pretrained model instead of adding style embeddings or further training. It enables cross-lingual transfer, including to tonal languages, and subphoneme-level prosody control.

Contribution

The paper presents an inference-time approach to zero-shot prosody control that requires neither additional training nor style embeddings, extending a pretrained TTS model to new languages and to finer (subphoneme) granularity.
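To make "modifying inference" concrete: FastSpeech2-style models predict a duration for each phoneme explicitly, and a length regulator then repeats each phoneme's representation for that many frames. The toy sketch below edits those predicted durations before expansion. It is an illustration only and is not taken from the PRESENT repository; `length_regulate` and `scale` are hypothetical names, and phoneme symbols stand in for the hidden vectors a real model would expand.

```python
from typing import List, Sequence


def length_regulate(
    phonemes: Sequence[str],
    durations: Sequence[float],
    scale: Sequence[float],
) -> List[str]:
    """Expand phonemes to frames after rescaling each predicted duration.

    Hypothetical helper: a real FastSpeech2-style model expands hidden
    vectors, not symbols, and its durations come from a duration predictor.
    """
    if not (len(phonemes) == len(durations) == len(scale)):
        raise ValueError("phonemes, durations and scale must have equal length")
    frames: List[str] = []
    for phoneme, duration, factor in zip(phonemes, durations, scale):
        # The edit happens here: the prediction is rescaled, no retraining needed.
        n_frames = max(0, round(duration * factor))
        frames.extend([phoneme] * n_frames)
    return frames


# Lengthen only the final phoneme by 50%, leaving the others untouched.
frames = length_regulate(["h", "ə", "l", "oʊ"], [3, 4, 2, 6], [1.0, 1.0, 1.0, 1.5])
print(len(frames))  # 18
```

The same idea would apply to the other explicitly predicted prosody values (pitch and energy): because they are intermediate outputs, they can be inspected and altered at inference time.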

Findings

01. Achieves CERs of 12.8% (German), 18.7% (Hungarian), and 5.9% (Spanish) with a JETS model trained only on English LJSpeech data, more than 2x lower than the previous state of the art in each language.

02. Enables subphoneme-level prosody control, improving question intonation and tonal language synthesis.

03. Demonstrates effective zero-shot transfer to Mandarin with low CERs.

Abstract

Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model exclusively trained on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7% and 5.9% for German, Hungarian and Spanish respectively, beating the previous state-of-the-art CER by over 2x for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve…
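For readers unfamiliar with the metric quoted above, character error rate is conventionally the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below shows that standard definition; it is not the paper's evaluation script, and the function names and example strings are made up for illustration. Text normalization and the ASR system used to transcribe the synthesized speech are outside its scope.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters so far
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of nine reference characters.
print(f"{cer('guten tag', 'guten tak'):.3f}")  # 0.111
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.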

Code & Models

Repositories

iamanigeeit/present (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis