PRESENT: Zero-Shot Text-to-Prosody Control
Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

TL;DR
PRESENT is a zero-shot prosody control method for TTS that modifies the inference process directly, with no extra style embeddings or additional training. It enables cross-lingual transfer and subphoneme-level control, including for tonal languages.
Contribution
The paper presents an inference-time approach to zero-shot prosody control that requires neither additional training nor style embeddings, extending a pretrained TTS model to new languages and to finer, subphoneme-level control.
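To make the idea of editing prosody at inference time concrete, here is a minimal sketch. It assumes a FastSpeech2-style pipeline in which per-phoneme durations and pitch are predicted explicitly and then expanded to frame level by a length regulator. The function names (`edit_prosody`, `length_regulate`), the scaling and shifting scheme, and the one-frame minimum are illustrative assumptions, not the paper's actual procedure.

```python
from typing import List, Sequence, Tuple


def edit_prosody(
    durations: Sequence[int],
    pitch: Sequence[float],
    duration_scale: Sequence[float],
    pitch_shift: Sequence[float],
) -> Tuple[List[int], List[float]]:
    """Apply per-phoneme edits to predicted prosody values.

    Hypothetical helper: scales each predicted duration and shifts each
    predicted pitch value. Durations are clamped to at least one frame
    (an assumption made here so that no phoneme disappears).
    """
    if not (len(durations) == len(pitch) == len(duration_scale) == len(pitch_shift)):
        raise ValueError("all inputs must have one entry per phoneme")
    new_durations = [max(1, round(d * s)) for d, s in zip(durations, duration_scale)]
    new_pitch = [p + delta for p, delta in zip(pitch, pitch_shift)]
    return new_durations, new_pitch


def length_regulate(durations: Sequence[int], values: Sequence[float]) -> List[float]:
    """Repeat each phoneme-level value for its number of frames."""
    frames: List[float] = []
    for n_frames, value in zip(durations, values):
        frames.extend([value] * n_frames)
    return frames


# Made-up predictions for three phonemes (frames, normalized pitch).
predicted_durations = [3, 5, 4]
predicted_pitch = [0.0, 0.2, -0.1]

# Lengthen the second phoneme and raise pitch on the last one.
durations, pitch = edit_prosody(
    predicted_durations,
    predicted_pitch,
    duration_scale=[1.0, 2.0, 1.0],
    pitch_shift=[0.0, 0.0, 0.5],
)
frame_pitch = length_regulate(durations, pitch)
```

Because the edits act only on predicted values, the model weights stay untouched, which is what makes this kind of control compatible with a pretrained model.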
Findings
Achieves CERs of 12.8%, 18.7%, and 5.9% for German, Hungarian, and Spanish respectively, improving on the previous state-of-the-art CER by more than 2x in all three languages.
Enables subphoneme-level prosody control, improving question intonation and tonal language synthesis.
Demonstrates effective zero-shot transfer to Mandarin with low CERs.
Abstract
Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model exclusively trained on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7% and 5.9% for German, Hungarian and Spanish respectively, beating the previous state-of-the-art CER by over 2x for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve…
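For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between a reference text and a hypothesis transcript, divided by the reference length. The sketch below uses that standard definition. How the paper obtains its transcripts and whether it normalizes text first is not stated in this summary, so the example strings and the absence of normalization are assumptions.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only, not data from the paper.
score = cer("guten tag", "guten tak")  # one substitution over nine characters
```

Note that CER can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.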
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
