Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models
Xin Jing, Kun Zhou, Andreas Triantafyllopoulos, Björn W. Schuller

TL;DR
ParaEVITS is a novel emotional TTS framework that uses contrastive learning and diffusion models to enable fine-grained control over emotional speech rendering based solely on natural language descriptions.
Contribution
It introduces a new framework combining contrastive language-audio pretraining with diffusion models for improved emotional control in TTS.
Findings
Effective emotion control without quality loss
Manipulation of speech attributes via textual conditioning
Publicly available speech demos
Abstract
While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine-grained control over the emotional rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using…
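The abstract refers to contrastive language-audio pretraining (CLAP) without spelling out the objective. As a rough illustration only, the sketch below shows the symmetric contrastive loss commonly used in CLIP/CLAP-style training, where matching audio and text embeddings in a batch are pulled together and mismatched pairs are pushed apart. The function names, the temperature value of 0.07, and the use of NumPy are assumptions made for this example; this is not the ParaCLAP or ParaEVITS training code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: arrays of shape (batch, dim); row i of each is a matching pair.
    # The temperature is an assumed illustrative value, not one taken from the paper.
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each row should put its mass on the diagonal entry.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: same idea along columns.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

With this objective, a batch whose audio and text rows are correctly paired should yield a lower loss than the same batch with the text rows shuffled, which is what lets a text description stand in for a reference audio embedding at inference time.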
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
