Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

Xin Jing; Kun Zhou; Andreas Triantafyllopoulos; Björn W. Schuller

arXiv:2409.06451·cs.SD·September 11, 2024

TL;DR

ParaEVITS is a novel emotional TTS framework that uses contrastive learning and diffusion models to enable fine-grained control over emotional speech rendering based solely on natural language descriptions.

Contribution

The paper introduces a new framework that combines contrastive language-audio pretraining with diffusion models to improve emotional control in TTS.

Findings

01 Effective emotion control without quality loss

02 Manipulation of speech attributes via textual conditioning

03 Publicly available speech demos

Abstract

While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over emotion rendering of the output speech still remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using…
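The contrastive language-audio pretraining that the abstract refers to can be illustrated with a generic CLAP-style objective. The sketch below is not taken from the paper: it assumes paired audio and text embeddings of equal dimension and a symmetric cross-entropy loss over cosine similarities, and the function name `clap_style_contrastive_loss` and the `temperature` default are hypothetical choices made for illustration.

```python
import numpy as np


def clap_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    This is a generic sketch, not the ParaCLAP or ParaEVITS implementation.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix of shape (batch, batch), scaled by the temperature.
    logits = a @ t.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-text and text-to-audio directions.
    audio_to_text = cross_entropy_on_diagonal(logits)
    text_to_audio = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (audio_to_text + text_to_audio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 8))
    text = audio + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
    print("aligned pairs:   ", clap_style_contrastive_loss(audio, text))
    print("misaligned pairs:", clap_style_contrastive_loss(audio, np.roll(text, 1, axis=0)))
```

Under these assumptions, aligned pairs yield a lower loss than misaligned ones, which is the property that would let a text description stand in for a reference audio embedding at inference time.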

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems