PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

Shaozuo Zhang; Ambuj Mehrish; Yingting Li; Soujanya Poria

arXiv:2501.06276 · cs.SD · January 14, 2025

TL;DR

PROEMO introduces a prompt-driven TTS system that enables nuanced emotion and intensity control across multiple speakers, leveraging large language models to enhance expressiveness and variability in synthesized speech.

Contribution

The paper presents a prompt-based architecture for emotion and intensity control in multi-speaker TTS, using LLMs to manipulate prosody while preserving linguistic content.

Findings

01. Effective emotion and intensity control demonstrated

02. Enhanced speech expressiveness and variability achieved

03. Systematic evaluation confirms control mechanisms' effectiveness

Abstract

Speech synthesis has advanced significantly from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis remains challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control…
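
The abstract does not say how emotional cues and intensity levels are combined, so the following is only a minimal sketch of one common way to do it, not the authors' method. It assumes a lookup table of emotion embeddings and a scalar intensity in [0, 1] that interpolates between a neutral vector and the target emotion vector. The table `EMOTION_EMBEDDINGS`, its toy values, and the function `conditioning_vector` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Hypothetical toy embeddings (assumption): a real system would learn these
# vectors and they would be much higher-dimensional.
EMOTION_EMBEDDINGS: Dict[str, List[float]] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, 0.0, 0.5],
}


def conditioning_vector(emotion: str, intensity: float) -> List[float]:
    """Interpolate from the neutral embedding toward the target emotion.

    intensity = 0.0 yields the neutral vector; intensity = 1.0 yields the
    full emotion vector. This linear scheme is an illustrative assumption.
    """
    if emotion not in EMOTION_EMBEDDINGS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    neutral = EMOTION_EMBEDDINGS["neutral"]
    target = EMOTION_EMBEDDINGS[emotion]
    return [n + intensity * (t - n) for n, t in zip(neutral, target)]
```

In a TTS pipeline, a vector like this would typically be added to or concatenated with the text encoder output before decoding, so that a single scalar steers how strongly the emotion is expressed.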
