Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions

Xiaoxue Gao; Huayun Zhang; Nancy F. Chen

arXiv:2506.02742 · eess.AS · June 4, 2025

TL;DR

This paper introduces a zero-shot expressive speech synthesis method that uses prompt-guided learning and large language model knowledge to generate diverse and unseen emotional speech styles, enhancing naturalness in human-like interactions.

Contribution

It presents a novel prompt-unseen-emotion (PUE) approach that enables zero-shot generation of diverse emotional speech by leveraging LLM-guided prompt learning and emotion proportion adjustments.

Findings

01 Successfully synthesizes unseen emotional speech in zero-shot settings.
02 Allows flexible control of mixed emotional proportions during inference.
03 Demonstrates improved emotional expressiveness over baseline models.

Abstract

Existing expressive text-to-speech (TTS) systems primarily model a limited set of categorical emotions, whereas human conversations extend far beyond these predefined emotions, making it essential to explore more diverse emotional speech generation for more natural interactions. To bridge this gap, this paper proposes a novel prompt-unseen-emotion (PUE) approach to generate unseen emotional speech via emotion-guided prompt learning. PUE is trained utilizing an LLM-TTS architecture to ensure emotional consistency between categorical emotion-relevant prompts and emotional speech, allowing the model to quantitatively capture different emotion weightings per utterance. During inference, mixed emotional speech can be generated by flexibly adjusting emotion proportions and leveraging LLM contextual knowledge, enabling the model to quantify different emotional styles. Our proposed PUE…
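
To make the idea of adjusting emotion proportions at inference concrete, here is a minimal sketch of how user-supplied emotion weights could be normalized and rendered as a text prompt. It is an illustration only: the function names `normalize_proportions` and `build_emotion_prompt` and the prompt wording are assumptions made for this example, not the paper's prompt template or code.

```python
def normalize_proportions(weights):
    """Scale non-negative emotion weights so that they sum to 1."""
    if not weights:
        raise ValueError("at least one emotion is required")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {emotion: w / total for emotion, w in weights.items()}


def build_emotion_prompt(weights):
    """Render emotion proportions as a text prompt.

    The wording is hypothetical; a real system would use the prompt
    format its model was trained on.
    """
    proportions = normalize_proportions(weights)
    # List the dominant emotion first and skip zero-weight emotions.
    ranked = sorted(proportions.items(), key=lambda kv: kv[1], reverse=True)
    parts = [f"{round(p * 100)}% {emotion}" for emotion, p in ranked if p > 0]
    return "Speak with " + " and ".join(parts) + "."


prompt = build_emotion_prompt({"happy": 3, "surprise": 1})
print(prompt)  # Speak with 75% happy and 25% surprise.
```

Normalizing first means the caller can pass raw weights in any scale (3 and 1, or 0.75 and 0.25) and get the same prompt.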

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training