TL;DR
This paper introduces a two-stage prompt selection method for zero-shot speech synthesis that enhances emotional intensity and speaker consistency by evaluating prompts with prosodic, perceptual, and semantic metrics.
Contribution
The proposed static and dynamic prompt selection strategy improves expressive speech synthesis by ensuring stable speaker identity and appropriate emotional cues in zero-shot TTS.
Findings
Enhanced emotional expression in synthesized speech.
Improved speaker similarity and stability.
Effective prompt selection demonstrated through experiments.
Abstract
Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models heavily rely on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, which are crucial for expressive speech synthesis. To address this challenge, we propose a two-stage prompt selection strategy specifically designed for expressive speech synthesis. In the static stage (before synthesis), we first evaluate prompt candidates using pitch-based prosodic features, perceptual audio quality, and text-emotion coherence scores evaluated by an LLM. We further assess the candidates under a…
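To make the static stage more concrete, here is a minimal sketch of ranking prompt candidates from the three kinds of score the abstract names. The abstract does not say how the scores are computed or combined, so everything here is an assumption for illustration: the `PromptCandidate` fields, the min-max normalization, the equal weights, and the weighted sum are hypothetical and should not be read as the authors' method.

```python
from dataclasses import dataclass


@dataclass
class PromptCandidate:
    name: str
    pitch_score: float      # hypothetical pitch-based prosodic score
    quality_score: float    # hypothetical perceptual audio quality score
    coherence_score: float  # hypothetical LLM-rated text-emotion coherence


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_candidates(candidates, weights=(1.0, 1.0, 1.0)):
    """Return (name, combined score) pairs, best first.

    Assumption: the combined score is a weighted sum of normalized metrics.
    """
    if not candidates:
        return []
    columns = [
        min_max([c.pitch_score for c in candidates]),
        min_max([c.quality_score for c in candidates]),
        min_max([c.coherence_score for c in candidates]),
    ]
    totals = [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(candidates))
    ]
    order = sorted(range(len(candidates)), key=lambda i: totals[i], reverse=True)
    return [(candidates[i].name, totals[i]) for i in order]


if __name__ == "__main__":
    pool = [
        PromptCandidate("a", 0.9, 4.2, 0.8),
        PromptCandidate("b", 0.5, 3.0, 0.4),
        PromptCandidate("c", 0.7, 4.5, 0.9),
    ]
    for name, score in rank_candidates(pool):
        print(name, round(score, 3))
```

Normalizing each metric before summing keeps a metric with a larger numeric range from dominating the ranking; the weights are the natural place to express a preference for, say, emotional coherence over audio quality.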
