Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech
Rui Liu, Bin Liu, Haizhou Li

TL;DR
This paper introduces EmoPP, an emotion-aware prosodic phrasing model for expressive TTS, which accurately captures emotional cues to improve naturalness and expressiveness in synthesized speech.
Contribution
The study proposes a novel emotion-aware prosodic phrasing model, EmoPP, that effectively mines emotional cues to enhance expressive speech synthesis.
Findings
EmoPP outperforms all baselines in both objective and subjective evaluations.
Objective observations on the ESD dataset validate a strong correlation between emotion and prosodic phrasing.
Enhanced emotion expressiveness achieved in TTS with EmoPP.
Abstract
Prosodic phrasing is crucial to the naturalness and intelligibility of end-to-end Text-to-Speech (TTS). Natural speech carries both linguistic and emotional prosody. Because the study of prosodic phrasing has been linguistically motivated, prosodic phrasing for expressive emotion rendering has not been well studied. In this paper, we propose an emotion-aware prosodic phrasing model, termed EmoPP, to accurately mine the emotional cues of an utterance and predict appropriate phrase breaks. We first conduct objective observations on the ESD dataset to validate the strong correlation between emotion and prosodic phrasing. Objective and subjective evaluations then show that EmoPP outperforms all baselines and achieves remarkable performance in terms of emotion expressiveness. The audio samples and the code are available at https://github.com/AI-S2-Lab/EmoPP.
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Dialogue Systems
